Reliable Benchmarking

Benchmark results can vary significantly due to system-level factors. This guide covers best practices for obtaining reproducible and reliable measurements.

What ZeroPyBench Does Automatically

zeropybench already implements several best practices:

  • Multiple repetitions with median: Reduces the impact of outliers

  • Auto-scaling: Automatically determines the number of iterations for reliable measurements

  • JAX compilation separation: Reports compilation time separately from execution time

  • Proper synchronization: Uses block_until_ready() for accurate JAX timing

CPU Benchmarking

Disable Frequency Scaling

Modern CPUs dynamically adjust their frequency based on load and temperature. This can cause significant variance in benchmark results.

# Set the CPU governor to performance mode (requires root)
sudo cpupower frequency-set -g performance

# Verify the setting
cpupower frequency-info

To revert to the default:

sudo cpupower frequency-set -g powersave  # or ondemand

Disable Turbo Boost

Turbo boost can cause inconsistent results as the CPU may throttle under sustained load.

Intel CPUs:

# Disable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Re-enable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

AMD CPUs:

# Disable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

# Re-enable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/cpufreq/boost

Warning

These settings require root privileges and will be reset after reboot unless made persistent.

CPU Isolation

Isolate CPU cores to prevent the OS scheduler from interrupting your benchmark.

Runtime isolation with taskset:

# Run on cores 0-3 only
taskset -c 0-3 python benchmark.py

Boot-time isolation (more effective):

Warning

Modifying GRUB parameters incorrectly can prevent your system from booting. Always keep a backup boot option.

Add to your kernel boot parameters in /etc/default/grub:

GRUB_CMDLINE_LINUX="isolcpus=0-3 nohz_full=0-3"

Then update GRUB and reboot:

sudo update-grub
sudo reboot

Process Priority

Increase the priority of your benchmark process:

# Run with highest priority (requires root)
sudo nice -n -20 python benchmark.py

# Or with real-time scheduling
sudo chrt -f 99 python benchmark.py

Disable Hyperthreading

Hyperthreading can introduce variability. Disable it in BIOS or at runtime:

# Disable hyperthreading (example for 8 physical cores with HT)
echo 0 | sudo tee /sys/devices/system/cpu/cpu{8..15}/online

GPU Benchmarking (NVIDIA)

Enable Persistence Mode

Keeps the GPU initialized between runs, reducing startup overhead:

sudo nvidia-smi -pm 1

Lock GPU Clocks

Prevent dynamic frequency scaling on the GPU:

# Query supported clocks
nvidia-smi -q -d SUPPORTED_CLOCKS

# Lock graphics clocks (example: 1500 MHz)
sudo nvidia-smi -lgc 1500,1500

# Lock memory clocks (example: 5001 MHz)
sudo nvidia-smi -lmc 5001

# Reset to default
sudo nvidia-smi -rgc
sudo nvidia-smi -rmc

Exclusive Process Mode

Ensure only one process can use the GPU:

# Set exclusive process mode
sudo nvidia-smi -c EXCLUSIVE_PROCESS

# Reset to default (shared mode)
sudo nvidia-smi -c DEFAULT

Disable ECC Memory (Optional)

On GPUs with ECC memory, disabling it can provide ~10% more memory bandwidth. This is a persistent setting that requires a reboot:

# Check current ECC status
nvidia-smi -q | grep -i ecc

# Disable ECC (requires reboot)
sudo nvidia-smi -e 0

Monitor GPU State

Before running benchmarks, verify GPU state:

# Check temperatures, clocks, and utilization
nvidia-smi -q -d PERFORMANCE

# Monitor in real-time
watch -n 1 nvidia-smi

XLA GPU Autotuning

When benchmarking JAX code on GPU, XLA may spend a prohibitive time autotuning kernels (matrix multiplications, convolutions, etc.) during compilation. This autotuning overhead happens for every compilation, so when comparing methods where only some benefit from autotuning, the compilation times may not be comparable.

The autotune level maps to the --xla_gpu_autotune_level XLA flag:

  • 0: No autotuning. Fastest compilation, may use suboptimal kernels.

  • 1–4: Increasing autotuning effort. Higher levels try more algorithm variants, increasing compilation time but potentially finding faster kernels.

Important

XLA reads the XLA_FLAGS environment variable once, when the JAX backend is initialized (typically at the first JAX operation). Changing it at runtime has no effect. It must be set before importing JAX:

import os
os.environ['XLA_FLAGS'] = '--xla_gpu_autotune_level=0'

import jax  # backend initialization reads XLA_FLAGS here

Or from the shell:

XLA_FLAGS=--xla_gpu_autotune_level=0 python benchmark.py

Environment Variables

JAX-specific

# Disable JAX memory preallocation (useful for memory profiling)
export XLA_PYTHON_CLIENT_PREALLOCATE=false

# Set specific GPU
export CUDA_VISIBLE_DEVICES=0

# Disable JAX compilation cache (for cold-start benchmarks)
export JAX_ENABLE_COMPILATION_CACHE=false

General Python

# Disable Python's hash randomization for reproducibility
export PYTHONHASHSEED=0

Quick Setup Script

Here’s a script that applies common optimizations:

#!/bin/bash
# setup_benchmark_env.sh - Run as root

set -e

echo "Setting up benchmark environment..."

# CPU optimizations
cpupower frequency-set -g performance
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || \
echo 0 > /sys/devices/system/cpu/cpufreq/boost 2>/dev/null || \
echo "Could not disable turbo boost"

# GPU optimizations (if NVIDIA GPU present)
if command -v nvidia-smi &> /dev/null; then
    nvidia-smi -pm 1
    # Optionally lock clocks here
fi

echo "Benchmark environment ready."
echo "Run your benchmark with: taskset -c 0-3 nice -n -20 python benchmark.py"

Verification Checklist

Before running benchmarks, verify:

  • CPU governor is set to performance

  • Turbo boost is disabled

  • GPU clocks are locked (for GPU benchmarks)

  • No other intensive processes are running

  • System temperature is stable

  • Sufficient warm-up iterations have been run

Interpreting Results

Tip

Use verbose=True to inspect the actual code being benchmarked and verify it matches your expectations.

Even with all optimizations, some variance is expected:

  • < 1% variance: Excellent, highly reproducible

  • 1-5% variance: Good, typical for well-controlled environments

  • 5-10% variance: Acceptable, may indicate some system noise

  • > 10% variance: Investigate system configuration

zeropybench reports the interquartile range (IQR) as a percentage, which helps identify unstable measurements.