Reliable Benchmarking¶
Benchmark results can vary significantly due to system-level factors. This guide covers best practices for obtaining reproducible and reliable measurements.
What ZeroPyBench Does Automatically¶
zeropybench already implements several best practices:
Multiple repetitions with median: Reduces the impact of outliers
Auto-scaling: Automatically determines the number of iterations for reliable measurements
JAX compilation separation: Reports compilation time separately from execution time
Proper synchronization: Uses block_until_ready() for accurate JAX timing
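To make the first two points concrete, here is a minimal sketch of auto-scaling plus median-of-repetitions in plain Python. The `measure` helper, its parameters, and its defaults are hypothetical and chosen for illustration; this is not zeropybench's implementation or API.

```python
import statistics
import time


def measure(fn, repeats=7, min_batch_seconds=0.01):
    """Return the median per-call time of fn() in seconds.

    Hypothetical helper for illustration; not part of zeropybench.
    """
    # Auto-scale: double the batch size until one batch runs long enough
    # that timer resolution and loop overhead stop dominating.
    iterations = 1
    while True:
        start = time.perf_counter()
        for _ in range(iterations):
            fn()
        elapsed = time.perf_counter() - start
        if elapsed >= min_batch_seconds:
            break
        iterations *= 2

    # Repeat the batch and record the per-call time of each repetition.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(iterations):
            fn()
        samples.append((time.perf_counter() - start) / iterations)

    # The median is robust to the occasional slow repetition.
    return statistics.median(samples)


per_call = measure(lambda: sum(range(1000)))
print(f"median per-call time: {per_call * 1e6:.2f} us")
```

The auto-scaling loop doubles as a warm-up, since the function has already run many times before the first recorded sample.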
CPU Benchmarking¶
Disable Frequency Scaling¶
Modern CPUs dynamically adjust their frequency based on load and temperature. This can cause significant variance in benchmark results.
# Set the CPU governor to performance mode (requires root)
sudo cpupower frequency-set -g performance
# Verify the setting
cpupower frequency-info
To revert to the default:
sudo cpupower frequency-set -g powersave # or ondemand
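If you prefer to check the governor from the benchmark script itself, the kernel exposes it per CPU through sysfs. The following is a sketch assuming the standard Linux cpufreq layout (`cpuN/cpufreq/scaling_governor`); `read_governors` is a hypothetical helper name, and the function returns an empty mapping on systems that do not expose cpufreq (containers, VMs, non-Linux hosts).

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its scaling governor."""
    governors = {}
    for path in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        try:
            governors[path.parent.parent.name] = path.read_text().strip()
        except OSError:
            continue  # CPU went offline or the file is not readable
    return governors


governors = read_governors()
not_performance = {cpu: gov for cpu, gov in governors.items() if gov != "performance"}
if not governors:
    print("cpufreq information is not exposed on this system")
elif not_performance:
    print(f"CPUs not in performance mode: {not_performance}")
else:
    print("all CPUs use the performance governor")
```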
Disable Turbo Boost¶
Turbo boost raises the clock speed opportunistically and backs off as the CPU heats up under sustained load, so timings can drift over the course of a run.
Intel CPUs:
# Disable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Re-enable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
AMD CPUs:
# Disable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# Re-enable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/cpufreq/boost
Warning
These settings require root privileges and will be reset after reboot unless made persistent.
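Because the setting is lost on reboot, it can be worth checking it before each benchmark session. This sketch reads the same two sysfs files as the commands above; `turbo_boost_enabled` is a hypothetical helper, and it returns `None` when neither file is present (other drivers, VMs, non-Linux hosts).

```python
from pathlib import Path


def turbo_boost_enabled(cpu_root="/sys/devices/system/cpu"):
    """Return True/False for the turbo state, or None if it cannot be read."""
    root = Path(cpu_root)
    intel = root / "intel_pstate" / "no_turbo"  # "1" means turbo is disabled
    generic = root / "cpufreq" / "boost"        # "1" means boost is enabled
    try:
        if intel.exists():
            return intel.read_text().strip() == "0"
        if generic.exists():
            return generic.read_text().strip() == "1"
    except OSError:
        pass
    return None


state = turbo_boost_enabled()
print({True: "turbo boost is ON", False: "turbo boost is off", None: "turbo state unknown"}[state])
```

Note the inverted meaning of the two files: `no_turbo` holds 1 when boosting is off, while `boost` holds 1 when boosting is on.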
CPU Isolation¶
Isolate CPU cores to prevent the OS scheduler from interrupting your benchmark.
Runtime isolation with taskset:
# Run on cores 0-3 only
taskset -c 0-3 python benchmark.py
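The same pinning can be done from inside the benchmark script with the standard library, which is convenient when you cannot control how the script is launched. This is a Linux-only sketch; `pin_to_cpus` is a hypothetical helper that silently drops CPU ids the process is not allowed to use (for example inside a container with a restricted cpuset).

```python
import os


def pin_to_cpus(cpus):
    """Restrict the current process to the given CPU ids (Linux only).

    Returns the affinity set that is in effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise RuntimeError("CPU affinity control is not available on this platform")
    allowed = os.sched_getaffinity(0)  # 0 means the calling process
    requested = set(cpus) & allowed    # ignore CPUs we may not use
    if not requested:
        raise ValueError(f"none of {sorted(cpus)} is in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, requested)
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    # Pin to (at most) the first four CPUs this process may run on.
    first_four = sorted(os.sched_getaffinity(0))[:4]
    print("pinned to:", sorted(pin_to_cpus(first_four)))
```

Pin before creating worker threads or importing libraries that start thread pools, so that they inherit the restricted affinity.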
Boot-time isolation (more effective):
Warning
Modifying GRUB parameters incorrectly can prevent your system from booting. Always keep a backup boot option.
Add to your kernel boot parameters in /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=0-3 nohz_full=0-3"
Then update GRUB and reboot:
sudo update-grub
sudo reboot
Process Priority¶
Increase the priority of your benchmark process:
# Run with highest priority (requires root)
sudo nice -n -20 python benchmark.py
# Or with real-time scheduling
sudo chrt -f 99 python benchmark.py
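A script can also request a higher priority for itself and fall back gracefully when it lacks the privileges. The sketch below uses `os.setpriority` (Unix only); `raise_priority` is a hypothetical helper name, and it only covers niceness, not the real-time scheduling that `chrt` provides.

```python
import os


def raise_priority(niceness=-20):
    """Try to set the niceness of this process; return the value in effect."""
    try:
        os.setpriority(os.PRIO_PROCESS, 0, niceness)  # 0 means this process
    except PermissionError:
        # Lowering niceness normally needs root (or CAP_SYS_NICE);
        # keep the current priority in that case.
        pass
    return os.getpriority(os.PRIO_PROCESS, 0)


print("niceness in effect:", raise_priority())
```

Recording the returned value alongside the results makes it clear afterwards whether a run actually had elevated priority.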
Disable Hyperthreading¶
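Hyperthreading (SMT) makes two logical CPUs share the execution resources of one physical core, so timings depend on what the sibling thread is doing. Disable it in the BIOS or at runtime:
# Disable hyperthreading (example for 8 physical cores with HT, assuming logical CPUs 8-15 are the sibling threads)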
echo 0 | sudo tee /sys/devices/system/cpu/cpu{8..15}/online
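Which logical CPUs are sibling threads depends on the machine, so the `8..15` range should be verified rather than assumed. The sketch below reads the kernel's `topology/thread_siblings_list` files to list the logical CPUs that are not the first thread of their core; `parse_cpu_list` and `secondary_siblings` are hypothetical helper names.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,8' or '0-1' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def secondary_siblings(cpu_root="/sys/devices/system/cpu"):
    """Return the CPU ids that are not the first thread of their core."""
    secondary = set()
    for path in Path(cpu_root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        try:
            siblings = sorted(parse_cpu_list(path.read_text()))
        except OSError:
            continue
        secondary.update(siblings[1:])  # keep the lowest id of each core online
    return sorted(secondary)


print("logical CPUs to take offline:", secondary_siblings())
```

An empty list means that SMT is already off, that the CPU has no SMT, or that the topology files are not exposed.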
GPU Benchmarking (NVIDIA)¶
Enable Persistence Mode¶
Persistence mode keeps the GPU initialized between runs, reducing startup overhead:
sudo nvidia-smi -pm 1
Lock GPU Clocks¶
Prevent dynamic frequency scaling on the GPU:
# Query supported clocks
nvidia-smi -q -d SUPPORTED_CLOCKS
# Lock graphics clocks (example: 1500 MHz)
sudo nvidia-smi -lgc 1500,1500
# Lock memory clocks (example: 5001 MHz)
sudo nvidia-smi -lmc 5001
# Reset to default
sudo nvidia-smi -rgc
sudo nvidia-smi -rmc
Exclusive Process Mode¶
Ensure only one process can use the GPU:
# Set exclusive process mode
sudo nvidia-smi -c EXCLUSIVE_PROCESS
# Reset to default (shared mode)
sudo nvidia-smi -c DEFAULT
Disable ECC Memory (Optional)¶
On GPUs with ECC memory, disabling it can provide ~10% more memory bandwidth. This is a persistent setting that requires a reboot:
# Check current ECC status
nvidia-smi -q | grep -i ecc
# Disable ECC (requires reboot)
sudo nvidia-smi -e 0
Monitor GPU State¶
Before running benchmarks, verify GPU state:
# Check temperatures, clocks, and utilization
nvidia-smi -q -d PERFORMANCE
# Monitor in real-time
watch -n 1 nvidia-smi
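To record the GPU state next to each benchmark result rather than eyeballing it, `nvidia-smi` can emit machine-readable CSV through `--query-gpu`. The sketch below is one way to wrap it; `parse_gpu_csv` and `query_gpus` are hypothetical helpers, the chosen field list is only an example, and the function returns an empty list when `nvidia-smi` is missing or fails.

```python
import csv
import io
import shutil
import subprocess

FIELDS = ["name", "temperature.gpu", "clocks.sm", "clocks.mem", "utilization.gpu"]


def parse_gpu_csv(text, fields=FIELDS):
    """Turn nvidia-smi CSV output (no header) into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(fields, (cell.strip() for cell in row))) for row in rows if row]


def query_gpus(fields=FIELDS):
    """Return the state of each GPU, or [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(fields)}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


for gpu in query_gpus():
    print(gpu)
```

Calling `query_gpus()` once before and once after a run is a cheap way to spot thermal throttling: a lower SM clock or a much higher temperature at the end suggests the measurements were not taken under stable conditions.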
XLA GPU Autotuning¶
When benchmarking JAX code on GPU, XLA can spend a long time autotuning kernels (matrix multiplications, convolutions, etc.) during compilation. This overhead is paid on every compilation, so when only some of the methods being compared trigger autotuning, their compilation times are not directly comparable.
The autotune level maps to the --xla_gpu_autotune_level XLA flag:
0: No autotuning. Fastest compilation, may use suboptimal kernels.
1–4: Increasing autotuning effort. Higher levels try more algorithm variants, increasing compilation time but potentially finding faster kernels.
Important
XLA reads the XLA_FLAGS environment variable once, when the JAX backend is initialized (typically at the first JAX operation).
Changing it after that point has no effect, so the safest approach is to set it before importing JAX:
import os
os.environ['XLA_FLAGS'] = '--xla_gpu_autotune_level=0'
import jax  # XLA_FLAGS is read once, when the backend initializes
Or from the shell:
XLA_FLAGS=--xla_gpu_autotune_level=0 python benchmark.py
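Assigning to `os.environ['XLA_FLAGS']` directly discards any flags already set in the shell. If that matters, a small helper can merge the new flag into the existing value instead. This is a sketch: `set_xla_flag` is a hypothetical name, and it assumes flags are separated by whitespace and written as `--name=value`.

```python
import os


def set_xla_flag(name, value, environ=os.environ):
    """Set --name=value in XLA_FLAGS, replacing an earlier setting of it."""
    prefix = f"--{name}="
    # Keep every existing flag except a previous value of the same flag.
    flags = [f for f in environ.get("XLA_FLAGS", "").split() if not f.startswith(prefix)]
    flags.append(f"{prefix}{value}")
    environ["XLA_FLAGS"] = " ".join(flags)
    return environ["XLA_FLAGS"]


# Must run before the JAX backend initializes, so call it before `import jax`.
set_xla_flag("xla_gpu_autotune_level", 0)
print(os.environ["XLA_FLAGS"])
```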
Environment Variables¶
JAX-specific¶
# Disable JAX memory preallocation (useful for memory profiling)
export XLA_PYTHON_CLIENT_PREALLOCATE=false
# Set specific GPU
export CUDA_VISIBLE_DEVICES=0
# Disable JAX compilation cache (for cold-start benchmarks)
export JAX_ENABLE_COMPILATION_CACHE=false
General Python¶
# Disable Python's hash randomization for reproducibility
export PYTHONHASHSEED=0
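`PYTHONHASHSEED` is read when the interpreter starts, so setting it in `os.environ` from inside a running script does not affect that script. When a driver script launches the benchmark, the variables can be passed to the child process instead. The following is a sketch; `benchmark_env` is a hypothetical helper, and the child command is a stand-in for `python benchmark.py`.

```python
import os
import subprocess
import sys


def benchmark_env(gpu="0"):
    """Return a copy of the current environment with the variables above set."""
    env = dict(os.environ)
    env.update({
        "XLA_PYTHON_CLIENT_PREALLOCATE": "false",
        "CUDA_VISIBLE_DEVICES": gpu,
        "JAX_ENABLE_COMPILATION_CACHE": "false",
        "PYTHONHASHSEED": "0",
    })
    return env


# Stand-in for `python benchmark.py`: the child just echoes one variable.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['PYTHONHASHSEED'])"],
    env=benchmark_env(),
    capture_output=True,
    text=True,
    check=True,
)
print("child sees PYTHONHASHSEED =", child.stdout.strip())
```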
Quick Setup Script¶
Here’s a script that applies common optimizations:
#!/bin/bash
# setup_benchmark_env.sh - Run as root
set -e
echo "Setting up benchmark environment..."
# CPU optimizations
cpupower frequency-set -g performance
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || \
echo 0 > /sys/devices/system/cpu/cpufreq/boost 2>/dev/null || \
echo "Could not disable turbo boost"
# GPU optimizations (if NVIDIA GPU present)
if command -v nvidia-smi &> /dev/null; then
    nvidia-smi -pm 1
    # Optionally lock clocks here
fi
echo "Benchmark environment ready."
echo "Run your benchmark with: taskset -c 0-3 nice -n -20 python benchmark.py"
Verification Checklist¶
Before running benchmarks, verify:
CPU governor is set to performance
Turbo boost is disabled
GPU clocks are locked (for GPU benchmarks)
No other intensive processes are running
System temperature is stable
Sufficient warm-up iterations have been run
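The "no other intensive processes" item can be approximated automatically with the system load average. This is a rough sketch for Unix-like systems: `system_is_quiet` is a hypothetical helper, and the default threshold of 0.1 runnable tasks per CPU is an arbitrary assumption that you should tune for your machine.

```python
import os


def system_is_quiet(max_load_per_cpu=0.1):
    """Return True when the 1-minute load average per CPU is below the limit."""
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    return load_1min / cpus < max_load_per_cpu


if system_is_quiet():
    print("system looks idle")
else:
    print("system is busy; results may be noisy")
```

The load average lags by design, so a machine that just finished a heavy job may still be reported as busy for a minute or so.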
Interpreting Results¶
Tip
Use verbose=True to inspect the actual code being benchmarked and verify it matches your expectations.
Even with all optimizations, some variance is expected:
< 1% variance: Excellent, highly reproducible
1-5% variance: Good, typical for well-controlled environments
5-10% variance: Acceptable, may indicate some system noise
> 10% variance: Investigate system configuration
zeropybench reports the interquartile range (IQR) as a percentage, which helps identify unstable measurements.
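For reference, a relative IQR can be computed from raw timings as follows. This sketch assumes the IQR is expressed as a percentage of the median and uses the "inclusive" quartile method; zeropybench's exact formula is not stated here and may differ. `iqr_percent` and `classify` are hypothetical helpers, the sample timings are made up, and the boundary handling between bands is a choice made for this example.

```python
import statistics


def iqr_percent(samples):
    """Interquartile range as a percentage of the median."""
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return 100.0 * (q3 - q1) / statistics.median(samples)


def classify(percent):
    """Map a spread percentage onto the bands listed above."""
    if percent < 1:
        return "excellent"
    if percent < 5:
        return "good"
    if percent <= 10:
        return "acceptable"
    return "investigate"


timings_ms = [100.0, 101.0, 102.0, 103.0, 104.0]  # made-up timings
spread = iqr_percent(timings_ms)
print(f"IQR = {spread:.2f}% of median -> {classify(spread)}")
```

Unlike the standard deviation, the IQR ignores the slowest and fastest quarter of the samples, so a single interrupted repetition does not inflate it.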