Statistics Deep Dive

This page explains exactly how NBenchmark collects and analyses measurements. You don't need to understand all of this to use the library - the Key Concepts page covers the practical side. This is for readers who want the full mathematical picture.

The measurement loop

For each benchmark, NBenchmark runs the following sequence:

  1. Warmup - run the action WarmupIterations times (default: 25) without recording timings.
  2. Post-warmup GC - force a full gen-2 garbage collection to establish a clean heap baseline.
  3. Measurement loop - for each of the Iterations (default: 200) measured runs:
    • If ForceGcBeforeEachIteration is true, force a gen-0 collection.
    • Call iterationSetup if provided.
    • Record Stopwatch.GetTimestamp().
    • Invoke the benchmark action.
    • Calculate elapsed time with Stopwatch.GetElapsedTime(timestamp).
    • Record allocation delta if MeasureAllocations is true.
    • Call iterationTeardown if provided.

Important: the timer is read immediately after the action returns, before teardown runs. Teardown time is not included in the measurement.

The raw timing data (in nanoseconds) is stored in a double[] of length Iterations.
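
The sequence above can be sketched in a few lines. This is an illustrative Python analogue, not NBenchmark's actual code: the function name `measure` and its parameters are hypothetical, `time.perf_counter_ns` stands in for `Stopwatch`, and `gc.collect` stands in for the .NET collector. It also assumes setup and teardown are skipped during warmup, which the description above does not specify.

```python
import gc
import time


def measure(action, iterations=200, warmup_iterations=25,
            force_gc_before_each_iteration=False,
            iteration_setup=None, iteration_teardown=None):
    """Return per-iteration timings in nanoseconds (hypothetical helper)."""
    # 1. Warmup: run the action without recording timings.
    for _ in range(warmup_iterations):
        action()

    # 2. Post-warmup full collection for a clean heap baseline.
    gc.collect()

    # 3. Measurement loop.
    timings_ns = []
    for _ in range(iterations):
        if force_gc_before_each_iteration:
            gc.collect(0)  # youngest generation only
        if iteration_setup is not None:
            iteration_setup()

        start = time.perf_counter_ns()
        action()
        # The timer is read before teardown, so teardown is not measured.
        elapsed = time.perf_counter_ns() - start

        if iteration_teardown is not None:
            iteration_teardown()
        timings_ns.append(float(elapsed))
    return timings_ns


samples = measure(lambda: sum(range(1000)), iterations=20, warmup_iterations=5)
print(len(samples))  # 20
```

Reading the timer before teardown is the important ordering detail: moving the second `perf_counter_ns()` call below the teardown would silently inflate every sample.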

Timer resolution

NBenchmark uses System.Diagnostics.Stopwatch, which wraps the platform's high-resolution performance counter. The resolution is printed at the start of each BenchmarkHost run:

Timer resolution: 1,000,000,000 ticks/s (1.00 ns per tick)

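The tick length depends on the platform counter. On Linux and macOS, .NET typically reports 1,000,000,000 ticks/s (1 ns per tick). On Windows the performance counter commonly runs at 10,000,000 ticks/s (100 ns per tick), and some virtual machines are coarser still. Check the printed line rather than assuming 1 ns.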
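
The printed line is simple arithmetic on the counter frequency. The sketch below reproduces the format shown above; the helper names are hypothetical and the 10 MHz input is only an example of a coarser counter.

```python
def ns_per_tick(ticks_per_second: int) -> float:
    """Length of one timer tick in nanoseconds."""
    return 1_000_000_000 / ticks_per_second


def format_resolution(ticks_per_second: int) -> str:
    """Format the frequency the way the sample output above does."""
    return (f"Timer resolution: {ticks_per_second:,} ticks/s "
            f"({ns_per_tick(ticks_per_second):.2f} ns per tick)")


print(format_resolution(1_000_000_000))
print(format_resolution(10_000_000))
```

With a 100 ns tick, any benchmark that completes in well under a microsecond is measured in only a handful of ticks, so its timings will look quantised.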

Outlier trimming

After the measurement loop finishes, the samples are sorted ascending and outliers are removed according to OutlierMode.

| Mode | Algorithm |
| --- | --- |
| None | No trimming. |
| RemoveTop5Percent | Discard the top ceil(n × 0.05) samples. Equivalent to keeping floor(n × 0.95). |
| RemoveTopAndBottom5Percent | Discard floor(n × 0.05) samples from each end. |
| IqrFence | Compute Q1, Q3, and IQR = Q3 − Q1. Discard any sample above Q3 + 1.5 × IQR or below Q1 − 1.5 × IQR. |

The trimmed array is passed to StatsSummary.Compute. The pre-trim raw array is stored separately for use in significance testing.
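
As a rough illustration of the four modes, here is a minimal Python sketch. The function names and the use of strings for the mode are assumptions made for the example; only the trimming rules come from the table above. Integer arithmetic (`n // 20`) replaces `n × 0.05` to avoid floating-point rounding at the boundaries.

```python
import math


def nearest_rank(sorted_samples, p):
    """Nearest-rank percentile: the value at 1-based index ceil(p * n)."""
    n = len(sorted_samples)
    return sorted_samples[max(1, math.ceil(p * n)) - 1]


def trim(samples, mode):
    """Return the sorted samples with outliers removed (hypothetical helper)."""
    s = sorted(samples)
    n = len(s)
    if mode == "None":
        return s
    if mode == "RemoveTop5Percent":
        drop = (n + 19) // 20          # ceil(n * 0.05)
        return s[: n - drop]
    if mode == "RemoveTopAndBottom5Percent":
        k = n // 20                    # floor(n * 0.05)
        return s[k : n - k]
    if mode == "IqrFence":
        q1, q3 = nearest_rank(s, 0.25), nearest_rank(s, 0.75)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return [x for x in s if low <= x <= high]
    raise ValueError(f"unknown mode: {mode}")


raw = list(range(1, 201))              # 200 samples
print(len(trim(raw, "RemoveTop5Percent")))           # 190
print(len(trim(raw, "RemoveTopAndBottom5Percent")))  # 180
```

Note that the percentage modes always discard a fixed number of samples, whereas IqrFence discards nothing when no sample falls outside the fences.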

Quartile definition

IqrFence computes Q1 and Q3 with the same nearest-rank percentile used everywhere else in NBenchmark (equivalent to numpy.percentile(method='inverted_cdf')). This deliberately differs from R's default type = 7 linear interpolation: for a 1..20 ramp NBenchmark gives Q1 = 5, Q3 = 15, whereas R type 7 gives Q1 = 5.75, Q3 = 15.25. The choice keeps every quantile in the library consistent and is pinned by OutlierModeCrossCheckTests.
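
The 1..20 example can be reproduced with two small functions. This is a stdlib-only Python sketch written for this page, not library code; `linear_type7` implements the standard type 7 definition (interpolate at 0-based position (n − 1) × p).

```python
import math


def nearest_rank(sorted_samples, p):
    """Value at 1-based index ceil(p * n); always an actual sample."""
    n = len(sorted_samples)
    return sorted_samples[max(1, math.ceil(p * n)) - 1]


def linear_type7(sorted_samples, p):
    """R's default quantile: linear interpolation at position (n - 1) * p."""
    h = (len(sorted_samples) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_samples) - 1)
    return sorted_samples[lo] + (h - lo) * (sorted_samples[hi] - sorted_samples[lo])


ramp = list(range(1, 21))
print(nearest_rank(ramp, 0.25), nearest_rank(ramp, 0.75))  # 5 15
print(linear_type7(ramp, 0.25), linear_type7(ramp, 0.75))  # 5.75 15.25
```

Nearest-rank never invents a value between two measurements, which is why its quartiles here are whole numbers.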

Descriptive statistics

Given a sorted, trimmed array of n samples:

Mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Median

Computed with the nearest-rank method: the median is the sorted sample at 1-based index i = ceil(0.5 × n). This is the middle value for odd n and the lower of the two middle values for even n.

Percentiles (P95, P99)

Also nearest-rank: i = ceil(p × n).

Min and Max

samples[0] and samples[n-1] of the sorted, trimmed array.

Sample standard deviation (Bessel's correction)

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

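The n − 1 denominator (Bessel's correction) makes s² an unbiased estimator of the population variance. The square root s is still a slightly low-biased estimate of the population standard deviation, but the bias is negligible at typical iteration counts. For n = 1, the standard deviation is reported as 0.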

Standard error of the mean

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}$$

SEM measures how precisely the mean is estimated. For n = 1, SEM is 0.
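
A small worked example may help tie the mean, standard deviation, and SEM formulas together. The `describe` helper below is hypothetical and the five timings are made up for illustration; `statistics.stdev` from the Python standard library is used only as a cross-check, since it also applies Bessel's correction.

```python
import math
import statistics


def describe(samples):
    """Return (mean, sample standard deviation, SEM); 0 spread for n = 1."""
    n = len(samples)
    mean = sum(samples) / n
    if n == 1:
        return mean, 0.0, 0.0
    s = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    sem = s / math.sqrt(n)
    return mean, s, sem


timings_ns = [102.0, 98.0, 101.0, 99.0, 100.0]
mean, s, sem = describe(timings_ns)
print(mean, round(s, 4), round(sem, 4))
print(math.isclose(s, statistics.stdev(timings_ns)))
```

Here the squared deviations sum to 10, so s = sqrt(10 / 4) ≈ 1.58 ns and SEM = s / sqrt(5) ≈ 0.71 ns.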

Confidence interval on the mean

The margin of error is the half-width of the confidence interval:

$$\mathrm{MoE} = t_{\alpha/2,\,n-1} \times \mathrm{SEM}$$

where $t_{\alpha/2,\,n-1}$ is the two-tailed critical value of Student's t-distribution at the configured confidence level and n − 1 degrees of freedom.

The confidence interval is:

$$\bar{x} \pm \mathrm{MoE} = \left[\bar{x} - \mathrm{MoE},\ \bar{x} + \mathrm{MoE}\right]$$

Why Student's t and not the normal distribution?

The normal distribution's critical value (e.g. 1.96 for 95%) assumes the population standard deviation is known. In benchmarking it is not - we estimate it from the sample. Student's t compensates by using wider critical values for small sample sizes, shrinking towards the normal as n grows.

With the default 200 iterations (190 samples after 5% trimming, so df = 189), the t critical value at 95% is approximately 1.973 - very close to the normal 1.960, so the practical difference is small.
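
To make the size of the interval concrete, the sketch below plugs the 1.973 critical value into the margin-of-error formula. The mean of 1000 ns and standard deviation of 50 ns are invented inputs, and `confidence_interval` is a hypothetical helper, not part of the library.

```python
import math


def confidence_interval(mean, s, n, t_critical):
    """Return (lower, upper, margin_of_error) for the CI on the mean."""
    sem = s / math.sqrt(n)
    moe = t_critical * sem
    return mean - moe, mean + moe, moe


# 190 trimmed samples, 95% confidence, t critical value from the text above.
lower, upper, moe = confidence_interval(mean=1000.0, s=50.0, n=190, t_critical=1.973)
print(f"{lower:.2f} .. {upper:.2f} (MoE {moe:.2f} ns)")
```

Swapping in the normal value 1.960 changes the margin by well under 1%, which is the "practical difference is small" point made above.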

Honest caveats

The CI is on the mean and relies on the Central Limit Theorem: for large enough n, the sample mean is approximately normally distributed. For n ≥ 30 this is generally safe even when the underlying distribution is not normal. For very small sample counts (e.g. a parameterised benchmark with 10 iterations) the approximation is weaker, but the t-distribution's heavier tails at low degrees of freedom provide some protection.

t-critical values in practice

| Confidence level | n = 10 (df=9) | n = 30 (df=29) | n = 200 (df=199) | Normal (df=∞) |
| --- | --- | --- | --- | --- |
| 90% | 1.833 | 1.699 | 1.652 | 1.645 |
| 95% | 2.262 | 2.045 | 1.972 | 1.960 |
| 99% | 3.250 | 2.756 | 2.601 | 2.576 |

Dependency-free implementation

NBenchmark computes the t critical value without any external libraries using exact closed forms for df = 1 and df = 2, and the Cornish-Fisher expansion (Abramowitz & Stegun §26.7.5) for df ≥ 3. The normal quantile uses Acklam's rational approximation (max error < 1.15 × 10⁻⁹).

These approximations are cross-checked against SciPy on every build: the t critical value matches scipy.stats.t.ppf to machine precision for df = 1, 2 and to better than 1% for df ≥ 3 (worst case ≈ 0.79% at df = 3, 99%). See Validation & Accuracy for the full tolerance table.
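
The approach can be sketched in Python as follows. This is an approximation of the idea, not the library's source: it uses `statistics.NormalDist().inv_cdf` from the standard library in place of Acklam's approximation, and it keeps four correction terms of the A&S 26.7.5 series, which is an assumption about how many terms are used. The function name `t_critical` is hypothetical.

```python
import math
from statistics import NormalDist


def t_critical(confidence, df):
    """Two-tailed Student's t critical value (sketch, see assumptions above)."""
    p = 1.0 - (1.0 - confidence) / 2.0      # cumulative probability, e.g. 0.975
    if df == 1:
        # Exact: the t-distribution with 1 df is the Cauchy distribution.
        return math.tan(math.pi * (p - 0.5))
    if df == 2:
        # Exact closed form for 2 degrees of freedom.
        return (2.0 * p - 1.0) / math.sqrt(2.0 * p * (1.0 - p))

    # df >= 3: expand around the normal quantile x in powers of 1/df.
    x = NormalDist().inv_cdf(p)
    g1 = (x**3 + x) / 4.0
    g2 = (5 * x**5 + 16 * x**3 + 3 * x) / 96.0
    g3 = (3 * x**7 + 19 * x**5 + 17 * x**3 - 15 * x) / 384.0
    g4 = (79 * x**9 + 776 * x**7 + 1482 * x**5 - 1920 * x**3 - 945 * x) / 92160.0
    return x + g1 / df + g2 / df**2 + g3 / df**3 + g4 / df**4


for df in (9, 29, 199):
    print(df, round(t_critical(0.95, df), 3))
```

The series converges quickly as df grows, so the error is concentrated at very low degrees of freedom and high confidence levels, which is where the tolerance above is loosest.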

Coefficient of variation

$$\mathrm{CV} = \frac{s}{\bar{x}}$$

A dimensionless relative measure of variability. A CV of 0.05 means the standard deviation is 5% of the mean - the benchmark is fairly stable. A CV of 0.5 or higher indicates high variability and the results should be treated with caution.
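
As a quick illustration, the snippet below computes the CV for a small made-up set of timings using only the Python standard library. The helper name is hypothetical.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation (Bessel) divided by the mean."""
    return statistics.stdev(samples) / statistics.fmean(samples)


stable = [102.0, 98.0, 101.0, 99.0, 100.0]
print(round(coefficient_of_variation(stable), 4))
```

For these samples the standard deviation is about 1.58 ns on a mean of 100 ns, so the CV is about 0.016, comfortably in the stable range.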

Allocation measurement

When MeasureAllocations = true, each iteration records:

beforeThreadId    = CurrentManagedThreadId
beforeThreadBytes = GC.GetAllocatedBytesForCurrentThread()
beforeProcess     = GC.GetTotalAllocatedBytes()
// action runs
if CurrentManagedThreadId == beforeThreadId:
   allocations[i] = Max(0, GC.GetAllocatedBytesForCurrentThread() - beforeThreadBytes)
else:
   allocations[i] = Max(0, GC.GetTotalAllocatedBytes() - beforeProcess)

The reported MeanAllocatedBytes is the arithmetic mean across all iterations. This includes any allocations made by the benchmark framework itself that appear between the two reads - in practice, for simple benchmarks, this is usually negligible.

In synchronous benchmarks this is thread-local (GC.GetAllocatedBytesForCurrentThread) and does not include allocations from other threads. In async benchmarks, if the continuation hops threads, NBenchmark falls back to process-wide delta for that sample, which can include background allocation noise.

Statistical significance: Mann-Whitney U test

When two or more benchmarks have been run, NBenchmark tests whether the difference in their distributions is statistically significant using the Mann-Whitney U test (also called the Wilcoxon rank-sum test).

Why Mann-Whitney U?

Benchmark timings are typically right-skewed (a few slow outliers) and do not follow a normal distribution. Parametric tests like the t-test assume normality. The Mann-Whitney U test is non-parametric: it works on the ranks of the combined values rather than on their moments, so it does not assume normality and is not dominated by how extreme the slowest samples are.

Algorithm

Given the pre-trim raw samples of two benchmarks A (length n₁) and B (length n₂):

  1. Merge and sort all n₁ + n₂ values together, recording which sample each came from.
  2. Assign mid-ranks to tied values: all tied observations share the average rank of the positions they occupy.
  3. Compute the rank sum for group A: $R_1 = \sum_{i=1}^{n_1} \operatorname{rank}(A_i)$.
  4. Compute the U statistics:
     $$U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = n_1 n_2 - U_1, \qquad U = \min(U_1, U_2)$$
  5. For large samples (n₁ ≥ 5 and n₂ ≥ 5), use the normal approximation with a tie correction to compute a z-score, then derive a two-tailed p-value.

A p-value below 0.05 is considered significant (✓ in the Sig column). This threshold is fixed and is not configurable.

The normal approximation uses no continuity correction, so it corresponds to scipy.stats.mannwhitneyu(..., method='asymptotic', use_continuity=False) - which NBenchmark matches to better than 1e-6. On small samples this approximation can differ from the exact permutation p-value by up to ≈ 0.05; that gap is pinned and documented in Validation & Accuracy.
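
The steps above can be sketched end to end in stdlib Python. This is an illustrative reimplementation under stated assumptions, not NBenchmark's code: the function name is hypothetical, the tie-corrected variance is the standard textbook formula, and `None` stands in for the library's null result when a group has fewer than 5 samples.

```python
import math
from statistics import NormalDist


def mann_whitney_u(a, b):
    """Return (U, two-tailed p), or (None, None) if a group has < 5 samples."""
    n1, n2 = len(a), len(b)
    if n1 < 5 or n2 < 5:
        return None, None
    n = n1 + n2

    # 1. Merge and sort, remembering the source group (0 = A, 1 = B).
    merged = sorted([(x, 0) for x in a] + [(x, 1) for x in b])

    # 2. Mid-ranks: tied values share the average of their 1-based positions.
    ranks = [0.0] * n
    tie_term = 0.0                      # sum of t^3 - t over tie groups
    i = 0
    while i < n:
        j = i
        while j + 1 < n and merged[j + 1][0] == merged[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    # 3-4. Rank sum for A, then the U statistics.
    r1 = sum(r for r, (_, group) in zip(ranks, merged) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    # 5. Normal approximation with tie correction, no continuity correction.
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:                      # every value identical
        return u, 1.0
    z = (u - mu) / sigma
    return u, min(1.0, 2 * NormalDist().cdf(-abs(z)))


u, p = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(u, round(p, 4))
```

In this example the two groups do not overlap at all, so U is 0 and the p-value is about 0.009, below the 0.05 threshold even with only five samples per group.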

NOTE

NBenchmark uses the pre-trim raw samples (before outlier removal) for significance testing. This gives the test more data to work with. However, it also means that significance is assessed on the full distribution, including extreme measurements.

Minimum sample requirement

The test requires at least 5 samples in each group. With fewer samples the normal approximation is unreliable and the test returns null (no significance indicator is shown).

Summary of all reported statistics

| Field | Formula | Description |
| --- | --- | --- |
| Median | Nearest-rank P50 | Robust central tendency. |
| Mean | $\bar{x} = \frac{1}{n}\sum x_i$ | Arithmetic average. |
| P95 | Nearest-rank P95 | 95th percentile. |
| P99 | Nearest-rank P99 | 99th percentile. |
| Min | $x_1$ (sorted) | Fastest measured sample. |
| Max | $x_n$ (sorted) | Slowest measured sample. |
| StandardDeviation | $s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$ | Spread of measurements (Bessel). |
| StandardError | $s / \sqrt{n}$ | Precision of the mean estimate. |
| MarginOfError | $t \times \mathrm{SEM}$ | Half-width of CI on the mean. |
| ConfidenceIntervalLower | $\bar{x} - \mathrm{MoE}$ | Lower CI bound. |
| ConfidenceIntervalUpper | $\bar{x} + \mathrm{MoE}$ | Upper CI bound. |
| CoefficientOfVariation | $s / \bar{x}$ | Relative variability. |
| PValue | Mann-Whitney U | Two-tailed p-value vs. baseline. |
| SignificanceVerdict | $p < 0.05$ | Whether the difference is statistically significant (Significant, NotSignificant, or NotTested). |
| MeanAllocatedBytes | Mean of iteration deltas | Mean heap allocation per iteration. |

Released under the MIT License.