punit

punit is a JUnit 5 extension framework for probabilistic testing. It is designed for systems where behaviour is non-deterministic by nature — LLM integrations, ML model inference, distributed systems, and randomised algorithms.

How it works

Instead of the traditional binary pass/fail model, punit executes a test multiple times and treats each run as a Bernoulli trial. It then applies statistical inference to determine whether the observed success rate meets a defined threshold at a given confidence level.

Key capabilities

Probabilistic tests (@ProbabilisticTest) — run a test method multiple times and evaluate the observed pass rate against a threshold, with configurable confidence levels
Three experiment modes — Explore (compare configurations with small samples), Optimize (iteratively tune parameters like temperature or prompts), and Measure (establish empirical baselines with 1000+ samples)
Use cases and service contracts — define reusable success criteria with postconditions, derived checks, and duration constraints, evaluated in a fail-fast hierarchy
Spec-driven baselines — measurement experiments produce YAML spec files capturing observed success rates, confidence intervals, latency percentiles, and covariate values, committed to version control as regression baselines
Latency assertions — evaluate response times at percentile level (p50, p90, p95, p99), not averages, revealing tail behaviour that means hide
Covariate-aware matching — track environmental factors (model, temperature, time of day, infrastructure) and automatically select the most appropriate baseline for the current conditions
Budget and pacing controls — set time budgets, token budgets, and API rate limits; punit computes optimal execution pace and stops when resources are exhausted
Compliance and conformance testing — verify against mandated SLA/SLO thresholds (compliance) or detect drift from empirical baselines (conformance)
Verification vs. smoke intent — declare whether a test is an evidential claim (with enforced minimum sample sizes) or a lightweight early-warning check
The Sentinel — a JUnit-free runtime engine for monitoring stochastic behaviours in deployed environments without test framework dependencies
HTML reporting — standalone reports with per-test statistical detail, confidence intervals, z-scores, latency percentiles, and covariate mismatch warnings

The parameter triangle

You control two of three variables — sample size, confidence, and threshold — and statistics determines the third. punit supports three configuration approaches: sample-size-first, confidence-first, and threshold-first.

Get started

Visit the punit repository on GitHub for installation instructions and full documentation.

See punit examples for a complete worked application demonstrating all major features.