Shewhart, Toyota, and the Probabilistic Turn
LLMs break software’s binary view of correctness. The discipline that replaces it already exists: it was built at Bell Labs, refined in Nagoya, and has been waiting for software to need it.

Bringing statistical rigour to software testing — so engineering teams can satisfy the regulatory demands of AI performance measurement and regression testing.
AI systems, probabilistic models, and non-deterministic processes are increasingly subject to regulatory scrutiny. Organisations must demonstrate measurable, reproducible evidence that these systems perform within acceptable bounds — not just once, but continuously.
Traditional unit testing assumes deterministic outcomes. That assumption never fully withstood scrutiny, but AI removes any remaining choice: uncertainty has to be managed professionally, and that means statistically.
Define statistical expectations for your system's behaviour. Assert against distributions, not exact values. punit gives you the vocabulary to express what "correct" means in a non-deterministic context.
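As a rough illustration of asserting against a distribution rather than an exact value, the sketch below checks whether an observed pass rate supports a minimum required rate, using the lower bound of the Wilson score interval. This is not punit's API: the class name `PassRateCheck`, its methods, and the choice of the Wilson interval are all assumptions made for illustration.

```java
/** Hypothetical helper for illustration only; not part of punit. */
final class PassRateCheck {

    /**
     * Lower bound of the Wilson score interval for a binomial proportion.
     * z is the standard normal quantile (for example 1.96 for a two-sided 95% interval).
     */
    static double wilsonLowerBound(int successes, int trials, double z) {
        if (trials <= 0 || successes < 0 || successes > trials) {
            throw new IllegalArgumentException("need 0 <= successes <= trials and trials > 0");
        }
        double p = (double) successes / trials;
        double z2 = z * z;
        double centre = p + z2 / (2.0 * trials);
        double margin = z * Math.sqrt(p * (1 - p) / trials + z2 / (4.0 * trials * trials));
        return (centre - margin) / (1 + z2 / trials);
    }

    /** True when the data supports a true pass rate of at least minPassRate. */
    static boolean meetsThreshold(int successes, int trials, double minPassRate, double z) {
        return wilsonLowerBound(successes, trials, z) >= minPassRate;
    }
}
```

With 95 passes out of 100 samples, this check accepts a required rate of 0.85 but not 0.90: the point estimate is 0.95, yet 100 samples are not enough to rule out a true rate below 0.90.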
Detect when your system drifts beyond acceptable bounds. Run repeatable hypothesis tests in CI/CD and catch performance degradation before it reaches production.
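One way to frame drift detection as a repeatable hypothesis test is an exact one-sided binomial test against a previously measured baseline rate. The sketch below is an assumption about how such a check could look, not a description of how punit does it; `RegressionCheck`, `binomialCdf`, and `regressed` are hypothetical names.

```java
/** Hypothetical helper for illustration only; not part of punit. */
final class RegressionCheck {

    /** P(X <= k) for X ~ Binomial(n, p), with each term computed in log space to avoid overflow. */
    static double binomialCdf(int k, int n, double p) {
        if (n <= 0 || k < 0 || k > n || p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("need 0 <= k <= n, n > 0 and 0 < p < 1");
        }
        double sum = 0.0;
        double logChoose = 0.0; // log C(n, 0)
        for (int i = 0; i <= k; i++) {
            if (i > 0) {
                // C(n, i) = C(n, i - 1) * (n - i + 1) / i
                logChoose += Math.log(n - i + 1) - Math.log(i);
            }
            double logPmf = logChoose + i * Math.log(p) + (n - i) * Math.log1p(-p);
            sum += Math.exp(logPmf);
        }
        return Math.min(1.0, sum);
    }

    /**
     * Null hypothesis: the true success rate is still at least baselineRate.
     * Returns true when observing this few successes would be rarer than alpha under that hypothesis.
     */
    static boolean regressed(int successes, int trials, double baselineRate, double alpha) {
        double pValue = binomialCdf(successes, trials, baselineRate);
        return pValue < alpha;
    }
}
```

In a CI job, a call such as `RegressionCheck.regressed(85, 100, 0.95, 0.05)` would flag a regression, while 94 successes out of 100 against the same baseline would not, because that shortfall is well within sampling noise.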
Produce auditable, structured evidence that your AI systems perform as expected. Give regulators, auditors, and risk committees the confidence they need.
Probabilistic unit testing for Java
punit is a JUnit 5 extension that runs tests multiple times and applies statistical inference to determine whether a non-deterministic system is behaving acceptably. Explore configurations, measure empirical baselines, and run regression tests in CI/CD — with configurable confidence levels, latency percentile assertions, and auditable verdicts.
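To make the idea of a latency percentile assertion concrete, here is a minimal sketch that computes a nearest-rank percentile over recorded latencies and compares it with a limit. It does not use punit's annotations or API, which are not shown here; `LatencyPercentile` and its methods are hypothetical, and nearest-rank is only one of several percentile definitions.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of punit. */
final class LatencyPercentile {

    /** Nearest-rank percentile: the smallest sample with at least pct percent of samples at or below it. */
    static long percentile(long[] samplesMillis, double pct) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (pct <= 0.0 || pct > 100.0) {
            throw new IllegalArgumentException("pct must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the given percentile of the samples is at or below the limit. */
    static boolean withinLimit(long[] samplesMillis, double pct, long limitMillis) {
        return percentile(samplesMillis, pct) <= limitMillis;
    }
}
```

A test harness could collect one latency sample per repetition and then call `withinLimit(samples, 95, 800)` to express a requirement such as "p95 no worse than 800 ms"; the 800 ms figure is an arbitrary example value.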
A complete example application demonstrating punit's capabilities — including an LLM-powered shopping basket tested with explore, measure, and optimize experiments, and a payment gateway verified against SLA thresholds.
A Java framework that bridges deterministic application code with fallible, non-deterministic operations. Replaces try/catch with type-safe Outcome values, structured failure classification, policy-driven retries, and built-in observability.
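The general pattern described here can be sketched in a few lines. The code below is not the framework's actual API: the `Outcome` shape, the `FailureKind` categories, the exception-to-category mapping, and the `retry` helper are all assumptions chosen to illustrate the idea, and the sketch requires Java 17 or later for sealed interfaces and records.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of an outcome type; names and behaviour are illustrative assumptions. */
sealed interface Outcome<T> {

    record Ok<T>(T value) implements Outcome<T> {}

    record Fail<T>(FailureKind kind, String message) implements Outcome<T> {}

    /** Assumed two-way classification: failures worth retrying and failures that are not. */
    enum FailureKind { TRANSIENT, PERMANENT }

    /** Runs the operation once and converts any runtime exception into a classified Fail value. */
    static <T> Outcome<T> attempt(Supplier<T> operation) {
        try {
            return new Ok<>(operation.get());
        } catch (IllegalStateException e) {
            // Assumption for this sketch: IllegalStateException signals a transient condition.
            return new Fail<>(FailureKind.TRANSIENT, e.getMessage());
        } catch (RuntimeException e) {
            return new Fail<>(FailureKind.PERMANENT, e.getMessage());
        }
    }

    /** Simple retry policy: repeat only while the last failure was transient, up to maxAttempts runs. */
    static <T> Outcome<T> retry(Supplier<T> operation, int maxAttempts) {
        Outcome<T> last = attempt(operation);
        for (int i = 1;
             i < maxAttempts && last instanceof Fail<T> f && f.kind() == FailureKind.TRANSIENT;
             i++) {
            last = attempt(operation);
        }
        return last;
    }
}
```

Callers then branch on the returned value, for example with `instanceof Outcome.Ok<String> ok`, instead of wrapping each call site in try/catch; the exception handling lives in one place at the boundary.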
A watershed-moment essay on why AI amplifies, rather than creates, software quality - and why the responsibility rests with those who own the system.