AI systems are increasingly subject to regulatory scrutiny, yet the testing tools available to most engineering teams were designed for a deterministic world. When your system’s output varies by design, how do you prove it works? Javai exists to answer that question.
The gap
Regulators — from financial authorities to healthcare bodies — are asking organisations to demonstrate that their AI systems perform reliably and consistently. The problem is that standard testing frameworks don’t have the vocabulary for this. A JUnit assertion like assertEquals(expected, actual) simply doesn’t apply when the “correct” answer is a distribution.
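To make the mismatch concrete, here is a plain-Java sketch. The `classify` stub and its 90% success rate are invented for illustration — the point is that when a system is correct only *probabilistically*, the meaningful assertion is about a rate, not a single output:

```java
import java.util.Random;

public class WhyEqualityFails {
    // Hypothetical stand-in for a non-deterministic AI call: by design it
    // returns the right label only ~90% of the time.
    static String classify(String input, Random rng) {
        return rng.nextDouble() < 0.9 ? "POSITIVE" : "NEGATIVE";
    }

    // Measure the success rate over many trials instead of asserting one output.
    static double observedRate(long seed, int trials) {
        Random rng = new Random(seed);
        int correct = 0;
        for (int i = 0; i < trials; i++) {
            if ("POSITIVE".equals(classify("great product", rng))) correct++;
        }
        return (double) correct / trials;
    }

    public static void main(String[] args) {
        // assertEquals("POSITIVE", classify(...)) would fail on roughly 10% of
        // runs; the meaningful claim is about the distribution of outcomes.
        System.out.println("observed success rate = " + observedRate(42L, 1_000));
    }
}
```

A single `assertEquals` against one invocation of `classify` would make the build red about one run in ten — not because the system regressed, but because the test asks the wrong question.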
What we’re building
punit is a JUnit 5 extension that runs tests multiple times and applies statistical inference to judge whether a non-deterministic system is behaving acceptably. You can explore configurations, establish empirical baselines through measurement experiments, and then run regression tests in CI/CD — with configurable confidence levels, latency percentile assertions, and auditable verdicts.
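The statistics behind such a verdict can be sketched in plain Java. This is not punit's API — just the underlying idea: run the operation n times, then pass only if a confidence lower bound (here, the Wilson score interval at ~95%) on the true success rate clears the required threshold:

```java
public class StatisticalVerdict {
    // Wilson score lower bound for a Bernoulli success rate.
    // z = 1.96 corresponds to ~95% confidence.
    static double wilsonLowerBound(int successes, int n, double z) {
        if (n == 0) return 0.0;
        double p = (double) successes / n;
        double z2 = z * z;
        double centre = p + z2 / (2 * n);
        double margin = z * Math.sqrt((p * (1 - p) + z2 / (4 * n)) / n);
        return (centre - margin) / (1 + z2 / n);
    }

    // Verdict: pass iff we are ~95% confident the true rate meets the requirement.
    static boolean passes(int successes, int n, double required) {
        return wilsonLowerBound(successes, n, 1.96) >= required;
    }

    public static void main(String[] args) {
        // 92 successes in 100 trials: confidently above an 80% requirement...
        System.out.println(passes(92, 100, 0.80));
        // ...but 60/100 is confidently not.
        System.out.println(passes(60, 100, 0.80));
    }
}
```

The appeal over a naive `successes / n >= required` check is that the verdict accounts for sample size: 9 successes out of 10 and 900 out of 1,000 have the same point estimate but very different evidential weight.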
Alongside punit, outcome provides a formal boundary between your deterministic application code and the fallible operations it depends on — replacing try/catch with type-safe result values, structured failure classification, and policy-driven retries.
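The pattern outcome formalises can be illustrated with a minimal result type. The names here (`Result`, `Ok`, `Err`, `Failure`, `parsePort`) are invented for this sketch, not outcome's actual API:

```java
public class ResultDemo {
    // Structured failure classification: transient failures are retry
    // candidates under a policy; permanent ones are not.
    enum Kind { TRANSIENT, PERMANENT }
    record Failure(Kind kind, String message) {}

    // A fallible operation returns a value describing success or failure;
    // no exception crosses the boundary.
    sealed interface Result<T> {
        record Ok<T>(T value) implements Result<T> {}
        record Err<T>(Failure failure) implements Result<T> {}
    }

    static Result<Integer> parsePort(String raw) {
        try {
            int p = Integer.parseInt(raw.trim());
            if (p < 1 || p > 65535)
                return new Result.Err<>(new Failure(Kind.PERMANENT, "out of range: " + p));
            return new Result.Ok<>(p);
        } catch (NumberFormatException e) {
            return new Result.Err<>(new Failure(Kind.PERMANENT, "not a number: " + raw));
        }
    }

    public static void main(String[] args) {
        // The compiler forces the caller to handle both cases explicitly.
        String msg = switch (parsePort("8080")) {
            case Result.Ok<Integer> ok -> "port " + ok.value();
            case Result.Err<Integer> err -> "failed: " + err.failure().message();
        };
        System.out.println(msg);
    }
}
```

Because `Result` is sealed, the switch is exhaustive: forgetting the failure branch is a compile error rather than an unhandled exception at runtime, which is what makes the boundary between deterministic code and fallible operations enforceable.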
These aren’t academic exercises. They’re practical tools for engineering teams facing the growing regulatory demand for AI performance measurement and regression testing.
Get involved
punit and our other projects are open source. We welcome contributions, feedback, and discussion.
