Architecture8 min read8 October 2025

How to Test AI Systems: Beyond Unit Tests and Happy Paths

Testing AI systems requires fundamentally different approaches from testing traditional software. The output space is infinite, correctness is probabilistic, and the failure modes are subtle.

Ajay Prajapat

AI Systems Architect

You cannot write a unit test that verifies an LLM always gives a correct answer. The output space is effectively infinite, the correctness criterion is probabilistic, and the edge cases emerge from real-world usage in ways that cannot be fully anticipated. Testing AI systems requires a different toolkit: evaluation sets, statistical sampling, adversarial testing, and behavioural consistency checks.

Evaluation Set Testing: The Foundation

The closest AI equivalent to a unit test suite is an evaluation set: a curated collection of inputs with known correct or acceptable outputs. Unlike unit tests, evaluation sets are not run to verify pass/fail — they are run to measure quality scores that are tracked over time.

Maintain a core evaluation set of 100-500 representative inputs with expert-annotated correct outputs
Track quality scores on every code or prompt change — regression is detected when scores drop
Separate the evaluation set from the development process — examples used during prompt development should not be in the evaluation set
Expand the evaluation set after every production incident — each incident is an input that the system failed on

Adversarial Testing: Finding Failure Modes Before Users Do

Prompt injection: test whether malicious user inputs can override system instructions or extract system prompt content
Edge case inputs: empty inputs, extremely long inputs, inputs in unexpected languages, inputs with unusual formatting
Boundary case testing: inputs near the decision boundary for classification systems — where the system should be uncertain
Data quality stress testing: inputs with corrupted data, missing fields, inconsistent formats to verify error handling
Load testing: verify that quality degrades gracefully (not catastrophically) under high concurrent load

Behavioural Consistency Testing

AI systems should be consistent: semantically equivalent inputs should produce semantically equivalent outputs. Consistency testing verifies this property, which is often violated by AI systems in ways that are not caught by evaluation set testing.

Paraphrase invariance: rephrase the same query 5 different ways — outputs should be semantically equivalent
Order invariance (for classification): inputs from the same category should be classified consistently regardless of presentation order
Temperature stability: if using temperature > 0, sample the same input 5 times — all outputs should meet quality criteria
Context sensitivity testing: verify the system uses provided context and does not fabricate when context is absent

Regression Testing After Model Updates

LLM providers update base models without always announcing breaking changes. After any model update (or when you upgrade your model version), run the full evaluation set before updating production. A quality drop of >3% on any primary metric should block the update until investigated. This is the most common source of silent quality degradation in production AI systems.

Back to all articles

Key Takeaways

AI testing requires evaluation sets with tracked quality scores, not pass/fail unit tests
Expand the evaluation set after every production incident — incidents reveal inputs the system cannot handle
Adversarial testing: prompt injection, edge case inputs, boundary cases, data quality stress testing
Behavioural consistency: paraphrase invariance, order invariance, temperature stability
Run the full evaluation set after every model update — silent quality regression from model changes is common
A quality drop >3% on any primary metric after a model update should block production deployment

Apply This To Your Business

Book a strategy call to discuss how these patterns apply to your specific systems and team.

Book a Call

AI Systems Architect

Want to apply these ideas in your business?

A strategy call is where the thinking in these articles meets your specific systems, team, and goals.

Book a Strategy Call