Architecture8 min read8 October 2025

How to Test AI Systems: Beyond Unit Tests and Happy Paths

Testing AI systems requires fundamentally different approaches from testing traditional software. The output space is infinite, correctness is probabilistic, and the failure modes are subtle.

AP

Ajay Prajapat

AI Systems Architect

You cannot write a unit test that verifies an LLM always gives a correct answer. The output space is effectively infinite, the correctness criterion is probabilistic, and the edge cases emerge from real-world usage in ways that cannot be fully anticipated. Testing AI systems requires a different toolkit: evaluation sets, statistical sampling, adversarial testing, and behavioural consistency checks.

Evaluation Set Testing: The Foundation

The closest AI equivalent to a unit test suite is an evaluation set: a curated collection of inputs with known correct or acceptable outputs. Unlike unit tests, evaluation sets are not run to verify pass/fail — they are run to measure quality scores that are tracked over time.

  • Maintain a core evaluation set of 100-500 representative inputs with expert-annotated correct outputs
  • Track quality scores on every code or prompt change — regression is detected when scores drop
  • Separate the evaluation set from the development process — examples used during prompt development should not be in the evaluation set
  • Expand the evaluation set after every production incident — each incident is an input that the system failed on

Adversarial Testing: Finding Failure Modes Before Users Do

  • Prompt injection: test whether malicious user inputs can override system instructions or extract system prompt content
  • Edge case inputs: empty inputs, extremely long inputs, inputs in unexpected languages, inputs with unusual formatting
  • Boundary case testing: inputs near the decision boundary for classification systems — where the system should be uncertain
  • Data quality stress testing: inputs with corrupted data, missing fields, inconsistent formats to verify error handling
  • Load testing: verify that quality degrades gracefully (not catastrophically) under high concurrent load

Behavioural Consistency Testing

AI systems should be consistent: semantically equivalent inputs should produce semantically equivalent outputs. Consistency testing verifies this property, which is often violated by AI systems in ways that are not caught by evaluation set testing.

  • Paraphrase invariance: rephrase the same query 5 different ways — outputs should be semantically equivalent
  • Order invariance (for classification): inputs from the same category should be classified consistently regardless of presentation order
  • Temperature stability: if using temperature > 0, sample the same input 5 times — all outputs should meet quality criteria
  • Context sensitivity testing: verify the system uses provided context and does not fabricate when context is absent

Regression Testing After Model Updates

LLM providers update base models without always announcing breaking changes. After any model update (or when you upgrade your model version), run the full evaluation set before updating production. A quality drop of >3% on any primary metric should block the update until investigated. This is the most common source of silent quality degradation in production AI systems.

AI Systems Architect

Want to apply these ideas in your business?

A strategy call is where the thinking in these articles meets your specific systems, team, and goals.