Evaluation Set Testing: The Foundation
The closest AI equivalent to a unit test suite is an evaluation set: a curated collection of inputs paired with known-correct or acceptable outputs. Unlike unit tests, an evaluation set is not run for binary pass/fail verification; it is run to produce quality scores that are tracked over time.
- Maintain a core evaluation set of 100-500 representative inputs with expert-annotated correct outputs
- Track quality scores on every code or prompt change — regression is detected when scores drop
- Keep the evaluation set separate from the development process: examples used while iterating on prompts must not appear in it, or scores will overestimate real quality (the AI analogue of testing on the training set)
- Expand the evaluation set after every production incident — each incident supplies an input the system is known to have failed on
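The workflow above can be sketched as a small scoring harness. This is a minimal illustration, not a reference implementation: the `EvalCase` dataclass, the token-overlap `token_f1` scorer, and the `check_regression` threshold are all hypothetical choices standing in for whatever annotation schema, quality metric, and alerting policy a real team would use.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One evaluation example: an input and its expert-annotated output."""
    input: str
    expected: str


def token_f1(pred: str, gold: str) -> float:
    """Toy quality metric: F1 over whitespace tokens (stand-in for a real scorer)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def run_eval(model: Callable[[str], str],
             cases: List[EvalCase],
             scorer: Callable[[str, str], float] = token_f1) -> float:
    """Run every case through the model and return the mean quality score."""
    scores = [scorer(model(c.input), c.expected) for c in cases]
    return sum(scores) / len(scores)


def check_regression(current: float, baseline: float,
                     tolerance: float = 0.02) -> bool:
    """Flag a regression when the score drops more than `tolerance` below baseline."""
    return current < baseline - tolerance
```

In CI, `run_eval` would be invoked on every code or prompt change and its score compared against the last accepted baseline with `check_regression`; after a production incident, the failing input is appended to `cases` so the same failure is measured from then on.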