This is the Evals problem. And it's one of the most expensive mistakes in AI development, because you don't feel the pain until it's too late.
The Hogwarts Problem
In Harry Potter, every student eventually faces their O.W.L. examinations (Ordinary Wizarding Levels). They're not a formality. They're a structured, standardized assessment designed to answer one question: do you actually know what you claim to know?
Imagine Dumbledore skipped the O.W.L.s. Students would just show up to their classes, practice a few spells, and graduate based on the professor's general impression.
Some would be genuinely skilled. Others would be dangerously underprepared. And nobody would know which was which until someone tried to cast a spell in an actual emergency.
This is exactly what most companies do with their AI.
They deploy it based on a general impression ("it seemed good in testing"), with no structured baseline, no way to measure performance over time, and no early warning system for when things quietly get worse.
Evals are the O.W.L.s. They're not the fun part of AI development. But they're what separates a tool you can trust from a tool you're hoping works.
What Evals Actually Are
An "eval" (short for evaluation) is simply a structured test you run against your AI.
You create a set of known inputs (questions, prompts, scenarios) and define what a good output looks like. Then you run your AI against that test set and score the results.
That score becomes your baseline.
Every time you change something (your prompts, your data, your model), you re-run the evals. If the score goes up, the change was an improvement. If the score drops, something broke.
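That loop is small enough to sketch in a few lines. Here's a minimal, illustrative version; the function names and the 2% tolerance are assumptions, not part of any particular framework:

```python
def run_evals(ai_fn, test_cases):
    """Run every test case through the AI and return the pass rate.

    Each case is a dict with an "input" and a "check" function that
    returns True if the AI's output is acceptable.
    """
    passed = sum(1 for case in test_cases if case["check"](ai_fn(case["input"])))
    return passed / len(test_cases)


def compare_to_baseline(score, baseline, tolerance=0.02):
    """Flag a regression if the new score falls meaningfully below baseline."""
    if score < baseline - tolerance:
        return f"REGRESSION: {score:.0%} vs baseline {baseline:.0%}"
    return f"OK: {score:.0%} (baseline {baseline:.0%})"
```

The tolerance keeps a single flaky test case from triggering a false alarm; tune it to the natural variance of your own test set.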
Simple in concept. Surprisingly rare in practice.
Why Most Teams Skip This Step
The honest answer is that evals feel like overhead.
You're already behind on building the actual product. Writing 100 test cases, defining what "correct" looks like, and building a scoring framework takes real time. It doesn't ship features. It doesn't impress stakeholders.
So teams skip it. They rely on vibes: internal testing, a few demos, general impressions.
The problem is that AI performance degrades in subtle, invisible ways.
- The model provider updates the underlying model. Your prompts no longer produce the same outputs.
- You add a new data source to your pipeline. The AI starts referencing stale or conflicting information.
- You change your system prompt to improve one use case. It quietly breaks three others you didn't think to test.
- Your user base shifts. The queries they're asking look different from the ones you tested against.
None of these failure modes announce themselves. They just silently make your AI worse.
And without evals, you find out when a customer complains, or worse, when they quietly stop using the tool.
The Four Types of Evals (And Which One You Should Start With)
1. Exact Match
The simplest form. The AI's output must match a predefined answer exactly.
Good for: SQL generation (did the AI produce the expected query?), classification (did the AI pick the correct category?), structured data extraction.
Not good for: open-ended responses where there are many valid answers.
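In code, exact match is little more than a normalized string comparison. A minimal sketch; normalizing whitespace and case is an illustrative choice, and stricter or looser normalization may suit your outputs better:

```python
def exact_match(output: str, expected: str) -> bool:
    """Pass only if the output matches the expected answer exactly,
    ignoring surrounding whitespace and letter case."""
    return output.strip().lower() == expected.strip().lower()
```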
2. Contains Match
The output must include specific information, but the exact wording doesn't matter.
Good for: customer service tools (did the AI mention the refund policy?), research tools (did the summary include all key facts?).
3. Human Rating
A human reviews the output and rates it on a scale, typically 1 to 5.
Good for: catching quality issues that are hard to define programmatically. Bad for: scale. You can't human-rate a thousand outputs per day.
4. LLM-as-Judge
You use a second AI model to evaluate the output of your first AI model. You give the evaluator model a rubric and ask it to score the response.
This is becoming the most popular approach for teams at scale. It's fast, it's consistent, and it's surprisingly effective, as long as your rubric is well-designed.
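A sketch of the shape this takes, with the rubric and criteria as placeholder examples. The `call_judge_model` argument stands in for whatever provider-specific function sends a prompt to your evaluator model and returns its text reply:

```python
RUBRIC = """Score the RESPONSE from 1 to 5 against these criteria:
- 5: factually correct, complete, and directly answers the QUESTION
- 3: mostly correct but missing details or slightly off-topic
- 1: incorrect, irrelevant, or misleading
Reply with the number only."""


def judge_response(question, response, call_judge_model):
    """Ask a second model to score a response against the rubric.

    `call_judge_model` is any function that takes a prompt string and
    returns the evaluator model's text reply (provider-specific).
    """
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nRESPONSE: {response}"
    return int(call_judge_model(prompt).strip())
```

In practice you'd also guard against the judge replying with something other than a bare number, but the core idea is just rubric plus second model plus parsed score.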
For most teams just getting started: begin with Contains Match. It's simple, it's automatable, and it catches the most common failure modes.
A Practical Example
Say you've built an AI tool that answers customer questions about your return policy. Here's what a basic eval looks like:
| Input Question | Must Contain | Must NOT Contain | Pass? |
|---|---|---|---|
| How long do I have to return an item? | "30 days" | "60 days", "no limit" | ✓ Pass |
| Can I return a final sale item? | "final sale", "cannot be returned" | "yes", "no problem" | ✗ Fail |
Run 50–100 of these. Calculate your pass rate. That's your eval score.
Now you have a number. Next time you make a change, you run the evals again. If the score was 91% and it drops to 78%, something broke. If it climbs to 95%, the change worked.
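The table above translates almost directly into code. A minimal contains-match runner; `answer_question` is a stand-in for your actual AI call, not a real API:

```python
# Each case mirrors a row of the eval table above.
TEST_CASES = [
    {
        "question": "How long do I have to return an item?",
        "must_contain": ["30 days"],
        "must_not_contain": ["60 days", "no limit"],
    },
    {
        "question": "Can I return a final sale item?",
        "must_contain": ["final sale", "cannot be returned"],
        "must_not_contain": ["yes", "no problem"],
    },
]


def passes(output, case):
    """Contains-match: every required phrase present, no forbidden phrase."""
    text = output.lower()
    return (all(p.lower() in text for p in case["must_contain"])
            and not any(p.lower() in text for p in case["must_not_contain"]))


def eval_score(answer_question, cases):
    """Run every case through the AI and return the pass rate."""
    return sum(passes(answer_question(c["question"]), c) for c in cases) / len(cases)
```

One caveat of naive substring matching: a forbidden phrase like "yes" will also match inside words like "yesterday," so short phrases may need word-boundary matching as your test set grows.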
You've gone from flying blind to flying with instruments.
Three Signs You Need Evals Right Now
- You've changed your prompts more than twice. Every prompt change is a bet. Without evals, you don't know if it paid off.
- You've onboarded a major new user group. Different users ask different questions. Your original test cases may not represent them at all.
- You've connected a new data source. New data introduces new inconsistencies. Evals catch them before your users do.
What "Good" Looks Like
A mature eval setup for a business AI tool looks like this:
- A test set of 100–200 curated inputs covering your most important use cases
- Clear, documented criteria for what a passing output looks like
- An automated scoring script that runs every time the tool is updated
- A dashboard showing your eval score over time
- An alert that fires if the score drops below a defined threshold
You don't need all of this on day one. Start with the test set and the scoring criteria. The rest can be built incrementally.
The goal isn't perfection. The goal is visibility.
Because right now, the only feedback loop most AI teams have is: did someone complain?
That's not a feedback loop. That's a fire alarm. And fire alarms are not a quality assurance strategy.
Final Thoughts
Harry Potter didn't just practice spells in his room and hope for the best. He had structured assessments. He had professors who pushed back. He had a system designed to surface what he didn't know before it mattered.
Most AI tools go to production with none of that.
They're shipped based on demos, intuition, and a general sense that "it worked when we tested it."
Evals are the unglamorous infrastructure that changes this. They don't make your AI smarter. They make you smarter about your AI.
And in a space where the stakes are getting higher every month, that visibility is not optional.
It's the difference between an AI tool your organization trusts, and one it tolerates until something goes wrong.
