Evaluating AI Systems

You cannot improve what you cannot measure, but measuring LLM quality is genuinely hard. Automated benchmarks measure proxies rather than what users care about. Human eval is expensive and slow. LLM-as-judge is scalable but biased. Regression testing catches quality drops between releases, but only for the cases in your test set. The answer is all four, used together.

Course: Advanced.

This lesson covers 5 concepts: Why Eval is Hard, Public Benchmarks, LLM-as-Judge, Human Eval, Regression Testing.

Why Eval is Hard

Evaluating LLM outputs is hard because quality is multidimensional, context-dependent, and often subjective. Automated metrics (length, BLEU, perplexity) often reward the wrong things: a verbose response can outscore a concise one even when the concise one is better.

The history of ML is full of metric gaming: systems that score highly on benchmarks but fail in deployment. LLM eval is the latest battleground for this problem.

Measuring an LLM is harder than measuring a chess engine or a spam filter, where there is a clear right answer. LLM quality depends on the task, the user, and the context.

Consider a one-word factual question. Response A: "Paris." Response B: 3 sentences about Paris's cultural heritage. Scored against a verbose reference answer, B comes out higher on BLEU-style overlap, semantic similarity, and length metrics. A is what most users actually want. This is the eval problem in one example.
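The effect can be reproduced with a toy overlap metric. This is a minimal sketch, not BLEU itself: unigram recall stands in as a simplified overlap score, and the reference and both responses are made-up illustrations.

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of distinct reference words that also appear in the candidate."""
    ref = set(tokens(reference))
    cand = set(tokens(candidate))
    return len(ref & cand) / len(ref) if ref else 0.0

# Hypothetical verbose reference answer and two candidate responses.
reference = "Paris is the capital of France and a center of cultural heritage."
response_a = "Paris."
response_b = (
    "Paris is the capital of France. It is famous for its cultural heritage. "
    "Millions visit each year."
)

score_a = unigram_recall(response_a, reference)
score_b = unigram_recall(response_b, reference)
print(f"A: {score_a:.2f}  B: {score_b:.2f}")
```

The longer response wins simply because it covers more of the reference's words, which says nothing about which answer the user wanted.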

Public Benchmarks

Public benchmarks measure specific capabilities: MMLU measures knowledge breadth, GSM8K measures grade-school mathematical reasoning, HumanEval measures code generation. Their relevance to your use case varies: MT-Bench (0.82) may predict chat quality better than HellaSwag (0.49) for a conversational product, reading the figures in parentheses as each benchmark's correlation with human chat preferences.

Select benchmarks that correlate with your actual use case. For a coding assistant: HumanEval. For a customer service bot: MT-Bench. For a medical application: MedQA, MedMCQA.
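One way to check whether a benchmark tracks your use case is to correlate public scores with your own task scores across several candidate models. The sketch below assumes you already have an in-house score per model; the benchmark names and all numbers are hypothetical.

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical in-house task scores for five candidate models (same order below).
in_house = [0.61, 0.70, 0.74, 0.82, 0.88]

# Hypothetical public benchmark scores for the same five models.
benchmarks = {
    "benchmark_x": [6.9, 7.4, 7.9, 8.3, 8.9],
    "benchmark_y": [0.80, 0.72, 0.85, 0.78, 0.83],
}

# Rank benchmarks by how well they track the in-house scores.
ranked = sorted(
    ((pearson(scores, in_house), name) for name, scores in benchmarks.items()),
    reverse=True,
)
for r, name in ranked:
    print(f"{name}: r = {r:.2f}")
```

With only a handful of models the estimate is noisy, so treat the ranking as a hint about which benchmarks to watch, not as proof.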

Benchmark scores are a starting point, not an answer. A model with high MMLU may still fail at your customer support task — because MMLU does not test customer support.

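GPT-4 on MMLU: 86.4%. Gemini Ultra on MMLU: 90.0%. The two figures were reported under different prompting setups (5-shot for GPT-4, chain-of-thought with 32 samples for Gemini Ultra), so they are not a like-for-like comparison. Does this mean Gemini Ultra is better for your use case? Only if your use case is multiple-choice trivia. Real performance requires task-specific evaluation.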

LLM-as-Judge

LLM-as-judge uses a strong model (such as GPT-4) to evaluate outputs from weaker models on explicit rubrics. It reaches 80%+ agreement with human evaluators at a small fraction of the cost (about 50× cheaper at $0.01 versus $0.50 per evaluation), which makes it the standard scalable eval technique for production AI.
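A rubric-based judge can be sketched as a prompt template plus a strict score parser. Everything here is an assumption for illustration: the rubric wording, the `SCORE:` output convention, and `call_model`, which stands for whatever function sends a prompt to your judge model and returns its text reply.

```python
import re
from typing import Callable, Optional

# Hypothetical helpfulness rubric; adapt the criteria to your product.
RUBRIC = """Rate the answer for helpfulness on a 1-5 scale.
5 = directly and correctly answers the question with no padding
3 = partially answers, or buries the answer in irrelevant text
1 = wrong, evasive, or off-topic
Explain briefly, then finish with a line of the form: SCORE: <1-5>"""

def build_judge_prompt(question: str, answer: str) -> str:
    """Combine the rubric with the item under evaluation."""
    return f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n"

def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score, or None if the reply does not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None

def judge(question: str, answer: str, call_model: Callable[[str], str]) -> Optional[int]:
    """Score one answer using the supplied model-calling function."""
    return parse_score(call_model(build_judge_prompt(question, answer)))

# Stand-in for a real model call so the sketch runs offline.
def fake_model(prompt: str) -> str:
    return "The answer is direct and correct.\nSCORE: 5"

print(judge("What is the capital of France?", "Paris.", fake_model))
```

Returning `None` for malformed replies, instead of guessing, keeps parsing failures visible so they can be counted separately from low scores.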

LLM-as-judge enables continuous evaluation at scale: sample 5% of production queries, evaluate automatically, alert on metric drops. This is impractical with human eval at production volume.
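The sample-evaluate-alert loop might look like the following sketch. The record layout, the `score_fn` callback (standing in for a judge call), the baseline, and the tolerance are all hypothetical; hash-based sampling is one reasonable choice because the same query ID is always either in or out of the sample.

```python
import hashlib
from statistics import mean
from typing import Callable

def in_sample(query_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of queries by hashing the ID."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

def monitor(
    records: list[dict],
    score_fn: Callable[[dict], float],
    baseline: float,
    tolerance: float = 0.02,
    rate: float = 0.05,
) -> dict:
    """Score the sampled records and flag a drop below baseline - tolerance."""
    scores = [score_fn(r) for r in records if in_sample(r["id"], rate)]
    if not scores:
        return {"sampled": 0, "mean": None, "alert": False}
    current = mean(scores)
    return {
        "sampled": len(scores),
        "mean": current,
        "alert": current < baseline - tolerance,
    }

# Hypothetical traffic: 10,000 logged queries with a precomputed judge score.
records = [{"id": f"query-{i}", "score": 0.7} for i in range(10_000)]
report = monitor(records, score_fn=lambda r: r["score"], baseline=0.8)
print(report)
```

In a real pipeline `score_fn` would call the judge model, and the alert would feed a paging or dashboard system instead of a print statement.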

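Human eval is the gold standard but slow and expensive. Automated metrics are fast but measure the wrong things. LLM-as-judge is the practical middle ground for production systems. Its known biases (favoring the first-listed answer, longer answers, and its own outputs) mean its scores should be spot-checked against human labels.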

GPT-4 as judge on a helpfulness rubric: 82% agreement with human preference labels. At $0.01 per eval vs $0.50 for human annotation: 50× cheaper while maintaining most of the signal.
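The two numbers in this example are simple to compute yourself. The sketch below uses made-up label lists arranged to reproduce 82% agreement, the per-evaluation prices quoted above, and a hypothetical monthly volume.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Share of items where the judge picked the same label as the human."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical labels: 50 pairwise preferences, judge disagrees on 9 of them.
human_labels = ["A"] * 50
judge_labels = ["A"] * 41 + ["B"] * 9
agreement = agreement_rate(judge_labels, human_labels)

# Prices from the example; work in cents to avoid float rounding.
JUDGE_COST_CENTS = 1    # $0.01 per eval
HUMAN_COST_CENTS = 50   # $0.50 per human annotation
cost_ratio = HUMAN_COST_CENTS / JUDGE_COST_CENTS

# Hypothetical volume, to make the difference concrete.
evals_per_month = 10_000
judge_monthly_usd = evals_per_month * JUDGE_COST_CENTS / 100
human_monthly_usd = evals_per_month * HUMAN_COST_CENTS / 100

print(f"agreement: {agreement:.0%}, cost ratio: {cost_ratio:.0f}x")
print(f"monthly: ${judge_monthly_usd:,.0f} (judge) vs ${human_monthly_usd:,.0f} (human)")
```

Measuring agreement on your own labeled sample matters more than the published figure, since agreement varies with the rubric and the task.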

Human Eval

Human preference evaluation (pairwise): show two model outputs side by side and ask which is better. Example result: Model A preferred 44%, Model B 31%, Tie 14%, Both bad 7% (these shares total 96%, so about 4% of comparisons are unaccounted for). Model A wins, but the 7% "both bad" rate reveals a quality floor to fix.
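Tallying pairwise judgments takes only a few lines. This sketch assumes each judgment is stored as a dict with a `query_id` and one of four labels; the field names, label strings, and sample data are hypothetical. It also reports A's win rate among decisive votes, which for the figures above is 44 / (44 + 31), about 59%.

```python
from collections import Counter

LABELS = ("A", "B", "tie", "both_bad")

def summarize(judgments: list[dict]) -> dict:
    """Compute label shares, A's decisive win rate, and the both-bad query IDs."""
    unknown = {j["label"] for j in judgments} - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(j["label"] for j in judgments)
    total = len(judgments)
    decisive = counts["A"] + counts["B"]
    return {
        "shares": {label: counts[label] / total for label in LABELS} if total else {},
        # Win rate ignoring ties and both-bad votes; None if nobody picked a side.
        "a_win_rate": counts["A"] / decisive if decisive else None,
        # These queries are the ones worth reading by hand.
        "both_bad_ids": [j["query_id"] for j in judgments if j["label"] == "both_bad"],
    }

# Hypothetical annotation results.
judgments = [
    {"query_id": "q1", "label": "A"},
    {"query_id": "q2", "label": "A"},
    {"query_id": "q3", "label": "B"},
    {"query_id": "q4", "label": "tie"},
    {"query_id": "q5", "label": "both_bad"},
    {"query_id": "q6", "label": "A"},
    {"query_id": "q7", "label": "B"},
    {"query_id": "q8", "label": "both_bad"},
]
summary = summarize(judgments)
print(summary)
```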

"Both bad" responses are the most actionable signal. If 7% of queries produce bad outputs from both models, that is a systematic capability gap — find those queries and fix the root cause.

Human evaluation is slow and expensive, but it is the ground truth. LLM-as-judge and benchmarks are proxies — when they conflict with human eval, human eval is right.

Chatbot Arena's Elo rating is built from 100,000+ human pairwise comparisons. One leaderboard snapshot: GPT-4o: Elo 1287. Claude 3.5 Sonnet: 1268. Gemini 1.5 Pro: 1254. Ratings shift as new votes and models arrive. It is the most credible public model ranking because it is user-driven, blind, and at scale.
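To read these numbers, it helps to see how an Elo gap maps to a win probability. The sketch below uses the textbook Elo formulas; the K-factor of 32 is an assumption, and the leaderboard's own fitting procedure may differ from this simple online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Ratings from the snapshot above: a 19-point gap.
p_win = expected_score(1287, 1268)
print(f"expected win rate for the higher-rated model: {p_win:.1%}")

# One vote for the higher-rated model nudges both ratings.
new_a, new_b = update_elo(1287, 1268, score_a=1.0)
print(f"after one win: {new_a:.1f} vs {new_b:.1f}")
```

A 19-point gap works out to roughly a 53% expected win rate, so the top models in this snapshot are closer than the ordering alone suggests.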

Regression Testing

Regression testing runs your golden test set automatically before every model or prompt change. Any metric drop triggers a deployment block. This is the difference between an AI product that silently degrades and one that maintains quality.

Regression testing is to AI systems what unit tests are to software. The question is not "should we test?" but "how do we set up the infrastructure to test automatically?"

Without regression testing, you never know if a "fix" caused a regression somewhere else. With it, every change is measured — and bad changes are blocked before they reach users.

Prompt change: an "improved" system prompt is deployed without regression testing. Helpfulness goes up 2%, but the output format breaks 12% of the time, and the problem is caught only after user complaints. Regression testing would have caught it before deployment.
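A deployment gate for this scenario could be as small as the sketch below. It assumes every metric is "higher is better" and that golden-set metrics have already been computed for the current and candidate prompts; the metric names and baseline values are hypothetical, chosen to mirror the example.

```python
def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return a list of failures; an empty list means the change may ship."""
    failures = []
    for metric, base_value in baseline.items():
        if metric not in candidate:
            failures.append(f"{metric}: missing from candidate run")
            continue
        new_value = candidate[metric]
        # Block on any drop larger than the allowed tolerance.
        if new_value < base_value - tolerance:
            failures.append(f"{metric}: {base_value:.2f} -> {new_value:.2f}")
    return failures

# Golden-set results before and after the "improved" system prompt.
baseline = {"helpfulness": 0.80, "format_valid": 0.99}
candidate = {"helpfulness": 0.82, "format_valid": 0.88}

failures = regression_gate(baseline, candidate)
if failures:
    print("BLOCKED:", "; ".join(failures))
else:
    print("OK to deploy")
```

Checking every metric, not just the one the change was meant to improve, is what catches the format breakage here. In CI, a non-empty failure list would typically translate to a non-zero exit code.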