
What can be asserted without evidence can also be dismissed without evidence. – Hitchens's Razor
One of the biggest mistakes organizations make when building and testing systems is measuring them badly and then making decisions based on faulty, invalid metrics. Software testing has a long, rich history of goal displacement and metric-validity problems, and GenAI evaluation is currently going through that same crucible.
There is an abundance of metrics for looking at model characteristics and performance: BLEU, ROUGE, LLM-as-a-judge scores, and human preference ratings. What's missing is any confidence that these metrics actually measure what matters.
I love the book “Measuring and Managing Performance in Organizations” by Robert D. Austin, and it’s my go-to reference when talking about validity problems in software testing metrics. Austin wrote about this decades ago: when performance measures are shallow proxies for your objectives, people optimize for the measure, not the outcome.
The same principle applies to measuring AI models. If you give a system a metric, it will learn how to win that metric whether it creates value or not, and human evaluation is required to tell the difference.
This point is echoed in a paper published by OXRML, “Measuring What Matters”, which reminds us that metrics are not reality; they are models of reality, and like all models they’re wrong, but some are useful. They also point out that precision is not the same as relevance: a precisely measured wrong thing is still the wrong thing, or what I like to call hitting the bullseye on the wrong target.
Most of the GenAI measurement models I’ve reviewed or had demoed for me seem to be making mistakes similar to the ones we make in software testing, confusing measurement with understanding. I am going to be writing and speaking a lot more about this next year, but here are my thoughts so far on what I’ve seen and used for GenAI model evaluation.
- Popular evaluation approaches like BLEU or LLM-as-a-judge feel like counting lines of code or test cases. Big numbers seem good, but they tell you very little about risk, and just as more tests don’t guarantee fewer defects, higher semantic similarity doesn’t guarantee a better answer.
- Metrics that require a golden “correct” output remind me of my experience with decades of test automation frameworks optimizing for green lights and consistency instead of correctness, potentially missing all the nuance in observation.
- Another similar mistake I see is measuring for form, which has the same effect as counting how many bugs were “closed” rather than how many were fixed. A hallucination can score better than a correct answer, which aligns perfectly with Daniel Kahneman’s point about the power of storytelling in “Thinking, Fast and Slow”.
- Most evaluation models like to use some sort of aggregation for scoring, but my concern is that averages can hide risks, a problem outlined perfectly in “How Complex Systems Fail”. Safety issues are masked by thousands of tests, and the system looks stable right up until it isn’t. The short sketch after this list illustrates both of these traps.
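To make that concrete, here is a minimal, self-contained sketch. It uses a crude token-overlap F1 as a stand-in for BLEU/ROUGE-style scoring; the reference answer, the candidate answers, and the per-item score list are all invented for illustration, so the point is the pattern, not the numbers.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a crude stand-in for BLEU/ROUGE-style similarity."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the service outage was caused by an expired tls certificate"
wrong_but_similar = "the service outage was caused by an expired api key"
right_but_reworded = "a tls cert that lapsed took the service down"

# The factually wrong answer wins on overlap; the correct paraphrase loses.
print(unigram_f1(wrong_but_similar, reference))   # ~0.80
print(unigram_f1(right_but_reworded, reference))  # ~0.32

# Aggregation hides the risk: one catastrophic answer disappears into the mean.
per_item_scores = [0.94, 0.91, 0.96, 0.93, 0.02, 0.95, 0.92, 0.90]
print(sum(per_item_scores) / len(per_item_scores))  # ~0.82 -- looks healthy
print(min(per_item_scores))                         # 0.02 -- the real story
```

Nothing in those printouts tells you which answer is actually right or which interaction went badly wrong, which is exactly the point.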
Despite the testing industry taking a hard turn away from human evaluation, I believe it is going to be a competitive advantage, but it doesn’t necessarily solve these problems. Human testing suffers from the same ambiguity, bias, and inconsistency, and it requires oversight and careful consideration when deploying systems.
While the GenAI measurement market matures, my advice right now is to use multiple metrics, each tied to a specific question, and to validate your tests and metrics against real outcomes, not just benchmarks. A healthy dose of humility in our business would do wonders as well, so we don’t repeat the mistakes of the past, where invalid software testing metrics left us failing to deliver ROI to the business.
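Here is a minimal sketch of what “one metric per question” could look like in practice. The record fields (contains_citation, human_corrected), the thresholds, and the example answer are all hypothetical, not a prescribed schema; the idea is simply that each number stays attached to the question it is supposed to answer and to a real outcome, such as whether a human had to rewrite the response before using it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MetricCheck:
    """Pair a metric with the specific question it is supposed to answer."""
    question: str                    # what we actually want to know
    metric: Callable[[dict], float]  # how we score a single interaction
    threshold: float                 # what "good enough" means for this question


# Hypothetical per-interaction record: the model output plus a real-world
# outcome signal (did a human need to correct the answer before using it?).
interaction = {
    "answer": "Restart the ingest worker, then replay the dead-letter queue.",
    "contains_citation": True,
    "human_corrected": False,
}

checks = [
    MetricCheck(
        question="Did the answer cite a source the user can verify?",
        metric=lambda r: 1.0 if r["contains_citation"] else 0.0,
        threshold=1.0,
    ),
    MetricCheck(
        question="Was the answer usable without a human rewriting it?",
        metric=lambda r: 0.0 if r["human_corrected"] else 1.0,
        threshold=1.0,
    ),
]

# Report each question separately instead of collapsing everything into one score.
for check in checks:
    score = check.metric(interaction)
    status = "PASS" if score >= check.threshold else "FLAG FOR HUMAN REVIEW"
    print(f"{check.question} -> {score:.1f} ({status})")
```

None of this replaces human evaluation; it just keeps the scoreboard honest about what each number is actually claiming.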