
“The secret of life is honesty and fair dealing. If you can fake that, you’ve got it made.”
– Groucho Marx
For as long as I’ve been in the software testing business, a consistent myth I’ve encountered is that developers and testers share the same mindset. Usually this is accompanied by the view that testing is just an “activity as part of development,” and therefore developers can do as good a job as testers at it – and, in many cases, are positioned to do a better job.
Ironically, the reality is that the people most confident in their ability to evaluate a system’s quality are usually the least able to do so with any objectivity. In just about every other discipline of engineering this isn’t a hot take or viewed as a criticism. It’s human nature.
So today, as AI has entered our daily lives and in the face of the enormity of the task of testing artificial intelligence systems, the idea of using LLMs to judge their own output has become mainstream. But pretending that developers alone can test, or that limited “human in the loop” checking is enough, has gone from a novel, optimistic opinion to borderline reckless.
The 2019 paper Developer Testing in the IDE: Patterns, Beliefs, and Behavior reported a large-scale field study of over 2,000 software engineers across 2.5 years and four IDEs, and the results were pretty straightforward: developers overestimated how much testing they did, misjudged the scope of that testing, and displayed a large gap between their perception of their testing and what they actually tested.
The findings won’t come as a surprise to anyone who has professionally built software or been personally responsible for delivering systems with real consequences:
- half of the developers didn’t test at all
- most development sessions ended with zero testing
- a quarter of the test cases were responsible for three quarters of all failures
- TDD is not widely practiced
- developers spent 25% of their time writing tests – but thought they spent half
As I (and lots of other professional testers) have written before, the problem lies with confirmation bias. Most people don’t even think about it: when you’re creating something, you’re too busy unconsciously looking to see if it works the way you intended. And when you test it, you’re not testing for failure or for ways it might NOT work – you’re seeking validation.
Independent testers, if they’re doing their job right, should be professional skeptics. They look at the exact same system and hunt for ambiguity, known and unknown risks, and confusing behavior to investigate. It’s an entirely different mindset, not one you can simply “context switch” in and out of, because it’s deeply rooted in experience, mission, and the mental models we use to navigate the system under test.
So that brings us to the super-charged confirmation bias machines called LLMs and their role in testing and evaluating generative AI.
The 2021 paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? carries a warning for anyone who thinks AI should be used as a judge of correctness in testing. As the authors document, LLMs can’t evaluate what’s true; they predict patterns that resemble truth. And however convincing they may seem, they don’t understand and they don’t have empathy – they only reassemble data.
And the larger these models get, the harder they are to test – yet they are increasingly used to evaluate code, designs, architecture, and even defects. Taken in isolation, this risk could be mitigated through scale, but that’s not what is happening. The pressure to show business ROI from AI, coupled with a “continuous everything” culture on steroids, means teams are effectively outsourcing their front-line critical thinking to a confirmation bias monster parrot.
And LLMs are perfectly designed to amplify those pressures and biases.
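
To make concrete what that outsourcing looks like, here is a minimal sketch of the “LLM as judge” pattern as it commonly shows up in evaluation pipelines. None of the names below belong to a real library – call_llm, the rubric, and the prompts are hypothetical placeholders – but the shape of the loop is the point: the same kind of model that produced the output is asked to grade it.

```python
# A minimal, hypothetical sketch of the "LLM as judge" evaluation pattern.
# call_llm() is a stand-in for whatever chat-completion API a team happens
# to use; the rubric and prompts are illustrative only, not a real product.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model provider."""
    # In a real pipeline this would call a provider's API; here it returns a
    # canned response so the shape of the loop can be run end to end.
    return "Score: 5/5 - the answer looks correct and complete."

def generate_answer(question: str) -> str:
    # The system under test: an LLM produces the answer.
    return call_llm(f"Answer the following question:\n{question}")

def judge_answer(question: str, answer: str) -> str:
    # The "judge": frequently the same model (or a close sibling) grading
    # output produced by the very patterns it was trained on.
    rubric = (
        "You are an impartial evaluator. Score the answer from 1 to 5 for "
        "correctness and completeness, then justify the score briefly."
    )
    return call_llm(f"{rubric}\n\nQuestion: {question}\n\nAnswer: {answer}")

def evaluate(questions: list[str]) -> list[tuple[str, str, str]]:
    # The whole "evaluation": generation and judgment share the same blind
    # spots, so agreement here is confirmation, not verification.
    results = []
    for question in questions:
        answer = generate_answer(question)
        verdict = judge_answer(question, answer)
        results.append((question, answer, verdict))
    return results

if __name__ == "__main__":
    for question, answer, verdict in evaluate(["Is this release ready to ship?"]):
        print(verdict)
```

Nothing in that loop stands outside the model’s own picture of the world, which is exactly why it feels like testing while delivering validation.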
Testing LLMs, and systems that are AI native or AI enabled, must do more than simulate developer validation or fill the gaps left by their testing; it must provide critical, adversarial analysis of the system under test. At the risk of being accused of my own confirmation bias, I believe the future of software quality and testing will be shaped by skilled testers who can challenge assumptions and articulate business risk, not just parrot back what makes us feel good about our work.
“Does it work?” needs to be replaced by “What if it goes wrong?”
I don’t yet know exactly how to answer the question of where humans should sit in the evaluation of LLMs at the scale they are being deployed and relied upon, but I do know that how we answer that question is quickly becoming the most important differentiator in testing – and probably all of technology.