Why AI evals need to reflect the real world
Opinion: Rumman Chowdhury and Mala Kumar argue that we need better AI evaluations — and the infrastructure and investment to do them
Generative AI systems don’t work as intended in the real world.
Surveys show enthusiasm within enterprises for rapid experimentation with AI, alongside persistent concerns over practical issues such as reliability, quality, governance and integration. This mismatch between promise and reality has stalled deeper adoption, despite the clear potential return on investment of AI technologies.
We know one of the reasons why this disconnect is happening.
Although GPT-5 is touted as a state-of-the-art model across benchmarks that are supposed to evaluate its performance, the model can’t do basic tasks such as creating an accurate map of the United States. Similarly, last year, a coding agent called Devin, developed by the AI lab Cognition, grabbed attention with splashy autonomy demos and strong results on SWE-bench, a benchmark that is supposed to assess software development capabilities. Yet hands-on trials found Devin far less impressive: it performed unevenly, unexpectedly failed when using some software tools, and fell into slow, error-prone code loops when problems got complex. The problem in both cases isn’t poor performance, but unreliable performance.
These stories are not the exception, but the norm. Very often, AI systems and models that perform impressively against a benchmark and rank high on its leaderboard then deliver inconsistently under real-world constraints, revealing a clear gap between test success and operational reliability.
While benchmarks and their leaderboards can be valuable performance snapshots in time, they are not reflective of, nor should they be accepted as, reality. Leaderboards amplify Goodhart’s Law: when a measure becomes the target, it stops measuring what matters. It’s not surprising that leaderboards have been shown to foster cherry-picking, prompt overfitting, and metric gaming. In other words, some AI researchers create tests that make their models look better, or “train to the test,” so that “performance” comes to mean how well a model scores on these benchmark evaluations rather than what it can do in the real world.
Unsurprisingly, this leads to models that perform well on benchmarks but not under normal use. The problem is even worse if a benchmark is leaked or contaminated: models can be trained on the “test” questions or otherwise optimized to perform artificially well in the benchmark’s domain, which renders the benchmark meaningless.
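To make the contamination problem concrete, here is a minimal sketch of one common heuristic: flagging benchmark items whose word n-grams also appear in a model’s training corpus. The function names, the n-gram length and the brute-force, in-memory design are illustrative assumptions; real contamination audits are considerably more involved.

```python
# Minimal sketch: flag possible benchmark contamination via n-gram overlap.
# Names, the n-gram length and the brute-force design are illustrative only.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items: list[str],
                       training_docs: list[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training data."""
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_ngrams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

A high overlap rate does not prove that a model was trained on the test set, but it is a signal that the benchmark score should be treated with suspicion.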
Despite these known drawbacks, the AI industry has overindexed on benchmarks and their leaderboards as its primary AI evaluation mechanisms. So where do we go from here?
One parallel we can draw from is user experience (UX) research and design, which over the past few decades has evolved into a spectrum of established, scientific, data-backed methods that meet needs at any point in a product’s design, development or implementation lifecycle. UX research and design has formalized an entire industry of digital evaluations by establishing when it is appropriate to measure what, and for what purpose. Because of UX research and design, it’s commonly understood that it is not possible to install analytics on a product that doesn’t exist, and that it doesn’t make sense to generate new product ideas when the immediate goal is to improve what has already been built.
We see the current state of AI evaluations along a similar spectrum. Some methods, like red teaming, are great starting points to generate ideas and collect data about points of failure in an AI model or system. Other methods, like benchmarks, are better evaluation mechanisms for later in the product lifecycle, once it’s clear what exactly needs to be evaluated, and why.
Yet in artificial intelligence, the basic starting points of evaluation have largely been ignored in favor of benchmarks that sound impressive but are misaligned with what we need — the AI equivalent of installing useless analytics on a product. It’s unsurprising, then, that AI product rollouts result in poor, unpredictable and inconsistent performance. And it’s no wonder that enterprise pilots fail to scale. To get past these blockers, we must design, build and reward systems that complete work predictably in messy environments, rather than ones that simply ace static quizzes under lab conditions.
In other words, we must re-evaluate how we evaluate. AI evals shouldn’t be a box-ticking compliance chore, nor a showcase for how good a model is at complex math or physics problems. They should be a product testing mechanism for AI systems. Products that bake AI evals into their design and development lifecycles are best placed to actually deliver the desired outcomes. We should treat evals accordingly: test individual AI product workflows with product and engineering teams during pre-release and in production, and adjust those tests to user journeys and acceptable risk.
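To illustrate what that can look like in practice, the sketch below tests a single, hypothetical product workflow (a support-ticket triage step) against scenario cases drawn from user journeys, and gates release on a pass rate agreed with the product team. Every name and threshold here is an assumption, not a prescription.

```python
# Illustrative sketch of a workflow-level eval for a hypothetical
# support-ticket triage step; scenario cases mirror real user journeys
# and the pass-rate threshold reflects the team's acceptable risk.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str                      # e.g. "angry customer, missing order number"
    ticket_text: str               # input drawn from a real or synthetic user journey
    accept: Callable[[str], bool]  # acceptance check agreed with the product team

def run_eval(triage: Callable[[str], str],
             scenarios: list[Scenario],
             min_pass_rate: float = 0.95) -> bool:
    """Run the workflow over all scenarios; pass only if the agreed pass rate is met."""
    if not scenarios:
        return False
    failures = []
    for s in scenarios:
        output = triage(s.ticket_text)
        if not s.accept(output):
            failures.append((s.name, output))  # keep the output: *how* it failed matters
    for name, output in failures:
        print(f"FAIL [{name}]: {output!r}")
    return 1 - len(failures) / len(scenarios) >= min_pass_rate
```

The same harness can be run pre-release and again against sampled production traffic, which is where the gap between benchmark success and operational reliability tends to show up.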
After two years of conducting AI evals through our nonprofit organization, Humane Intelligence, we have learned that good evaluation requires its own infrastructure. Robust product evaluations must focus on sound experimental design, input from a wide range of users and experts to create customized testing scenarios, and data from real-world users that replicates unintended adverse scenarios. Evaluations should use that data to gather insights not only into what failures happen, but into how they occur. That’s why we have launched a new public benefit corporation, also called Humane Intelligence, through which we are building the infrastructure for the next generation of real-world AI evaluations at scale.
We know one-off benchmarks and leaderboards aren’t enough in the real world. The path forward is building robust evaluation mechanisms so that more organizations and people can meaningfully assess their AI models and systems before deploying them. Only then will enterprises — and society at large — trust AI enough to adopt it at scale. AI adoption will depend not just on building smarter models, but also on building smarter evaluations of those models.
In 2025, Rumman founded the Humane Intelligence public benefit corporation, which is building next-generation AI evaluation infrastructure for the real world. She previously served as Managing Director of Responsible AI at Accenture and Engineering Director of ML Ethics at Twitter, and co-founded Humane Intelligence, the nonprofit, in 2022.
Mala is the Interim Executive Director of Humane Intelligence, the nonprofit, which runs rigorous AI evaluations to break down barriers to AI deployment for social good. She is a former director at GitHub and worked extensively at the United Nations.
Learn more about the organizations at https://humane-intelligence.org and https://humane-intelligence.com.