Your AI system is running in production.

Users are hitting it. Outputs are being generated. Dashboards show green.

And you have no reliable way to know whether any of it is actually working.

This is the evaluation problem. And it is far more widespread than the industry admits.

MIT Project NANDA surveyed over 300 enterprise AI initiatives in 2025. They defined success as systems that deliver sustained productivity gains and documented P&L impact — verified by end users and executives. By that standard, 95% of organizations deploying generative AI saw zero measurable return. Not low return. Not disappointing return. Zero.

The McKinsey Global AI Survey puts the ROI failure rate at 73%. RAND Corporation's analysis found 80.3% of AI projects fail to deliver their intended business value. What is striking is not the numbers themselves — it is what sits underneath them. MIT Sloan found that 61% of enterprise AI projects were approved on projected value that was never formally measured after deployment.

Read that again. Sixty-one percent of funded AI projects had no measurement framework at the point of approval.

The industry has spent enormous energy debating model selection, prompt design, and deployment infrastructure. Almost none of that energy goes toward the question that determines whether any of it matters: how do you actually know if your system is working?

The Difference Between a System That Runs and a System That Works

Here is the core confusion that drives the evaluation problem.

A system that runs is one where the infrastructure is up, the API calls complete, the outputs return without errors, and the dashboards show green. Most engineering teams treat this as the finish line. It is not even the starting line.

A system that works is one where the outputs are accurate, relevant, grounded in retrieved context, appropriately calibrated for uncertainty, and producing the business outcome the system was built to produce. These are not the same thing, and a running system tells you nothing about whether it is a working system.

The gap between the two is where enterprise AI money disappears. Deloitte reports that the average sunk cost per abandoned AI initiative in 2025 was $7.2 million. The median time to abandonment was 11 months — meaning organizations discovered the system did not work almost a year after they started. In most of those cases, the system was "running" the entire time. The logs looked fine. The latency was acceptable. The outputs were coherent-looking text.

They just were not producing the right outputs. And no one had built the instrumentation to detect that until it was expensive to fix.

Why Most Teams Measure the Wrong Things

The evaluation problem has a specific character that makes it persistent. It is not that teams skip measurement entirely. It is that teams measure the things that are easy to measure and mistake those for the things that matter.

The most common proxy metric in enterprise AI is response quality as judged by the development team. Engineers review a sample of outputs during development, conclude the system is working well, and ship it. This measurement approach has two fatal flaws.

The first flaw is distribution shift. The outputs that look good to engineers during development are outputs generated from inputs the engineers anticipated. Production users hit the system with inputs that were not anticipated — different phrasings, edge cases, adversarial queries, inputs from workflows the system was not designed for. The evaluation that happened during development tells you nothing about how the system behaves on those inputs.

The second flaw is that "looks good" is not a business outcome. A clinical documentation agent that produces fluent, well-structured notes that contain subtle factual errors looks good. A contract review agent that misses a key indemnification clause because the clause was phrased unusually looks good. A financial analysis agent that grounds its reasoning in slightly stale data looks good. None of them are working.

LangChain's 2026 State of AI Agents report found that quality is the top barrier to agent deployment, cited by 32% of organizations. What "quality" means in that context is not model capability — it is the inability to measure output quality consistently enough to trust the system for production workloads. You cannot deploy reliably what you cannot measure reliably.

The Three Layers of Evaluation That Enterprise Teams Skip

Production-grade AI evaluation is not a single check. It is a stack of measurement disciplines, and most enterprise teams are missing at least two of the three layers.

Last week I covered the orchestration trap, the architectural failure that collapses agent pipelines before they ever reach production. The orchestration trap is where systems fail structurally. The evaluation problem is where they fail silently. Both need to be solved. Neither can substitute for the other.

A system that runs and a system that works are not the same thing. Most teams only measure the first.

Layer 1: Offline evaluation before deployment.

This is the layer most teams implement in some form, but usually incompletely. Offline evaluation means testing the system against a curated dataset before any production traffic touches it. The goal is to establish a performance baseline across the metrics that matter for your specific use case — not generic benchmarks, but use-case-specific metrics.

For a RAG system, that means measuring faithfulness (does the output reflect what the retrieved context actually says), contextual recall (did retrieval surface the relevant information), and hallucination rate. For an agent system, it means measuring task completion rate across a representative set of tasks — including the failure cases, the edge cases, and the adversarial inputs that will appear in production.
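The shape of that offline layer can be sketched in a few lines. This is a toy harness, not a real library API — the token-overlap judges below stand in for the LLM- or retrieval-backed judges you would use in practice, and the `EvalCase` fields and pass threshold are illustrative assumptions.

```python
# Minimal offline-evaluation sketch: score a curated dataset on faithfulness
# and contextual recall before any production traffic touches the system.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    retrieved_context: str
    answer: str
    reference: str  # ground-truth answer for this case

def faithfulness(case: EvalCase) -> float:
    """Toy judge: fraction of answer tokens grounded in the retrieved context."""
    ctx = set(case.retrieved_context.lower().split())
    toks = case.answer.lower().split()
    return sum(t in ctx for t in toks) / max(len(toks), 1)

def contextual_recall(case: EvalCase) -> float:
    """Toy judge: fraction of reference tokens that retrieval actually surfaced."""
    ctx = set(case.retrieved_context.lower().split())
    toks = case.reference.lower().split()
    return sum(t in ctx for t in toks) / max(len(toks), 1)

def run_offline_eval(cases: list[EvalCase], threshold: float = 0.7) -> dict:
    """Average each metric over the dataset and return a baseline verdict."""
    faith = sum(faithfulness(c) for c in cases) / len(cases)
    recall = sum(contextual_recall(c) for c in cases) / len(cases)
    return {"faithfulness": faith, "contextual_recall": recall,
            "passed": faith >= threshold and recall >= threshold}
```

The point is not the scoring mechanics — it is that the dataset of `EvalCase`s is built from your corpus and your tasks, so the baseline it establishes is one production performance can be compared against.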

The failure mode here is using generic benchmarks as proxies for production readiness. A model that scores well on MMLU tells you something about general reasoning capability. It tells you nothing about whether your RAG pipeline retrieves the right documents for your specific corpus, or whether your agent completes the specific tasks your users will actually request.

Stanford's research demonstrates that systematic evaluation reduces production failures by up to 60%. The teams implementing that reduction are not running generic benchmarks — they are building evaluation datasets that reflect their actual production conditions.

Layer 2: Continuous evaluation in production.

This is the layer almost no one implements well, and it is the one that matters most.

Production systems degrade. Models update. Data distributions shift. New user behaviors emerge that the system was not designed for. A system that evaluated clean at deployment will not necessarily evaluate clean six months later, and without continuous measurement in production, you will not know it has degraded until users or regulators tell you.

Continuous production evaluation requires three things. Real-time monitoring of the metrics established in offline evaluation, so drift is caught immediately rather than retrospectively. Sampling strategies that surface representative production outputs for human review — because automated metrics have blind spots that human judgment catches. And alerting thresholds that trigger investigation before degradation becomes failure.

Most teams have production logging. Almost none have production evaluation. These are not the same thing. Logs tell you what happened. Evaluation tells you whether what happened was right.
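The difference is concrete: a drift monitor evaluates each production output against the offline baseline and fires before degradation becomes failure. A minimal sketch, assuming per-output quality scores already exist — the window size and tolerance below are illustrative, not recommended values.

```python
# Minimal continuous-evaluation sketch: compare a rolling window of
# production scores against the baseline set at deployment, and alert
# when the rolling average drops below tolerance.
from collections import deque

class DriftMonitor:
    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline          # metric value established offline
        self.tolerance = tolerance        # allowed drop before alerting
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Record one production score; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False                  # not enough data for a stable average
        current = sum(self.scores) / len(self.scores)
        return current < self.baseline - self.tolerance
```

Logs would record the same scores and sit silent. The monitor is what turns them into the "whether what happened was right" signal.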

Layer 3: Business outcome measurement.

This is the layer that the entire AI investment thesis depends on, and the layer that MIT Sloan found missing in 61% of funded projects.

Technical evaluation metrics — hallucination rates, faithfulness scores, task completion rates — are necessary. They are not sufficient. The system exists to produce a business outcome: faster contract review, more accurate clinical documentation, fewer escalations in customer service, measurable reduction in analyst time on specific workflows.

If you cannot draw a direct line from your technical evaluation metrics to your business outcome metrics, you do not have a measurement framework. You have instrumentation that tells you the system is running.

Business outcome measurement requires defining the outcome metric before the system is built. What changes if this system works? How is that change measured? What is the baseline? What is the target? These questions need answers at the point of project approval, not retrospectively when the CFO asks why the investment has not paid off. The 54% project success rate reported by RAND for teams with clear pre-approval metrics versus 12% for teams without them is not a coincidence. It is the measurement discipline working exactly as it should.
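Those four questions fit in a structure small enough to attach to the approval document itself. The field names and example numbers below are illustrative assumptions, not a standard schema.

```python
# Minimal sketch of an outcome metric defined at approval time:
# name, unit, measured baseline, and committed target.
from dataclasses import dataclass

@dataclass(frozen=True)
class OutcomeMetric:
    name: str        # what changes if the system works
    unit: str        # how that change is measured
    baseline: float  # measured before deployment
    target: float    # committed at approval, not retrofitted

    def attained(self, observed: float) -> bool:
        """True once observed performance meets or beats the target."""
        if self.target < self.baseline:   # a reduction target (e.g. minutes)
            return observed <= self.target
        return observed >= self.target    # an increase target (e.g. completion %)

# Illustrative example: analyst time on contract review, committed before build.
review_time = OutcomeMetric("contract review time", "minutes",
                            baseline=45.0, target=30.0)
```

If a project cannot fill in those four fields, it is being approved on projected value that will never be checked — which is exactly the 61% pattern.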

What This Looks Like in Regulated Industries

In FinTech, Healthcare, and Legal, the evaluation problem has an additional dimension that enterprise teams often treat as separate from the measurement conversation. It is not.

Regulatory compliance in these industries requires that AI system outputs be explainable, auditable, and consistent within defined tolerance thresholds. That is not just a governance requirement. It is an evaluation requirement.

A credit decisioning system needs to demonstrate that its output quality is consistent across demographic groups — not just overall. A clinical documentation system needs to show that its error rate on specific clinical entity types falls within defined bounds. A legal research system needs to demonstrate groundedness to source material at a level that a supervising attorney can verify.

These evaluation requirements are stricter, more specific, and more consequential than the evaluation requirements most teams design for. The organizations getting this right in 2026 are the ones that designed their evaluation frameworks with regulatory requirements as first-class inputs — not as a compliance layer bolted on after the fact.

Apple suspended its AI news summary feature in January 2025 after generating misleading headlines. Air Canada was held legally liable after its chatbot provided false refund information. CNET published finance articles riddled with AI-generated errors. In each case, the system was running. In each case, no one had the measurement infrastructure to catch the failure mode before it became a public or legal problem.

In regulated industries, that sequence is not just expensive. It is existential.

Building an Evaluation Stack That Actually Works

The practical question is not whether to evaluate — everyone agrees evaluation is necessary. The practical question is how to build a measurement stack that is rigorous enough to catch real problems without being so heavy that teams abandon it.

Start with your production failure modes, not generic metrics. What specific ways can this system fail in your production context? Build evaluation cases around those failure modes. A RAG system deployed for legal research fails differently than a RAG system deployed for customer service. Generic benchmarks will not catch the legal-specific failure modes.

Separate your evaluation stack from your observability stack. Observability tells you the system is running. Evaluation tells you the system is working. They need different instrumentation, different alerting, and different human review processes. Conflating them is where most teams lose the signal they need.

Build human evaluation into the loop, not as a replacement for automated metrics but as a complement to them. Automated metrics catch the measurable failure modes. Human reviewers catch the subtle ones — the output that is technically faithful to the retrieved context but contextually wrong for the task, the response that scores well on relevancy but misses what the user actually needed. LangChain's 2026 report notes that 57% of organizations have agents in production. The ones with reliable quality measurement are the ones that combined automated and human evaluation from the start.

Instrument for drift from day one. Production systems change. Set baseline metrics at deployment. Alert when those baselines move. The teams that discover quality degradation through user complaints or audit findings have already lost months of production output.

What Changes in the Next 18 to 36 Months

Evaluation is moving from a deployment checkpoint to continuous infrastructure. The tooling is maturing — DeepEval, Arize, Langfuse, Maxim AI, and RAGAS are each solving real pieces of this stack. Regulatory pressure, particularly from the EU AI Act and emerging US state regulations, is making evaluation a compliance requirement rather than an engineering best practice.

The organizations that build rigorous evaluation infrastructure now are building something more durable than competitive advantage in AI outputs. They are building the organizational capability to know when their systems are working and when they are not — which is the prerequisite for everything else.

The teams that skip this step will keep discovering the same thing: that a system running is not a system working. They will keep finding out 11 months in and $7 million later.

The evaluation problem is solvable. But it requires treating measurement as architecture — designed upfront, maintained continuously, and tied directly to the business outcomes the AI investment is meant to produce.

You cannot govern what you cannot measure. You cannot improve what you cannot measure. And you cannot trust what you cannot measure.

Build the measurement infrastructure before you build anything else.

Found this useful? Forward it to anyone in your organization who approved an AI project without a measurement framework. They need this more than you do.

New here? Every Tuesday, The Deployment Layer publishes one deep-dive on enterprise AI architecture, agent systems, and responsible AI governance. Subscribe free at thedeploymentlayer.com

Running AI in production? I want to know what your evaluation stack actually looks like. Hit reply — the real answers shape future issues.

I am Gauri, a senior AI leader focused on enterprise AI strategy, LLM architecture, and Responsible AI governance. LinkedIn | X | Medium

What does your production evaluation stack actually look like — and what failure mode did it catch that saved you? Or didn't catch, and cost you? I read every response.
