Strategy & Governance · 7 min read

Evaluations are Table Stakes. Outcomes are Not.

Six months ago, "we run rigorous AI evaluations" was a defensible thing for an AI-services firm to say. Today every major enterprise platform ships built-in evaluators by default, and even custom evaluators only measure whether the model is behaving. None of them measures whether the workflow is still moving the KPI you bought it to move. That is the layer above evals, and it is the one most enterprise AI work skips.

Published May 4, 2026 by David Suydam

If you are nine months into an enterprise AI pilot, you have probably seen this pattern. The vendor's monthly review walks through evaluation dashboards. Relevance scores look healthy. Hallucination rates are flat. Tool-call accuracy is up and to the right. Safety filters caught everything they should have. The deck closes with a recommendation to expand to the next workflow. Then you open the operations report on the same workflow and the cost-per-resolution has crept back to where it was before the pilot started. The handle time is roughly the same. The KPI you actually bought the AI to move has not moved. The model is behaving. The business is not winning.

Six months ago, "we run rigorous AI evaluations" was a defensible thing for a firm like ours to say. It is not anymore. Microsoft Azure AI Foundry, AWS Bedrock, and Google's Gemini Enterprise Agent Platform now ship built-in evaluators by default. Relevance, safety, coherence, tool-call accuracy, groundedness. The catalogue is documented and on the platforms operators are already buying. The Bedrock AgentCore Evaluations service went generally available in March. The Gemini Enterprise Agent Platform absorbed the Vertex AI evaluation surface into a unified agent platform on April 22nd. Azure Foundry's built-in evaluator catalogue is the most exhaustive of the three. Across roughly two quarters, the category-level claim "we do evals well" stopped distinguishing one AI-services firm from another.

That shift is not the interesting part. The interesting part is what the shift makes visible. The evaluator catalogue, however good it gets, is measuring whether the model is working. It is not measuring whether the operator's workflow is still moving the number it was bought to move. Those are two different questions, and the one operators are paying for is the second one.

Where custom evaluators genuinely earn their keep

Before going further: the easy and wrong reading of the platform-parity fact is that built-in evaluators are enough and custom evaluators are noise. That is not the argument. Built-in evaluators have real gaps, and the platforms themselves say so.

Microsoft's own custom-evaluator documentation names the boundary directly: built-in evaluators do not cover domain-specific accuracy, brand tone, or output format compliance. AWS Bedrock's documentation goes further, naming healthcare and finance as examples where business-rule conformance, regulatory citation accuracy, and audit-trail completeness require custom logic. AWS also concedes a sharper limit. Tool-level evaluators cannot catch missing tool calls, because if the agent skips a tool entirely there are zero tool calls to score and the evaluator returns a passing grade in silence. Multi-step agent workflows produce a class of failure where every individual step scores fine and the end-to-end task still fails, because errors in step two corrupt step three which corrupts step four. Braintrust and Anthropic both name the pattern in their engineering writing.
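
That blind spot is easy to show. Below is a minimal sketch of a trajectory-level check that fails when a required tool was never invoked, which is exactly the failure a per-tool-call scorer cannot raise. The trace format and tool names are invented for illustration, not any platform's actual schema.

```python
# Hypothetical sketch: a trajectory-level check for missing tool calls.
# If the agent skips a tool entirely, there are zero calls for a per-call
# evaluator to score, so only a check over the whole trace can catch it.
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    task: str
    tool_calls: list[dict] = field(default_factory=list)  # e.g. {"name": "lookup_order", "ok": True}
    final_answer: str = ""

def required_tools_called(trace: AgentTrace, required: set[str]) -> dict:
    """Flag tools the agent should have used but never invoked."""
    called = {call["name"] for call in trace.tool_calls}
    missing = required - called
    return {
        "passed": not missing,
        "missing_tools": sorted(missing),
        "calls_scored": len(trace.tool_calls),  # per-call scores can all be perfect while this fails
    }

# The agent answered a refund question without ever consulting the policy tool.
trace = AgentTrace(
    task="Customer asks for a refund on order 4412",
    tool_calls=[{"name": "lookup_order", "ok": True}],
    final_answer="Refund approved.",
)
print(required_tools_called(trace, required={"lookup_order", "lookup_refund_policy"}))
# -> {'passed': False, 'missing_tools': ['lookup_refund_policy'], 'calls_scored': 1}
```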

Custom evaluators close that gap, typically in three forms:

  1. Schema and structural validation of non-text outputs. A generated Excel file that is technically valid but writes "ARR Trailing 12" into a cell the downstream system reads as "ARR YTD." A sketch of this kind of check follows the list.

  2. Domain-specific compliance rules. Whether the response correctly applies a tiered discount or cites the right regulatory clause.

  3. Multi-step trajectory checks that watch the full agent turn rather than the individual tool call.
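
For the first of those, here is a minimal sketch of what a structural check on a generated spreadsheet might look like. The header contract, the openpyxl dependency, and the column names are illustrative assumptions, not a real client's schema.

```python
# Hypothetical sketch: validate a generated Excel file against the downstream
# system's header contract. The file can be perfectly valid Excel and still
# fail: "ARR Trailing 12" written where the importer expects "ARR YTD".
from openpyxl import load_workbook

EXPECTED_HEADERS = ["Account", "ARR YTD", "ARR Trailing 12", "Churn %"]  # assumed contract

def check_header_contract(path: str) -> dict:
    ws = load_workbook(path).active
    actual = [ws.cell(row=1, column=i + 1).value for i in range(len(EXPECTED_HEADERS))]
    mismatches = [
        {"column": i + 1, "expected": exp, "actual": act}
        for i, (exp, act) in enumerate(zip(EXPECTED_HEADERS, actual))
        if exp != act
    ]
    return {"passed": not mismatches, "mismatches": mismatches}
```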

Hamel Husain's framing is the cleanest summary: define binary failure modes based on real problems, build custom evaluators for those failures, validate them against human judgment. Generic metrics will not catch a real product failure and good scores on them do not mean the system works.

So the argument is not that custom evaluators are unnecessary. They are necessary for any serious production AI workflow and we build them on every engagement. The argument is that the entire eval layer, built-in plus custom, sits below the layer where the value contract actually lives.

What neither built-in nor custom evaluators measure

Neither layer measures whether the operator's KPI moved.

The closest the practitioner literature comes to naming this directly is the Confident AI playbook. "Even if the metrics worked, they didn't map to a business KPI. You couldn't connect the scores to real-world outcomes." Anthropic's engineering write-up draws the same line by distinguishing evaluations from A/B testing: evaluations support shipping confidence and surface model behaviour problems, A/B testing measures actual user outcomes. The major-vendor framing stops there. It does not name a third tier for tying the evaluation surface to the operational KPI to the financial number on the operator's P&L. That third tier is what most enterprise AI work skips, and it is the layer the dashboards do not cover.

In the engagements we have watched closely, this is where AI ROI quietly dies. The model continues to behave. The evaluator scores stay green. The workflow does not move the number. By the time someone notices, two quarters of investment have shipped and the original business case has been forgotten.

Figure: Built-in platform evaluators and custom evaluators both measure model behaviour. Outcome assurance is the layer above, where the value contract lives, and is what most enterprise AI work skips. (Image: https://a.storyblok.com/f/291042820758376/5298/c96d61909c/three-layers-of-ai-evaluation.svg)

We learned this on ourselves

A few months ago, an internal Architech review presented our evaluations practice as a forward-looking firm differentiator. The work behind that practice was real and rigorous and remains so. Inside the same review, the team was also designing a workflow-redesign discipline with explicit business-outcome targets. The two collided. Within roughly two quarters, the platforms shipped the same evaluator catalogue we were positioning around, and the differentiator depreciated. Nobody got it wrong. The category moved fast and we tightened the playbook in response.

The tightening looks like this. The eval discipline stays, because rigorous evaluation is a prerequisite for anything else. Custom evaluators stay, because the gaps in built-in catalogues are real and matter on every engagement. What we add on top is a layer that does not exist in the platform documentation: a discipline that ties the evaluation surface to the operator's KPI to the financial number, and runs continuously past launch rather than at launch. We call it outcome assurance. The phrase is not industry-standard yet, and the principle of tying evaluation to business outcomes has been articulated under other names. The novelty is the operator-facing packaging, not the principle.

What the discipline looks like in delivery

The shape is easier to see in a real engagement. We recently delivered a finance workflow for a Canadian healthcare services company. Five-agent system on Azure AI Foundry, structured-data normalization with financial-category mapping, analyst-grade output. Three controls carry the discipline.

  1. Confidence scoring with human review on low-confidence financial mappings. The agent does not silently push a low-confidence categorization into the output. A reviewer sees it before it lands.

  2. Two-layer testability that separates semantic correctness from structural correctness. The semantic layer asks whether the model interpreted the input correctly. The structural layer asks whether the output is valid against the downstream contract. They fail differently and they fix differently, and collapsing them into a single score loses both signals.

  3. Golden Query sets in CI/CD with variance checks. Real client queries paired with verified expected outputs run on every change, and the variance check flags drift before it reaches a production user. A sketch of this check follows the list.
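
Here is a minimal sketch of that third control as it might run in a CI stage. The file formats, the toy field-match scorer, and the 0.05 variance threshold are illustrative assumptions, not the engagement's actual configuration.

```python
# Hypothetical sketch: re-run a golden-query set on every change and fail the
# build when any query drifts more than a set amount below its baseline score.
import json
import sys

VARIANCE_THRESHOLD = 0.05  # flag drift if a query's score drops by more than 0.05

def score(expected: dict, actual: dict) -> float:
    """Toy scorer: fraction of expected output fields reproduced exactly."""
    matched = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matched / len(expected)

def run_golden_set(golden_path: str, outputs_path: str) -> int:
    golden = json.load(open(golden_path))    # [{"id": ..., "expected": {...}, "baseline_score": ...}, ...]
    outputs = json.load(open(outputs_path))  # {"<id>": {...actual output fields...}, ...}
    drifted = []
    for case in golden:
        current = score(case["expected"], outputs.get(case["id"], {}))
        if case["baseline_score"] - current > VARIANCE_THRESHOLD:
            drifted.append({"id": case["id"], "baseline": case["baseline_score"], "current": current})
    if drifted:
        print("Golden-query drift detected:", drifted)
        return 1  # non-zero exit fails the CI stage before the change reaches a production user
    return 0

if __name__ == "__main__":
    sys.exit(run_golden_set(sys.argv[1], sys.argv[2]))
```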

None of these is exotic on its own. The discipline is in keeping all three running on a cadence and tying each one to a metric that maps to the workflow's business KPI, not just to the model's behaviour.

The strongest counter-position, taken seriously

The cleanest counter-argument comes from the eval-first practitioner voice. Husain's case, made well, is that error analysis is the most important activity in evaluations and that you cannot tie an evaluator to a KPI until you have first built the evaluator that detects the failure. Braintrust's case is that without evaluation infrastructure every deployment is a guess. Both are right. We use both in delivery. The question for an operator weighing a vendor is not whether the vendor runs evaluations, because every serious vendor does and the platforms ship the basics by default. The question is whether the vendor can show the linkage from an evaluator to an operational KPI to a financial number on your P&L. Strong eval rigor is the substrate. It is necessary. It is not the same thing as outcome assurance and it does not by itself produce the second.

A test you can run this week

Open the most recent AI vendor or managed-services proposal on your desk. Look for a single page, slide, or paragraph that traces a line from one of the evaluators the vendor is selling you, to the operational KPI that is supposed to move, to the financial number on your P&L that the operational KPI rolls into. If you can find it, the proposal is unusual. Most cannot. If a vendor can name the workflow, the operator whose job changes, the metric that moves, and the cadence on which the linkage is checked, that is a proposal that has been thought through past the dashboard. If they can only describe relevance scores and hallucination rates, they are selling plumbing.

Eval metric → operational KPI → financial number. Checked on a cadence, not at launch. Most vendor proposals do not draw this line.
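
For concreteness, here is the shape of the record that test is asking a vendor to produce, written as data so all three levels and the cadence are explicit. Every metric, KPI, and figure below is invented for the example; the point is the shape, not the numbers.

```python
# Hypothetical linkage record: eval metric -> operational KPI -> financial line,
# plus the cadence on which the linkage is checked. All values are invented.
linkage = {
    "eval_metric": "grounded citation rate on the claims-triage agent",
    "operational_kpi": "cost per resolved claim",
    "financial_line": "claims operations expense on the P&L",
    "baseline": {"grounded_rate": 0.91, "cost_per_resolution": 14.20},
    "target": {"grounded_rate": 0.95, "cost_per_resolution": 11.50},
    "check_cadence": "weekly, reviewed with the workflow owner",
}

def linkage_is_complete(record: dict) -> bool:
    """The test in one line: can the vendor fill in all three levels plus a cadence?"""
    required = {"eval_metric", "operational_kpi", "financial_line", "check_cadence"}
    return required <= record.keys()

print(linkage_is_complete(linkage))  # True; most proposals cannot produce this record
```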

This is the same posture the one-claim-test piece argued from the homepage side. Read the firm's own claim, run it against a test the operator can apply inside ninety days, and most do not pass. The eval version of the test is narrower. Show me the linkage from an eval metric to an operational KPI to a financial number. The vendor can answer or they cannot.

The discipline that makes the answer concrete on our side now lives at architech.ca/outcome-assurance. The page names what we measure, how we tie it to a KPI, and what continuous looks like in practice. The blog you are reading is the story of how we got there. The page is the discipline itself. If you want to test us against the same question we are asking you to take to other vendors, that is the place to start.

Ready to apply this to your workflows?

Architech's AI Jumpstart is the structured entry point.