Engineering · 14 Mar 2026 · 5 min read

The eval harness is the product.

You don't have an AI feature. You have an eval harness with a model attached. Treat it that way and the rest gets easier.

Aarav Mehta
Engineering Lead, Iksha

Every AI engagement we've shipped in the last two years has ended up revolving around the same artefact: a list of 800 to 1,500 graded cases, kept in a CSV in someone's repo, that decides whether anything else we've built is allowed to ship.

The model changes. The prompt template changes. The retrieval store changes. The eval set is what we keep. It is, in a way the team rarely says out loud, the product.

This is uncomfortable for two reasons. The first is that nobody buys an eval harness. They buy a chatbot, a co-pilot, a "Q&A over our docs" feature. The second is that an eval harness is unglamorous. It is a CSV. It is a Python script. It is a reviewer who graded 800 cases on a Saturday because we couldn't find anyone better.

The cost of not having one

The first six weeks of any RAG project go fine. The model is impressive on the demos you show your CEO. Then someone in the field asks the system a question that hits a citation it has never seen, the model confabulates, the customer's compliance team finds out, and the project is paused for three months while the legal team catches up.

This isn't an "AI is unreliable" story. It is a "we shipped a system without measuring it" story. We had no harness. We had no regression test. We had a vibe-check against the same fifteen examples the engineer who built the prototype was using on his laptop.

What a real harness looks like

Three things, in roughly this order:

  1. A graded golden set. 200 cases minimum, 800 if the domain is regulated. Each case has an input, an expected output, and a reason. The reason is what stops the set from drifting.
  2. Task-level metrics, not vibe metrics. "Did the system retrieve the correct policy?" beats "Is the answer good?" One is testable. The other is a conversation.
  3. Regression tests in CI. Every PR runs the harness against a sampled subset. Bigger runs land on a nightly cron. If the score drops more than 2 points, the build fails.
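The three pieces above can be sketched in a few dozen lines of Python. This is a minimal illustration, not our production harness: the column names, the `POL-00x` policy IDs, the `fake_retriever` stand-in and the exact-match metric are all assumptions made for the example, and "2 points" is read here as 2 percentage points of accuracy.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Case:
    input: str
    expected: str  # assumed here to be the ID of the policy to retrieve
    reason: str    # why this is the expected output


def load_golden_set(csv_lines: Iterable[str]) -> list[Case]:
    """Read cases from a CSV with input, expected and reason columns."""
    cases = []
    for row in csv.DictReader(csv_lines):
        # Refuse cases without a reason: the reason is what stops drift.
        if not row["reason"].strip():
            raise ValueError(f"case {row['input']!r} has no reason")
        cases.append(Case(row["input"], row["expected"], row["reason"]))
    return cases


def score(cases: list[Case], system: Callable[[str], str]) -> float:
    """Task-level metric: percentage of cases answered with the expected output."""
    if not cases:
        raise ValueError("golden set is empty")
    hits = sum(1 for case in cases if system(case.input) == case.expected)
    return 100.0 * hits / len(cases)


def gate(current: float, baseline: float, max_drop: float = 2.0) -> bool:
    """Return True if the build may pass (drop of at most max_drop points)."""
    return baseline - current <= max_drop


# Hypothetical golden set, inlined so the sketch runs without a file.
GOLDEN_CSV = """input,expected,reason
What is the refund window?,POL-001,Refund terms are defined in POL-001
Who approves travel?,POL-002,Travel approval is defined in POL-002
How long is data retained?,POL-003,The retention schedule is in POL-003
Can I expense a laptop?,POL-004,Equipment purchases fall under POL-004
"""


def fake_retriever(question: str) -> str:
    """Stand-in for the real system; it gets the last case wrong on purpose."""
    answers = {
        "What is the refund window?": "POL-001",
        "Who approves travel?": "POL-002",
        "How long is data retained?": "POL-003",
        "Can I expense a laptop?": "POL-001",
    }
    return answers[question]


cases = load_golden_set(io.StringIO(GOLDEN_CSV))
current = score(cases, fake_retriever)
print(f"score={current:.1f} pass={gate(current, baseline=78.0)}")
```

In CI, the baseline would come from the last accepted run rather than a literal, and a failing `gate` would exit non-zero so the build goes red.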

That's it. There is no more advanced version of this. The labs that ship reliable AI features have this. The ones that don't, don't.

Where to put it on the team

The harness wants an owner. We've watched it die when it sits with the data scientist (who treats it as a research artefact), and when it sits with the platform engineer (who treats it as a CI nuisance). It belongs to the same person who would own a test suite for a payments system: a senior engineer who gets a tap on the shoulder when it goes red.

Six weeks in, the harness will tell you more about the future of the system than the model will. That's not a failure of the model. It's the harness doing its job.

If you're building an AI feature and you don't have a harness yet, send us a brief. We'll tell you whether you have a project or a research problem — for free.
