NEWMEN

Ship gates caught a regression we didn't know we'd introduced

2026-04-15 · 3 min read

We use our own eval system to gate Atlas adapter releases. Here is a case where it actually worked.

We dogfood the reliability loop internally on Atlas adapter releases. When we are preparing a new adapter checkpoint for production, we run it through the same evaluator and ship-gate setup our customers use. In February, the gates caught something we would have shipped.

What happened

We were preparing an adapter that had been trained with updated correction data — a batch we had curated from internal usage over the previous six weeks. The batch had a heavier representation of multi-turn conversations than prior batches. The adapter scored well on our standard evals. We moved it to the pre-promotion stage, ran the full eval suite against our internal datasets, and one evaluator failed.

The failing evaluator was a JSON-schema validator on a structured-extraction operation. The operation extracts invoice line items into a fixed schema: an array of objects with description, quantity, unit_price, and amount. The adapter was outputting an additional currency_code field in roughly 12% of responses. The field wasn't in the schema, so the JSON-schema evaluator scored those responses as 0.5 (valid JSON, invalid schema). The aggregate score on that evaluator dropped to 0.89, below the 0.95 gate.
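To make the scoring concrete, here is a minimal sketch of a three-level schema evaluator feeding a threshold gate. The 0.5 score for "valid JSON, invalid schema" and the 0.95 gate come from the description above. Everything else is an assumption for illustration: the 0.0 score for unparseable output, the field types, the plain-mean aggregation, and the names score_response, aggregate, and passes_gate. It is a sketch under those assumptions, not the production evaluator.

```python
import json

# Allowed fields for one invoice line item. Field names are from the post;
# the Python types are assumed for this sketch.
REQUIRED_FIELDS = {
    "description": str,
    "quantity": (int, float),
    "unit_price": (int, float),
    "amount": (int, float),
}
GATE_THRESHOLD = 0.95


def score_response(raw: str) -> float:
    """1.0 = matches schema, 0.5 = valid JSON but wrong schema, 0.0 = not JSON (assumed)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, list):
        return 0.5
    for item in data:
        # Any missing or extra key (for example currency_code) breaks the schema.
        if not isinstance(item, dict) or set(item) != set(REQUIRED_FIELDS):
            return 0.5
        for key, types in REQUIRED_FIELDS.items():
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(item[key], bool) or not isinstance(item[key], types):
                return 0.5
    return 1.0


def aggregate(scores):
    """Plain mean of per-response scores (assumed aggregation)."""
    return sum(scores) / len(scores)


def passes_gate(scores, threshold=GATE_THRESHOLD):
    return aggregate(scores) >= threshold


ok = '[{"description": "Widget", "quantity": 2, "unit_price": 5.0, "amount": 10.0}]'
extra = ('[{"description": "Widget", "quantity": 2, "unit_price": 5.0, '
         '"amount": 10.0, "currency_code": "EUR"}]')

# Eight conforming responses and two with the extra field.
scores = [score_response(r) for r in [ok] * 8 + [extra] * 2]
print(aggregate(scores), passes_gate(scores))
```

The point of the half-credit level is that the output is still parseable, so a lenient downstream consumer might never complain. The gate is what turns a quiet schema drift into a blocked promotion.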

The gate blocked promotion. We pulled the adapter. We went back to the training data.

What the data showed

Looking at the correction batch, we found a cluster of multi-turn conversations where a reviewer had corrected an earlier response to include currency_code. The reviewer was right in context — those conversations were about international invoices and currency_code was useful information. But the correction was not tagged to any operation, so it was included in the general training batch, and the adapter generalized from it.

This is the risk of training on unconstrained correction signal: a correction that is appropriate in one context can cause a regression in another context that expects a stricter schema.

What we changed

We added a required field to our internal annotation interface: reviewers must tag every correction with a scope, either scoped or general. A correction tagged as scoped does not go into the general training batch; it goes into an operation-specific dataset. A correction tagged as general is reviewed by a second person before it is included in the base batch.
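The routing rule is simple enough to sketch in a few lines. The Correction type, its field names, the destination labels, and the requirement that a scoped correction names its operation are all hypothetical, chosen for illustration. The process described above is manual and lives in an annotation interface, not in this code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Correction:
    text: str
    scope: str                        # "scoped" or "general" (hypothetical values)
    operation: Optional[str] = None   # assumed to be required when scope == "scoped"
    second_review_approved: bool = False


def route(correction: Correction) -> str:
    """Return the destination dataset for a correction under the scoping rule."""
    if correction.scope == "scoped":
        if not correction.operation:
            raise ValueError("scoped corrections must name an operation")
        # Scoped corrections never enter the general training batch.
        return f"operation:{correction.operation}"
    if correction.scope == "general":
        # General corrections wait for a second reviewer before joining the base batch.
        if correction.second_review_approved:
            return "base_batch"
        return "pending_second_review"
    raise ValueError(f"unknown scope: {correction.scope!r}")


currency_fix = Correction("include currency_code", "scoped", "international_invoices")
print(route(currency_fix))
```

Under this rule, the currency_code correction from February would have landed in an operation-specific dataset and stayed out of the batch that trained the invoice line-item extraction.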

This is a manual process. We do not have a great automated way to detect context-contamination in training data yet. But having the ship gate fail gave us a concrete failure to reason backwards from, which is different from discovering the regression after deployment.

The thing about catching your own bugs

There is something slightly uncomfortable about writing this post. We are describing a mistake we made and caught. The intended read is: the system works, we found it before it shipped. But the system working does not mean the underlying issue — unconstrained correction signal bleeding across operations — was not a real problem. It was.

We shipped without this bug because the gate caught it, which is the outcome the gate exists to produce. We would still prefer not to have introduced the bug in the first place. The process improvement on correction scoping is the more important part of this story.

If you are running the reliability loop on your own operations, the same failure mode is possible in your data. If a correction that was appropriate in one context is generalizing to a context where it is not, your evaluators will catch it — but only if your evaluators cover the right operations. Coverage matters as much as thresholds.
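One cheap way to act on the coverage point is to list the operations that have no evaluator attached at all. The sketch below assumes you can enumerate your operations and map each one to its evaluators. The function name, the operation names, and the evaluator names are hypothetical, not part of any product API.

```python
def uncovered_operations(operations, evaluators_by_operation):
    """Return operations with no evaluator attached, sorted for stable output."""
    return sorted(
        op for op in operations
        if not evaluators_by_operation.get(op)  # missing key or empty list
    )


# Hypothetical inventory for illustration.
operations = ["invoice_extraction", "support_summary", "contract_review"]
evaluators_by_operation = {
    "invoice_extraction": ["json_schema"],
    "support_summary": [],
}

print(uncovered_operations(operations, evaluators_by_operation))
```

An operation on that list can regress without any gate noticing, no matter how strict the thresholds are elsewhere.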

The gates work. Write the evaluators first.