
Reliability evaluations

2026-05-10

The four-class evaluator framework we use internally and ship to customers, with notes on calibration and failure modes.

We treat reliability evaluation as a first-class product surface. This note describes the framework we use internally at Newmen and ship to customers in the form of the four evaluator kinds available at v1: regex, JSON-schema, LLM-judge, and embedding-match. We use the same framework on our own model releases. We do not believe in a separate "internal eval" surface that is more sophisticated than what we make available to customers — the asymmetry would be exactly the kind of opacity we wrote our values against.

What an evaluator is

An evaluator is a deterministic-ish function that takes an input, a model prediction, and an optional expected output, and returns a score in the range zero to one with optional structured detail. A score of one is a pass. Zero is a failure. Anything in between is a soft result the evaluator decided to emit.
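As a rough illustration of that contract, here is a minimal sketch in Python. The names EvalResult, Evaluator, and ExactMatch are hypothetical and chosen for this example; they are not the platform's SDK types.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


@dataclass(frozen=True)
class EvalResult:
    """A score in [0.0, 1.0] plus optional structured detail (hypothetical type)."""
    score: float
    detail: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {self.score}")


class Evaluator(Protocol):
    """Anything that maps (input, prediction, optional expected output) to a result."""

    def evaluate(
        self, input_text: str, prediction: str, expected_output: Optional[str] = None
    ) -> EvalResult: ...


class ExactMatch:
    """Toy evaluator used only to show the shape of the interface."""

    def evaluate(
        self, input_text: str, prediction: str, expected_output: Optional[str] = None
    ) -> EvalResult:
        if expected_output is None:
            return EvalResult(0.0, {"reason": "no expected_output provided"})
        matched = prediction.strip() == expected_output.strip()
        return EvalResult(1.0 if matched else 0.0, {"matched": matched})
```

Validating the range in the result type keeps every evaluator kind honest about the zero-to-one contract, whatever logic produced the score.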

We deliberately rejected pass/fail booleans during design. Soft scores let LLM judges and embedding matchers report confidence rather than a coin flip at the threshold. Soft scores also play more cleanly with ship gates, which set a minimum threshold per evaluator binding.

The four kinds

Regex

The simplest evaluator. A pattern and a must_match flag. The classic production use case is checking that a PII pattern does not appear in the output. Regex evaluators are cheap, deterministic, and impossible to misinterpret. We recommend customers start every operation's evaluator suite with at least one regex evaluator covering disallowed content.
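A minimal sketch of the pattern-plus-flag idea, using Python's standard re module. The function name regex_score and the SSN-shaped pattern are illustrative assumptions, not the product's configuration format.

```python
import re


def regex_score(pattern: str, prediction: str, must_match: bool) -> float:
    """Return 1.0 when the presence of the pattern agrees with must_match, else 0.0."""
    found = re.search(pattern, prediction) is not None
    return 1.0 if found == must_match else 0.0


# Hypothetical disallowed-content check: a US SSN-shaped string must NOT appear.
SSN_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"

leaky = regex_score(SSN_PATTERN, "Your SSN is 123-45-6789.", must_match=False)
clean = regex_score(SSN_PATTERN, "I cannot share that.", must_match=False)
```

With must_match set to False the evaluator passes only when the pattern is absent, which is the disallowed-content case described above.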

JSON schema

For structured generation tasks. Parse failure scores zero. A parse that succeeds but fails schema validation scores 0.5 — partial credit, because the model produced JSON but missed a required field. A fully valid object scores one. The intermediate score is deliberate. A model that produces almost-valid JSON is materially different from one that produces prose. Customers consistently want to surface that distinction.
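The three-level scoring can be sketched as follows. To stay dependency-free, this example substitutes a simple required-field and type check for full JSON Schema validation; the function name and the shape of the required argument are assumptions made for illustration.

```python
import json


def json_schema_score(prediction: str, required: dict[str, type]) -> float:
    """Score 0.0 on parse failure, 0.5 on a parse that fails validation, 1.0 when valid.

    `required` maps field names to expected Python types. It stands in for a
    real JSON Schema and is not the platform's schema format.
    """
    try:
        obj = json.loads(prediction)
    except json.JSONDecodeError:
        return 0.0  # the model did not produce JSON at all
    if not isinstance(obj, dict):
        return 0.5  # valid JSON, but not the object the schema expects
    for name, expected_type in required.items():
        if name not in obj or not isinstance(obj[name], expected_type):
            return 0.5  # parsed, but a required field is missing or mistyped
    return 1.0


schema = {"title": str, "year": int}
prose = json_schema_score("Here is the book you asked about.", schema)
partial = json_schema_score('{"title": "Dune"}', schema)
valid = json_schema_score('{"title": "Dune", "year": 1965}', schema)
```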

LLM judge

For qualitative rubrics. The judge model receives the input, the prediction, and a rubric, and is constrained with a JSON schema to return { score, reasoning }. Several design notes here.

We default to using Atlas-1 as the judge. This is not because Atlas is the best at judging — it is because using a comparable peer minimizes drift between what the judge considers good and what the production model considers good. Customers can override the judge model.

LLM judges are expensive and slow. We recommend using them sparingly, primarily for tone-and-style scoring or qualitative correctness that mechanical evaluators cannot capture. Avoid LLM judges for things a regex or schema can express more cleanly.

We strongly recommend calibrating an LLM judge against human labels at least once per quarter. The console exposes a calibration view that compares judge scores against a held-out set of human-labeled items. Drift between judge and human is a warning sign worth investigating before it shows up in a ship-gate failure.

Embedding match

For paraphrase tolerance against a gold answer. The evaluator embeds both the prediction and the expected output, computes their cosine similarity, and rescales the result to a zero-to-one score relative to a configured similarity threshold. Embedding match requires expected_output on each dataset item.
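A sketch of the scoring step, assuming the embeddings have already been computed elsewhere and arrive as plain lists of floats. The normalization rule here is one plausible reading and is an assumption: similarity at or above the threshold scores one, and anything below is scaled linearly toward zero. The rule actually used by the platform may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_match_score(
    prediction_vec: list[float], expected_vec: list[float], threshold: float
) -> float:
    """Map cosine similarity to [0, 1] relative to the threshold (assumed rule)."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    similarity = cosine_similarity(prediction_vec, expected_vec)
    if similarity >= threshold:
        return 1.0
    # Below the threshold: scale linearly, clamping negative similarity to zero.
    return max(0.0, similarity / threshold)


identical = embedding_match_score([1.0, 0.0], [1.0, 0.0], threshold=0.8)
unrelated = embedding_match_score([1.0, 0.0], [0.0, 1.0], threshold=0.8)
nearby = embedding_match_score([1.0, 0.0], [1.0, 1.0], threshold=0.8)
```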

We caution customers against using embedding match as the only evaluator on an operation. It works well as a secondary check — "the answer is at least semantically in the right neighborhood" — but it has well-known failure modes for negation and small but consequential errors. Pair it with a more discriminating evaluator.

Ship gates

Evaluators bind to operations via ship gates. A ship gate names an evaluator and a minimum score. Promoting a dataset to golden runs all bound evaluators against the dataset and checks every gate. A failed gate returns the structured error documented at /docs/datasets.
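The gate check itself is simple to picture. In the sketch below, ShipGate and check_gates are hypothetical names, the scores are assumed to be already aggregated per evaluator over the dataset, and the failure dictionaries are illustrative rather than the structured error documented at /docs/datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShipGate:
    """Binds one evaluator (by name) to a minimum acceptable score."""
    evaluator: str
    min_score: float


def check_gates(gates: list[ShipGate], scores: dict[str, float]) -> list[dict]:
    """Return one entry per failed gate; an empty list means promotion may proceed."""
    failures = []
    for gate in gates:
        actual = scores.get(gate.evaluator)
        # A missing score counts as a failure: the gate cannot be shown to hold.
        if actual is None or actual < gate.min_score:
            failures.append(
                {"evaluator": gate.evaluator, "min_score": gate.min_score, "actual": actual}
            )
    return failures


gates = [ShipGate("no_pii", 1.0), ShipGate("tone", 0.7)]
failed = check_gates(gates, {"no_pii": 1.0, "tone": 0.65})
passed = check_gates(gates, {"no_pii": 1.0, "tone": 0.9})
```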

We treat ship gates as a contract. The customer commits to a quality bar in evaluator-and-score terms; the platform enforces that contract at promotion time. The contract is auditable. The evaluator config is versioned and the score history is queryable.

Failure modes we have seen

A short, opinionated list, written for customers about to build their first eval suite.

Evaluator over-fitting. The most common failure: an evaluator passes everything because its rubric or pattern is too lenient. Mitigation: include known-bad examples in your dataset and confirm the evaluator scores them low. We expose a "negative item" tag for this purpose.

Judge drift. The LLM judge starts disagreeing with humans. Mitigation: the calibration view, run quarterly at minimum.

Gate creep. Customers initially set conservative ship gates, then loosen them when promotion blocks shipping. Mitigation: track the historical pass rate of every gate. The console surfaces "this gate has not failed in ninety days" — a signal to either tighten it or remove it.

Untagged operations. Calls without metadata.operation_id cannot be evaluated. This is the most common production bug we see. Mitigation: console alerts on operations with sudden drops in tagged-call volume.
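Two of the mitigations above lend themselves to small offline checks. The sketch below assumes a hypothetical item shape (a dict with "prediction" and "tags"), a tag literally named "negative", and a mapping from gate name to the time of its last failure; none of these are the platform's actual data model.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


def lenient_on_negatives(
    items: list[dict], evaluate: Callable[[str], float], max_negative_score: float = 0.5
) -> list[dict]:
    """Return known-bad items that the evaluator nevertheless scored too high."""
    return [
        item
        for item in items
        if "negative" in item["tags"] and evaluate(item["prediction"]) > max_negative_score
    ]


def stale_gates(
    last_failure: dict[str, Optional[datetime]],
    now: datetime,
    window: timedelta = timedelta(days=90),
) -> list[str]:
    """Return names of gates with no failure inside the window (None means never failed)."""
    return sorted(
        name
        for name, failed_at in last_failure.items()
        if failed_at is None or now - failed_at >= window
    )


items = [
    {"prediction": "bad answer", "tags": ["negative"]},
    {"prediction": "good answer", "tags": []},
]
too_lenient = lenient_on_negatives(items, lambda prediction: 1.0)

history = {"tone": datetime(2026, 4, 1), "no_pii": datetime(2025, 12, 1), "schema": None}
stale = stale_gates(history, now=datetime(2026, 5, 10))
```

An evaluator that returns any items from the first check is passing examples it should reject; any gate returned by the second is a candidate for tightening or removal.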

What we report on Atlas releases

Every Atlas release is gated on its own evaluation suite. We publish per-operation scores, per-evaluator scores, and notable changes (positive and negative) in the release notes. We treat regressions on customer-facing operations as launch blockers; we have delayed releases for them and we will continue to.

The connection to prompt engineering

Prompt engineering is an attempt to move a model's behavior without being able to measure whether you succeeded. You adjust the prompt, test informally, ship, and find out later that production traffic had a different distribution than your test cases.

The evaluator framework is the alternative. You describe what good looks like — in a regex, a schema, a rubric, or a similarity threshold — and you measure against production traffic. The measurement drives the correction. The correction becomes training signal. The loop closes on data, not intuition.

You still have to write the evaluators. That is the work we believe belongs to you. Everything downstream of that — routing, adapter training, deployment gating — is what Newmen handles.

If you have specific questions about the methodology or want to discuss your eval suite, write to research@newmen.ai. We are happy to look at concrete operations under NDA.
