NEWMEN

The reliability loop

Six stages from call to next version.

Frontier labs treat API customers like numbers. They don’t fix bad responses, they don’t take your corrections into training, and they don’t refund the calls they get wrong. Atlas inverts the contract. Your engineers correct production calls in place. The signal compounds. Per-operation. Per-tenant. Per-version. And if a call scores below your eval threshold, it isn’t metered.

[Diagram: the reliability loop, continuous and per-customer. §01 Tag: in-place corrections. §02 Eval: gates + regressions. §03 Train: per-customer golden. §04 Deploy: shadow → canary → prod.]

Pass rate, per customer

The loop compounds.

Three production customers over 28 days. Eval pass rate rises as tagged calls become training signal. The eval-gate threshold at 85% is the ship condition.

[Chart: eval pass rate (axis 50% to 90%) by day, d-27 through d-0, with the eval gate drawn at 85%. Labeled values: northwind/prod 88.7%, halo/api 92.6%, lattice/ops 90.2%.]

Synthetic data for illustration. Numbers representative of early pilot results.
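
To make the ship condition concrete, here is a minimal sketch of a pass-rate check against the 85% gate. The `EvaluatedCall` shape and both function names are hypothetical, and treating "at or above 85%" as passing is an assumption; this is not Atlas's own scoring code.

```typescript
// Hypothetical shape: one evaluated call with a pass/fail outcome.
interface EvaluatedCall {
  callId: string;
  passed: boolean;
}

// Fraction of evaluated calls that passed, in [0, 1].
// An empty window is reported as 0 so it can never clear the gate.
function passRate(calls: EvaluatedCall[]): number {
  if (calls.length === 0) return 0;
  const passed = calls.filter((c) => c.passed).length;
  return passed / calls.length;
}

// Assumed ship condition: pass rate at or above the eval-gate threshold.
function meetsEvalGate(calls: EvaluatedCall[], threshold = 0.85): boolean {
  return passRate(calls) >= threshold;
}
```

With nine passing calls out of ten, `meetsEvalGate` returns true; with eight out of ten it returns false.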

How it works

Step by step.

1

Define the operation

An operation is a named, observable unit of work — `summarize_ticket`, `extract_invoice`, `classify_intent`. You can route traffic, build datasets, and gate releases per operation, never on the model as a whole.

await client.operations.create({
  key: "summarize_ticket",
  name: "Support ticket summary",
  ship_gates: [{ evaluator_id: "ev_regex_pii", min_score: 1.0 }],
});
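
Before sending a definition like the one above, it can help to check it locally. The sketch below is one way to do that. The `OperationSpec` and `ShipGate` types mirror the fields in the snippet, but the snake_case key rule and the 0 to 1 score range are assumptions, and `validateOperation` is a hypothetical helper rather than part of the Atlas client.

```typescript
interface ShipGate {
  evaluator_id: string;
  min_score: number;
}

interface OperationSpec {
  key: string;
  name: string;
  ship_gates: ShipGate[];
}

// Returns a list of human-readable problems; an empty list means the spec looks sane.
function validateOperation(spec: OperationSpec): string[] {
  const problems: string[] = [];
  // Assumption: keys are snake_case, like "summarize_ticket".
  if (!/^[a-z][a-z0-9_]*$/.test(spec.key)) {
    problems.push(`key "${spec.key}" is not snake_case`);
  }
  for (const gate of spec.ship_gates) {
    // Assumption: scores live in [0, 1]. The negated form also rejects NaN.
    if (!(gate.min_score >= 0 && gate.min_score <= 1)) {
      problems.push(`gate ${gate.evaluator_id}: min_score ${gate.min_score} is outside [0, 1]`);
    }
  }
  return problems;
}
```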
2

Tag traffic at the call site

Every production request carries `metadata.operation_id`, set to the operation's key. Atlas attaches that label to the recorded call. There is no separate trace pipeline to build and no payload to re-serialize.

await client.chat.completions.create({
  model: "atlas-1",
  messages,
  metadata: { operation_id: "summarize_ticket" },
});
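
If many call sites need the same tag, a small wrapper keeps the label from being forgotten. The following is a sketch only: `ChatRequest` is a simplified, assumed request shape and `tagOperation` is a hypothetical helper, not an Atlas SDK function.

```typescript
// Simplified, assumed request shape for illustration.
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  metadata?: Record<string, string>;
}

// Returns a copy of the request with metadata.operation_id set.
// Existing metadata keys are preserved; the input is not mutated.
function tagOperation(request: ChatRequest, operationId: string): ChatRequest {
  return {
    ...request,
    metadata: { ...request.metadata, operation_id: operationId },
  };
}
```

The tagged object can then be passed to `client.chat.completions.create` exactly as in the call above.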
3

Review and tag in the console

Engineers open calls in the console, rate them thumbs-up / thumbs-down, or write a corrected output. No CSV export. No re-upload. The correction lives next to the original call forever.

// or programmatically:
await client.feedback.create({
  call_id: "chatcmpl-…",
  rating: "thumbs_down",
  correction: "Summary should be one sentence, not three.",
});
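
One way to picture "the correction lives next to the original call" is a record that keeps its feedback alongside the untouched output. This is an illustrative data model only: `RecordedCall`, `Feedback`, and `attachFeedback` are hypothetical names, and the `"thumbs_up"` value is assumed by analogy with the `"thumbs_down"` shown above.

```typescript
type Rating = "thumbs_up" | "thumbs_down";

interface Feedback {
  call_id: string;
  rating: Rating;
  correction?: string;
}

// Hypothetical stored call: the original output plus every piece of feedback on it.
interface RecordedCall {
  id: string;
  output: string;
  feedback: Feedback[];
}

// Appends feedback to a call without altering the original output.
// Blank corrections are dropped so an empty text box does not count as a correction.
function attachFeedback(call: RecordedCall, rating: Rating, correction?: string): RecordedCall {
  const text = correction?.trim();
  const entry: Feedback = text
    ? { call_id: call.id, rating, correction: text }
    : { call_id: call.id, rating };
  return { ...call, feedback: [...call.feedback, entry] };
}
```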
4

Build a golden dataset

Filter tagged calls by operation, rating, and date range. Promote the selection into a draft dataset. Each dataset has a version number — every promotion is auditable.

const ds = await client.datasets.create({
  operation_id: "summarize_ticket",
  name: "Tickets v3",
});
await client.datasets.items.add(ds.id, items);
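
The snippet above does not show where `items` comes from. As a rough sketch, the selection step (filter by operation, rating, and date range) could look like this if done client-side. The `TaggedCall` and `CallFilter` shapes and the half-open date range are assumptions, not the console's actual filter semantics.

```typescript
// Hypothetical shape of a tagged call as exported for selection.
interface TaggedCall {
  call_id: string;
  operation_id: string;
  rating: "thumbs_up" | "thumbs_down";
  created_at: string; // ISO 8601 timestamp
}

interface CallFilter {
  operation_id: string;
  rating?: "thumbs_up" | "thumbs_down"; // omit to accept either rating
  from: Date; // inclusive
  to: Date; // exclusive
}

// Keeps calls that match the operation, the optional rating, and the date window.
function selectCalls(calls: TaggedCall[], filter: CallFilter): TaggedCall[] {
  return calls.filter((call) => {
    const t = new Date(call.created_at).getTime();
    return (
      call.operation_id === filter.operation_id &&
      (filter.rating === undefined || call.rating === filter.rating) &&
      t >= filter.from.getTime() &&
      t < filter.to.getTime()
    );
  });
}
```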
5

Run evaluators and ship gates

Bind evaluators — regex, JSON-schema, LLM-judge, embedding-match — to the operation. Configure ship gates with minimum scores. Eval runs report per-item and aggregate results.

const evalRun = await client.evaluations.create({
  dataset_id: ds.id,
  evaluator_ids: ["ev_regex_pii", "ev_judge_quality"],
});
// → summaryScores.overall = 0.94
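
For intuition about per-item versus aggregate results, here is a toy regex evaluator with a mean as the aggregate. Everything in it is an assumption for illustration: the email pattern stands in for a real PII check, averaging is only one possible aggregation, and none of these names come from Atlas.

```typescript
// An evaluator maps one model output to a score in [0, 1].
type Evaluator = (output: string) => number;

// Toy PII check: looks for anything shaped like an email address.
// No "g" flag, so RegExp.test keeps no state between calls.
const EMAIL_LIKE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

const noEmailPii: Evaluator = (output) => (EMAIL_LIKE.test(output) ? 0 : 1);

interface EvalResult {
  perItem: number[];
  mean: number;
}

// Scores every output and reports the per-item scores plus their mean.
// The mean of an empty run is reported as 0.
function runEvaluator(evaluate: Evaluator, outputs: string[]): EvalResult {
  const perItem = outputs.map(evaluate);
  const total = perItem.reduce((sum, score) => sum + score, 0);
  return { perItem, mean: perItem.length === 0 ? 0 : total / perItem.length };
}
```

A gate with `min_score: 1.0` on such an evaluator means a single leaking item blocks the release, which is usually what you want for PII.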
6

Promote, then request training

Promote the dataset to golden when every gate passes; a failed gate returns `400 ship_gates_unmet` with the specific failures. Once golden, request training and a solutions engineer responds within one business day.

await client.datasets.promote(ds.id);
// → status: "golden"

await client.datasets.training(ds.id, {
  notes: "Quarterly retrain.",
});
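
To avoid discovering unmet gates only through a `400 ship_gates_unmet` response, a team might run a local pre-check first. The sketch below assumes a gate passes when the evaluator's aggregate score is at or above `min_score` and treats a missing score as a failure. `unmetGates` and `GateFailure` are hypothetical, and the server's real gate logic may differ.

```typescript
interface ShipGate {
  evaluator_id: string;
  min_score: number;
}

interface GateFailure {
  evaluator_id: string;
  min_score: number;
  actual: number | null; // null when the evaluator produced no score
}

// Compares aggregate scores (keyed by evaluator id) against each gate
// and returns the gates that are not met.
function unmetGates(gates: ShipGate[], scores: Map<string, number>): GateFailure[] {
  const failures: GateFailure[] = [];
  for (const gate of gates) {
    const actual = scores.get(gate.evaluator_id);
    if (actual === undefined || actual < gate.min_score) {
      failures.push({
        evaluator_id: gate.evaluator_id,
        min_score: gate.min_score,
        actual: actual ?? null,
      });
    }
  }
  return failures;
}
```

An empty result suggests `client.datasets.promote` is worth attempting. A non-empty one lists which evaluators to look at first.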

Talk to a solutions engineer

Atlas is sold to teams who commit to meaningful production volume. That commitment unlocks the reliability loop.