Reliability Loop (Pro)

Operations

Advanced, opt-in. An operation is a named, observable unit of work — the primary axis by which production traffic is divided. Datasets, evaluators, and ship gates are all scoped to an operation.

Note. The Reliability Loop is an opt-in Pro capability, not the default. Pay-as-you-go Atlas mode (cheaper-than-direct from call #1, the served model on every call, thumbs-down refunds) needs none of this. Reach for operations only when you want per-operation tuning and eval-gated auto-refund on top.

What is an operation

An operation maps to one stable prompt-template-plus-task combination in your application: classify_intent, extract_invoice, generate_followup_email. Its key is an immutable slug you pass in every API call. Everything downstream in the Reliability Loop — tagging, datasets, evaluators, ship gates, opt-in tuning — is scoped to this key.

Operations also carry an output schema that defines what a correct response looks like structurally, and a schema version that tracks when that structure changes. Together these two axes let you reason about two different kinds of drift: structural changes to the output format, and requirement changes that change what “correct” means without changing the format.

operation_id is mandatory

Required. Every call you want inside the Reliability Loop must include metadata.operation_id. Calls without one are recorded and billed normally, and still get Atlas-mode pricing and thumbs-down refunds, but cannot be filtered into datasets, evaluated, or used for opt-in tuning. The most common Pro setup bug is forgetting to set it on a new code path.

typescript
// Every call must carry operation_id.
// The key auto-registers if it doesn't exist yet.
await client.chat.completions.create({
  model: "atlas-1",
  messages,
  metadata: { operation_id: "summarize_ticket" },
});
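
Because a missing operation_id on a new code path is easy to overlook, one option is to build the metadata object through a small guard. The sketch below is a hypothetical helper, not part of any SDK; the name operationMetadata and the slug rule (lowercase snake_case) are assumptions made for illustration.

```typescript
// Hypothetical helper (not part of any SDK): builds the metadata object and
// throws when operation_id is missing or malformed.
// ASSUMPTION: keys are lowercase snake_case slugs; adjust to your own rule.
const OPERATION_SLUG = /^[a-z][a-z0-9_]*$/;

interface OperationMetadata {
  operation_id: string;
}

function operationMetadata(operationId: string): OperationMetadata {
  if (!OPERATION_SLUG.test(operationId)) {
    throw new Error(`Invalid or missing operation_id: "${operationId}"`);
  }
  return { operation_id: operationId };
}

// Usage: pass the result as `metadata` alongside `model` and `messages`.
const metadata = operationMetadata("summarize_ticket");
```

Routing every call site through one function like this turns a silent tagging gap into an immediate error during development.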

Auto-registration vs formal definition

Passing a new operation_id in metadata auto-registers a lightweight operation entry. You can start tagging traffic immediately, before deciding on schema or evaluators.

Formally defining the operation — via the console or API — is required before you can:

Attach an output schema
Configure evaluators and ship gates
Promote a reviewed dataset
Request opt-in tuning

typescript · minimal (no schema)
// Formally define the operation with a name and description.
// Required before you can attach evaluators or ship gates.
const op = await client.operations.create({
  key: "summarize_ticket",
  name: "Support ticket summary",
  description: "One-sentence summary fed into routing. Must not contain PII.",
});

Add an output schema when your operation’s response has a defined structure. The schema is used to validate dataset items and to scope evaluators to the correct response shape.

typescript · with output schema
// Add an output_schema to lock down the expected structure.
// Evaluators and datasets are validated against this schema.
const op = await client.operations.create({
  key: "summarize_ticket",
  name: "Support ticket summary",
  description: "One-sentence summary fed into routing. Must not contain PII.",
  output_schema: {
    type: "object",
    properties: {
      summary:  { type: "string", maxLength: 200 },
      priority: { type: "string", enum: ["low", "medium", "high"] },
    },
    required: ["summary", "priority"],
  },
});

Output schema

The output_schema field is a JSON Schema object describing what a structurally correct response looks like. It serves two purposes:

Validation — dataset items whose expected_output does not conform to the schema are rejected on ingest.
Evaluator scope — the JSON-schema evaluator kind validates model output against this schema automatically. Other evaluator kinds (regex, LLM-judge, embedding-match) can reference specific fields within the schema.

Omit output_schema for free-text operations where the only quality signal is semantic correctness, not structural conformance.
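
For a client-side sanity check before a response enters your own pipeline, the structural rules can be mirrored in a type guard. This is a minimal sketch that hand-codes the summarize_ticket schema from the example above; the names TicketSummary and isTicketSummary are hypothetical, and a general-purpose JSON Schema validator would be the more robust choice for anything larger.

```typescript
// Local mirror of the summarize_ticket output schema shown earlier.
// Hypothetical names; this is not generated from, or tied to, the platform.
interface TicketSummary {
  summary: string;
  priority: "low" | "medium" | "high";
}

const PRIORITIES: readonly string[] = ["low", "medium", "high"];

function isTicketSummary(value: unknown): value is TicketSummary {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const summary = v["summary"];
  const priority = v["priority"];
  return (
    typeof summary === "string" &&
    // JSON Schema maxLength counts code points; .length counts UTF-16 units,
    // so this check is slightly stricter for text with astral characters.
    summary.length <= 200 &&
    typeof priority === "string" &&
    PRIORITIES.indexOf(priority) !== -1
  );
}

// Parses raw model output and returns null when it is not valid JSON
// or does not match the expected structure.
function parseTicketSummary(raw: string): TicketSummary | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isTicketSummary(parsed) ? parsed : null;
  } catch (_e) {
    return null;
  }
}
```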

Schema changes

A schema change is a structural change to the output format: adding or removing a field, changing a type, narrowing an enum. When this happens, old datasets may be incompatible with evaluators written for the new shape.

The correct response is to bump schema_version on the operation and start tagging new production calls with it. Old calls remain queryable under their original version. New datasets target the new version.
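
To decide whether an edit counts as a schema change at all, it can help to diff the old and new schemas mechanically. The sketch below assumes a simplified, flat schema shape (only properties, type, and enum) and a hypothetical function name structuralChanges; it ignores changes to required and nested objects, so treat it as a starting point rather than a complete comparison.

```typescript
// Simplified schema shape covering only what this section uses.
// ASSUMPTION: flat object schemas; real JSON Schema needs a fuller comparison.
interface FieldSchema {
  type: string;
  enum?: string[];
}

interface ObjectSchema {
  properties: Record<string, FieldSchema>;
}

// Returns one description per structural change. An empty list suggests
// the edit is not structural and schema_version can stay as it is.
function structuralChanges(prev: ObjectSchema, next: ObjectSchema): string[] {
  const changes: string[] = [];
  for (const name of Object.keys(prev.properties)) {
    const before = prev.properties[name];
    const after = next.properties[name];
    if (before === undefined) continue;
    if (after === undefined) {
      changes.push(`removed field: ${name}`);
      continue;
    }
    if (before.type !== after.type) {
      changes.push(`type changed: ${name}`);
    }
    const allowed = after.enum;
    if (allowed !== undefined) {
      const previous = before.enum;
      // Narrowed if an enum was introduced or a previously allowed value is gone.
      const narrowed =
        previous === undefined || previous.some((v) => allowed.indexOf(v) === -1);
      if (narrowed) changes.push(`enum narrowed: ${name}`);
    }
  }
  for (const name of Object.keys(next.properties)) {
    if (prev.properties[name] === undefined) {
      changes.push(`added field: ${name}`);
    }
  }
  return changes;
}

const v1: ObjectSchema = {
  properties: {
    summary: { type: "string" },
    priority: { type: "string", enum: ["low", "medium", "high"] },
  },
};
const v2: ObjectSchema = {
  properties: {
    summary: { type: "string" },
    priority: { type: "string", enum: ["low", "medium", "high"] },
    confidence: { type: "number" },
  },
};

const changes = structuralChanges(v1, v2); // ["added field: confidence"]
```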

typescript · update operation schema
// v1 schema: { summary, priority }
// v2 schema adds confidence. Old datasets remain valid for v1 evaluators;
// new evaluators targeting v2 won't apply to v1 calls.
await client.operations.update("op_abc123", {
  output_schema: {
    type: "object",
    properties: {
      summary:    { type: "string", maxLength: 200 },
      priority:   { type: "string", enum: ["low", "medium", "high"] },
      confidence: { type: "number", minimum: 0, maximum: 1 },
    },
    required: ["summary", "priority", "confidence"],
  },
  schema_version: "2",
});

typescript · tag calls with schema_version
// When output_schema changes, bump schema_version on calls
// so production traffic is bucketed by the schema it was generated against.
await client.chat.completions.create({
  model: "atlas-1",
  messages,
  metadata: {
    operation_id: "summarize_ticket",
    schema_version: "2",           // new field: confidence
  },
});

You can filter calls by schema_version in the console and in dataset creation, so traffic from the old and new schemas never pollutes the same dataset.
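
The same separation can be reproduced locally when you export call records for your own analysis. The sketch below assumes a hypothetical CallRecord shape that carries the metadata fields used in this section; it is not a documented export format.

```typescript
// ASSUMPTION: a hypothetical record shape for exported calls.
interface CallRecord {
  id: string;
  operation_id: string;
  schema_version?: string;
}

// Groups calls by schema_version. Calls without a tag go into an
// "untagged" bucket instead of being guessed into a version.
function bucketBySchemaVersion(calls: CallRecord[]): Record<string, CallRecord[]> {
  // Null-prototype object so version strings never collide with Object builtins.
  const buckets: Record<string, CallRecord[]> = Object.create(null);
  for (const call of calls) {
    const version = call.schema_version === undefined ? "untagged" : call.schema_version;
    const bucket = buckets[version];
    if (bucket === undefined) {
      buckets[version] = [call];
    } else {
      bucket.push(call);
    }
  }
  return buckets;
}

const sampleCalls: CallRecord[] = [
  { id: "c1", operation_id: "summarize_ticket", schema_version: "1" },
  { id: "c2", operation_id: "summarize_ticket", schema_version: "2" },
  { id: "c3", operation_id: "summarize_ticket", schema_version: "1" },
  { id: "c4", operation_id: "summarize_ticket" },
];

const byVersion = bucketBySchemaVersion(sampleCalls);
```

Keeping untagged calls in their own bucket makes the gap visible, which is usually preferable to silently assigning them to the oldest or newest version.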

Note. Schema version is a routing and filtering tool, not a compatibility guarantee. Evaluators are written against a specific schema version. Promote datasets and evaluators together when bumping versions to avoid gates checking v1 outputs against v2 expectations.

Requirement changes

A requirement change is different: the output structure stays the same, but what “correct” means has changed. Examples:

You tightened the quality rubric — outputs now need higher confidence or stricter wording
You changed the system prompt and edge-case handling has shifted
A business rule changed what the correct priority classification is
You found a class of subtle errors not caught by the original evaluator

For requirement changes, do not bump schema_version. The existing dataset items are structurally valid — the same inputs and expected outputs still apply to the same schema. Instead, update the evaluator rubric and re-run evaluation against the existing dataset. The new rubric will produce different scores for the same items, revealing whether the dataset still meets the bar.

typescript · requirement change workflow
// Requirement change: same schema, tighter quality bar.
// Do NOT bump schema_version — the output structure is unchanged.
// Instead, update the evaluator rubric and re-evaluate existing datasets.
await client.evaluators.update("ev_judge_quality", {
  config: {
    rubric: `Score 0–1. A score of 1 requires:
  - Summary is one sentence, ≤ 200 characters
  - No PII (names, emails, account numbers)
  - Priority matches the ticket urgency (high = blocker, medium = degraded, low = cosmetic)
  - No filler phrases ("the customer is experiencing…")
  Deduct 0.3 per violation.`,
  },
});

// Then re-run the evaluation. Existing dataset items are still structurally
// valid; the new rubric simply applies stricter criteria to the same inputs.
// `ds` is the existing dataset for this operation, obtained earlier.
const evaluation = await client.evaluations.create({
  dataset_id: ds.id,
  evaluator_ids: ["ev_judge_quality"],
});

If re-evaluation shows the existing dataset no longer meets the tightened bar, you have two options: raise the ship gate’s min_score to reflect the new bar, or add more items to the dataset that demonstrate the new expected behavior before re-promoting.

Note. The distinction matters for tuning-signal quality. Schema-version tags keep a mix of v1 and v2 response shapes out of the same tuning set. Requirement changes leave version unchanged so the full production history of the operation contributes, not just calls made after the rubric was tightened.

Ship gates

Ship gates are per-operation rules that block dataset promotion until quality thresholds are met. Each gate binds an evaluator to a minimum score.

typescript
await client.operations.create({
  key: "summarize_ticket",
  name: "Support ticket summary",
  output_schema: { /* ... */ },
  ship_gates: [
    { evaluator_id: "ev_regex_pii",     min_score: 1.0  },  // zero tolerance
    { evaluator_id: "ev_judge_quality", min_score: 0.88 },  // tightened from 0.85
  ],
});

Promotion is atomic: all gates must pass. A failed gate raises ShipGatesUnmetError with the failing gates and their actual scores attached:

json · promotion blocked (response body)
{
  "error": "ship_gates_unmet",
  "failedGates": [
    {
      "evaluator_id": "ev_judge_quality",
      "score": 0.83,
      "min_score": 0.88
    }
  ]
}
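
The all-gates-must-pass rule can be previewed locally before attempting a promotion. The following is a minimal sketch of that check, assuming you already have one aggregate score per evaluator; the function name failedGates and the treatment of a missing score as a failure are assumptions, not documented platform behavior.

```typescript
interface ShipGate {
  evaluator_id: string;
  min_score: number;
}

interface FailedGate {
  evaluator_id: string;
  score: number | null; // null when no score was available for the evaluator
  min_score: number;
}

// Returns every gate whose score is below its threshold.
// ASSUMPTION: a gate with no score at all is treated as failing.
function failedGates(gates: ShipGate[], scores: Record<string, number>): FailedGate[] {
  const failed: FailedGate[] = [];
  for (const gate of gates) {
    const score = scores[gate.evaluator_id];
    if (score === undefined || score < gate.min_score) {
      failed.push({
        evaluator_id: gate.evaluator_id,
        score: score === undefined ? null : score,
        min_score: gate.min_score,
      });
    }
  }
  return failed;
}

const gates: ShipGate[] = [
  { evaluator_id: "ev_regex_pii", min_score: 1.0 },
  { evaluator_id: "ev_judge_quality", min_score: 0.88 },
];

// Same numbers as the blocked promotion above: only the quality gate fails.
const failed = failedGates(gates, { ev_regex_pii: 1.0, ev_judge_quality: 0.83 });
const canPromote = failed.length === 0; // false
```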

The standard CI pattern is to run an evaluation on every release branch and block promotion — and therefore opt-in tuning — until all gates pass. Gates are intended to be tightened over time as quality improves, never loosened.
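
The "never loosened" guideline can be enforced with a small check on proposed gate changes, for example as a CI step that runs before the configuration is applied. This is a sketch under assumptions: the function name loosenedGates is hypothetical, and removing a gate is treated as loosening it.

```typescript
interface Gate {
  evaluator_id: string;
  min_score: number;
}

// Returns the evaluator_ids whose gate would be loosened by the proposed
// configuration: either a lower min_score or the gate removed entirely.
function loosenedGates(current: Gate[], proposed: Gate[]): string[] {
  const loosened: string[] = [];
  for (const gate of current) {
    const next = proposed.filter((g) => g.evaluator_id === gate.evaluator_id)[0];
    if (next === undefined || next.min_score < gate.min_score) {
      loosened.push(gate.evaluator_id);
    }
  }
  return loosened;
}

const currentGates: Gate[] = [
  { evaluator_id: "ev_regex_pii", min_score: 1.0 },
  { evaluator_id: "ev_judge_quality", min_score: 0.85 },
];

// Tightening the quality gate from 0.85 to 0.88 reports nothing.
const tightened = loosenedGates(currentGates, [
  { evaluator_id: "ev_regex_pii", min_score: 1.0 },
  { evaluator_id: "ev_judge_quality", min_score: 0.88 },
]);

// Dropping it to 0.80 is flagged.
const relaxed = loosenedGates(currentGates, [
  { evaluator_id: "ev_regex_pii", min_score: 1.0 },
  { evaluator_id: "ev_judge_quality", min_score: 0.8 },
]);
```

A CI job could fail the build whenever the returned list is non-empty, leaving any deliberate loosening as an explicit, reviewed exception.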