Reliability loop
Operations
An operation is a named, observable unit of work — the primary axis by which production traffic is divided. Datasets, evaluators, and ship gates are all scoped to an operation.
What is an operation
An operation maps to one stable prompt-template-plus-task combination in your application: classify_intent, extract_invoice, generate_followup_email. Its key is an immutable slug you pass in every API call. Everything downstream in the reliability loop — tagging, datasets, evaluators, ship gates, training — is scoped to this key.
Operations also carry an output schema that defines what a correct response looks like structurally, and a schema version that tracks when that structure changes. Together these two axes let you reason about two different kinds of drift: structural changes to the output format, and requirement changes that change what “correct” means without changing the format.
operation_id is mandatory
Every call must carry metadata.operation_id. Calls without one are recorded and billed normally but cannot be filtered into datasets, evaluated, or used for training. The most common production bug is forgetting to set it on a new code path.
typescript
// Every call must carry operation_id.
// The key auto-registers if it doesn't exist yet.
await client.chat.completions.create({
  model: "atlas-1",
  messages,
  metadata: { operation_id: "summarize_ticket" },
});
Auto-registration vs formal definition
Passing a new operation_id in metadata auto-registers a lightweight operation entry. You can start tagging traffic immediately, before deciding on schema or evaluators.
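Because an unknown operation_id auto-registers instead of failing, a misspelled key would presumably create a new lightweight entry rather than raise an error. One way to guard against that, and against forgetting the field on a new code path, is to build the metadata object through a single helper. The following is a minimal sketch: OPERATION_KEYS, OperationKey, and operationMetadata are hypothetical names for illustration, not part of the SDK.

```typescript
// Hypothetical helper, not part of the SDK: one place that lists every
// operation key the application is allowed to send.
const OPERATION_KEYS = [
  "classify_intent",
  "extract_invoice",
  "summarize_ticket",
] as const;

type OperationKey = (typeof OPERATION_KEYS)[number];

interface OperationMetadata {
  operation_id: OperationKey;
  schema_version?: string;
}

// Builds the metadata object for a call. The runtime check covers keys that
// arrive as plain strings (config files, environment variables).
function operationMetadata(key: string, schemaVersion?: string): OperationMetadata {
  if ((OPERATION_KEYS as readonly string[]).indexOf(key) === -1) {
    throw new Error(`Unknown operation_id: "${key}"`);
  }
  const metadata: OperationMetadata = { operation_id: key as OperationKey };
  if (schemaVersion !== undefined) {
    metadata.schema_version = schemaVersion;
  }
  return metadata;
}
```

A call site would then pass `metadata: operationMetadata("summarize_ticket")`, so a typo fails locally instead of registering a stray operation.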
Formally defining the operation — via the console or API — is required before you can:
- Attach an output schema
- Configure evaluators and ship gates
- Promote a dataset to golden
- Request training
typescript · minimal (no schema)
// Formally define the operation with a name and description.
// Required before you can attach evaluators or ship gates.
const op = await client.operations.create({
  key: "summarize_ticket",
  name: "Support ticket summary",
  description: "One-sentence summary fed into routing. Must not contain PII.",
});
Add an output schema when your operation’s response has a defined structure. The schema is used to validate dataset items and to scope evaluators to the correct response shape.
typescript · with output schema
// Add an output_schema to lock down the expected structure.
// Evaluators and datasets are validated against this schema.
const op = await client.operations.create({
  key: "summarize_ticket",
  name: "Support ticket summary",
  description: "One-sentence summary fed into routing. Must not contain PII.",
  output_schema: {
    type: "object",
    properties: {
      summary: { type: "string", maxLength: 200 },
      priority: { type: "string", enum: ["low", "medium", "high"] },
    },
    required: ["summary", "priority"],
  },
});
Output schema
The output_schema field is a JSON Schema object describing what a structurally correct response looks like. It serves two purposes:
- Validation: dataset items whose expected_output does not conform to the schema are rejected on ingest.
- Evaluator scope: the JSON-schema evaluator kind validates model output against this schema automatically. Other evaluator kinds (regex, LLM-judge, embedding-match) can reference specific fields within the schema.
Omit output_schema for free-text operations where the only quality signal is semantic correctness, not structural conformance.
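Since non-conforming dataset items are rejected on ingest, it can help to check expected_output values locally before uploading them. The sketch below is a hand-rolled check that covers only the JSON Schema keywords used on this page (type, properties, required, maxLength, enum, minimum, maximum). validateOutput and the type names are hypothetical, and the platform's own validation may behave differently.

```typescript
// Hypothetical local check, not the platform's validator. It covers only the
// JSON Schema keywords used on this page.
type FieldSchema = {
  type: "string" | "number";
  maxLength?: number;
  enum?: string[];
  minimum?: number;
  maximum?: number;
};

type ObjectSchema = {
  type: "object";
  properties: Record<string, FieldSchema>;
  required?: string[];
};

// Returns a list of human-readable problems; an empty list means the value
// conforms to the subset of the schema this sketch understands.
function validateOutput(
  schema: ObjectSchema,
  value: Record<string, unknown>,
): string[] {
  const errors: string[] = [];
  for (const name of schema.required ?? []) {
    if (!(name in value)) errors.push(`missing required field "${name}"`);
  }
  for (const name of Object.keys(schema.properties)) {
    if (!(name in value)) continue; // absent optional field
    const field = schema.properties[name];
    const v = value[name];
    if (typeof v === "string" && field.type === "string") {
      // v.length counts UTF-16 code units, an approximation of maxLength.
      if (field.maxLength !== undefined && v.length > field.maxLength) {
        errors.push(`"${name}" exceeds maxLength ${field.maxLength}`);
      }
      if (field.enum !== undefined && field.enum.indexOf(v) === -1) {
        errors.push(`"${name}" is not one of ${field.enum.join(", ")}`);
      }
    } else if (typeof v === "number" && field.type === "number") {
      if (field.minimum !== undefined && v < field.minimum) {
        errors.push(`"${name}" is below minimum ${field.minimum}`);
      }
      if (field.maximum !== undefined && v > field.maximum) {
        errors.push(`"${name}" is above maximum ${field.maximum}`);
      }
    } else {
      errors.push(`"${name}" should be a ${field.type}`);
    }
  }
  return errors;
}

const ticketSchema: ObjectSchema = {
  type: "object",
  properties: {
    summary: { type: "string", maxLength: 200 },
    priority: { type: "string", enum: ["low", "medium", "high"] },
  },
  required: ["summary", "priority"],
};

// "urgent" is outside the enum, so this item is flagged before upload.
const problems = validateOutput(ticketSchema, {
  summary: "Login page returns 500 after password reset.",
  priority: "urgent",
});
console.log(problems);
```

For anything beyond these few keywords, a full JSON Schema validator is the safer choice; the point here is only to show what "structurally correct" means for the example schema.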
Schema changes
A schema change is a structural change to the output format: adding or removing a field, changing a type, narrowing an enum. When this happens, old datasets may be incompatible with evaluators written for the new shape.
The correct response is to bump schema_version on the operation and start tagging new production calls with it. Old calls remain queryable under their original version. New datasets target the new version.
typescript · update operation schema
// v1 schema: { summary, priority }
// v2 schema adds confidence. Old datasets remain valid for v1 evaluators;
// new evaluators targeting v2 won't apply to v1 calls.
await client.operations.update("op_abc123", {
  output_schema: {
    type: "object",
    properties: {
      summary: { type: "string", maxLength: 200 },
      priority: { type: "string", enum: ["low", "medium", "high"] },
      confidence: { type: "number", minimum: 0, maximum: 1 },
    },
    required: ["summary", "priority", "confidence"],
  },
  schema_version: "2",
});
typescript · tag calls with schema_version
// When output_schema changes, bump schema_version on calls
// so production traffic is bucketed by the schema it was generated against.
await client.chat.completions.create({
  model: "atlas-1",
  messages,
  metadata: {
    operation_id: "summarize_ticket",
    schema_version: "2", // new field: confidence
  },
});
You can filter calls by schema_version in the console and in dataset creation, so traffic from the old and new schemas never pollutes the same dataset.
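Whether a change counts as structural can be checked mechanically for the cases this section names: added or removed fields, changed types, and narrowed enums. The sketch below compares two schema shapes and lists those differences; a non-empty result suggests bumping schema_version. structuralChanges and the type names are hypothetical, and other constraints (such as maxLength) are deliberately left out.

```typescript
// Hypothetical helper: lists the structural differences named in this section
// (added or removed fields, changed types, narrowed enums). Other keywords
// are ignored.
type FieldShape = { type: string; enum?: string[] };
type SchemaShape = { properties: Record<string, FieldShape> };

function structuralChanges(before: SchemaShape, after: SchemaShape): string[] {
  const changes: string[] = [];
  for (const name of Object.keys(after.properties)) {
    if (!(name in before.properties)) changes.push(`added field "${name}"`);
  }
  for (const name of Object.keys(before.properties)) {
    if (!(name in after.properties)) {
      changes.push(`removed field "${name}"`);
      continue;
    }
    const oldField = before.properties[name];
    const newField = after.properties[name];
    if (oldField.type !== newField.type) {
      changes.push(`changed type of "${name}"`);
    }
    const newEnum = newField.enum;
    if (newEnum !== undefined) {
      const oldEnum = oldField.enum;
      // Narrowed: an enum was introduced, or a previously allowed value is gone.
      const narrowed =
        oldEnum === undefined || oldEnum.some((v) => newEnum.indexOf(v) === -1);
      if (narrowed) changes.push(`narrowed enum of "${name}"`);
    }
  }
  return changes;
}

const v1: SchemaShape = {
  properties: {
    summary: { type: "string" },
    priority: { type: "string", enum: ["low", "medium", "high"] },
  },
};

const v2: SchemaShape = {
  properties: {
    summary: { type: "string" },
    priority: { type: "string", enum: ["low", "medium", "high"] },
    confidence: { type: "number" },
  },
};

const changes = structuralChanges(v1, v2);
const needsVersionBump = changes.length > 0;
console.log(changes, needsVersionBump);
```

An empty list does not prove that nothing changed; it only means none of the three listed kinds of structural change was detected.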
Requirement changes
A requirement change is different: the output structure stays the same, but what “correct” means has changed. Examples:
- You tightened the quality rubric — outputs now need higher confidence or stricter wording
- You changed the system prompt and edge-case handling has shifted
- A business rule changed what the correct priority classification is
- You found a class of subtle errors not caught by the original evaluator
For requirement changes, do not bump schema_version. The existing dataset items are structurally valid — the same inputs and expected outputs still apply to the same schema. Instead, update the evaluator rubric and re-run evaluation against the existing dataset. The new rubric will produce different scores for the same items, revealing whether the dataset still meets the bar.
typescript · requirement change workflow
// Requirement change: same schema, tighter quality bar.
// Do NOT bump schema_version; the output structure is unchanged.
// Instead, update the evaluator rubric and re-evaluate existing datasets.
await client.evaluators.update("ev_judge_quality", {
  config: {
    rubric: `Score 0–1. A score of 1 requires:
- Summary is one sentence, ≤ 200 characters
- No PII (names, emails, account numbers)
- Priority matches the ticket urgency (high = blocker, medium = degraded, low = cosmetic)
- No filler phrases ("the customer is experiencing…")
Deduct 0.3 per violation.`,
  },
});
// Then re-run the evaluation. Existing dataset items are still structurally
// valid; the new rubric simply applies stricter criteria to the same inputs.
// `ds` is the existing dataset for this operation, created elsewhere.
const evaluation = await client.evaluations.create({
  dataset_id: ds.id,
  evaluator_ids: ["ev_judge_quality"],
});
If re-evaluation shows the existing dataset no longer meets the tightened bar, you have two options: raise the ship gate’s min_score to reflect the new bar, or add more items to the dataset that demonstrate the new expected behavior before re-promoting.
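The example rubric's arithmetic ("Deduct 0.3 per violation" on a 0–1 scale) can be written out to see how quickly per-item scores fall. This sketch mirrors only the arithmetic: the judge decides what counts as a violation, rubricScore is a hypothetical name, and clamping at 0 is an assumption based on the stated 0–1 range.

```typescript
// Mirrors the rubric's arithmetic only; the judge model decides what counts
// as a violation. Clamping at 0 is an assumption based on "Score 0–1".
function rubricScore(violations: number): number {
  const raw = 1 - 0.3 * violations;
  // Round to two decimals to avoid floating-point noise in the result.
  return Math.max(0, Math.round(raw * 100) / 100);
}

for (const violations of [0, 1, 2, 3, 4]) {
  console.log(violations, rubricScore(violations));
}
```

Under this arithmetic a single violation already drops an item to 0.7, which is worth keeping in mind when choosing a min_score for the gate.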
Ship gates
Ship gates are per-operation rules that block dataset promotion until quality thresholds are met. Each gate binds an evaluator to a minimum score.
typescript
await client.operations.create({
  key: "summarize_ticket",
  name: "Support ticket summary",
  output_schema: { /* ... */ },
  ship_gates: [
    { evaluator_id: "ev_regex_pii", min_score: 1.0 }, // zero tolerance
    { evaluator_id: "ev_judge_quality", min_score: 0.88 }, // tightened from 0.85
  ],
});
Promotion is atomic: all gates must pass. A failed gate raises ShipGatesUnmetError with the failing gates and their actual scores attached:
json
// Promotion blocked; response body:
{
  "error": "ship_gates_unmet",
  "failedGates": [
    {
      "evaluator_id": "ev_judge_quality",
      "score": 0.83,
      "min_score": 0.88
    }
  ]
}
The standard CI pattern is to run an evaluation on every release branch and block promotion, and therefore training, until all gates pass. Gates are intended to be tightened over time as the model improves, never loosened.
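The all-gates-must-pass rule can also be checked locally in a CI step before attempting promotion. The sketch below assumes the evaluation scores are already available as a map from evaluator ID to score; how they are fetched is not shown here, and failedGates and the interface names are hypothetical, not SDK types.

```typescript
// Hypothetical local pre-check, not an SDK call. A gate fails when its
// evaluator's score is missing or below min_score.
interface ShipGate {
  evaluator_id: string;
  min_score: number;
}

interface FailedGate {
  evaluator_id: string;
  min_score: number;
  score: number | null; // null when no score was produced for this evaluator
}

function failedGates(
  gates: ShipGate[],
  scores: Record<string, number>,
): FailedGate[] {
  const failed: FailedGate[] = [];
  for (const gate of gates) {
    const hasScore = gate.evaluator_id in scores;
    const score = hasScore ? scores[gate.evaluator_id] : null;
    if (score === null || score < gate.min_score) {
      failed.push({
        evaluator_id: gate.evaluator_id,
        min_score: gate.min_score,
        score,
      });
    }
  }
  return failed;
}

const gates: ShipGate[] = [
  { evaluator_id: "ev_regex_pii", min_score: 1.0 },
  { evaluator_id: "ev_judge_quality", min_score: 0.88 },
];

// Scores taken from the blocked-promotion example above.
const failed = failedGates(gates, {
  ev_regex_pii: 1.0,
  ev_judge_quality: 0.83,
});
const canPromote = failed.length === 0;
console.log(canPromote, failed);
```

A CI job could exit with a non-zero status whenever canPromote is false, which keeps the failure visible on the release branch rather than at promotion time.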