Datasets
Datasets are versioned, operation-scoped collections of input-output pairs. Promoting one to golden is the gate to per-tenant training.
Create a dataset
typescript
  const ds = await client.datasets.create({
    operation_id: "summarize_ticket",
    name: "Tickets v3 — golden candidate",
  });
A new dataset starts in draft status with version 1. Every promotion increments the version. Versions are immutable; if you need to edit a golden dataset, branch from it into a new draft.
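The lifecycle described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `DatasetRecord`, `createDraft`, `promoteRecord`, and `branchFrom` are hypothetical names, not SDK calls, and the choice to carry the parent's version into a branched draft is a guess rather than documented behavior.

```typescript
// Hypothetical in-memory model of the lifecycle; not the SDK's real types.
type LifecycleStatus = "draft" | "golden";

interface DatasetRecord {
  readonly id: string;
  readonly status: LifecycleStatus;
  readonly version: number;
  readonly items: readonly string[];
  readonly parentId?: string;
}

let idCounter = 0;
const makeId = (): string => `ds_local_${++idCounter}`;

// A new dataset starts in draft status with version 1.
function createDraft(items: readonly string[]): DatasetRecord {
  const draft: DatasetRecord = { id: makeId(), status: "draft", version: 1, items: [...items] };
  return Object.freeze(draft);
}

// Promotion returns a new frozen record with the version incremented.
// Ship-gate checks are deliberately left out of this sketch.
function promoteRecord(draft: DatasetRecord): DatasetRecord {
  if (draft.status !== "draft") {
    throw new Error("only a draft can be promoted");
  }
  const golden: DatasetRecord = { ...draft, status: "golden", version: draft.version + 1 };
  return Object.freeze(golden);
}

// Editing a golden dataset means branching: copy its items into a new draft.
// Assumption: the branch carries the parent's version forward.
function branchFrom(golden: DatasetRecord, extraItems: readonly string[]): DatasetRecord {
  const draft: DatasetRecord = {
    id: makeId(),
    status: "draft",
    version: golden.version,
    items: [...golden.items, ...extraItems],
    parentId: golden.id,
  };
  return Object.freeze(draft);
}

const firstDraft = createDraft(["item-a"]);
const goldenV2 = promoteRecord(firstDraft);
const nextDraft = branchFrom(goldenV2, ["item-b"]);
```

The golden record is never mutated: the branch gets its own id and item list, and the original stays available for evaluators that already depend on it.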
Add items
typescript
  await client.datasets.items.add(ds.id, [
    {
      source_call_id: "chatcmpl-abc123",
      input: { messages: [{ role: "user", content: "..." }] },
      expected_output: { summary: "Customer reports billing error on invoice #4421.", priority: "high" },
      weight: 1.0,
    },
  ]);
Items can be created from tagged production calls (most common) or from scratch. source_call_id is optional but recommended: it lets the console link items back to the original call for review.
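As a rough illustration of building items from tagged production calls, the sketch below maps recorded calls to item payloads. The `RecordedCall` shape and the `itemsFromCalls` helper are hypothetical (this page does not show the SDK's real call type), and using the recorded output as `expected_output` is only a starting point that a reviewer would correct.

```typescript
// Hypothetical shape of a recorded production call; the real type may differ.
interface RecordedCall {
  id: string;
  operation_id?: string;
  input: unknown;
  output: unknown;
}

// Mirrors the item fields used in the example above.
interface DatasetItemDraft {
  source_call_id?: string;
  input: unknown;
  expected_output: unknown;
  weight: number;
}

// Keep only calls tagged with the target operation; untagged calls are skipped.
function itemsFromCalls(calls: RecordedCall[], operationId: string): DatasetItemDraft[] {
  return calls
    .filter((call) => call.operation_id === operationId)
    .map((call) => ({
      source_call_id: call.id, // lets a reviewer trace the item back to its call
      input: call.input,
      expected_output: call.output, // assumption: reviewed and corrected before use
      weight: 1.0,
    }));
}

const sampleCalls: RecordedCall[] = [
  { id: "call-1", operation_id: "summarize_ticket", input: "ticket A", output: "summary A" },
  { id: "call-2", input: "ticket B", output: "summary B" }, // no operation_id
  { id: "call-3", operation_id: "classify_ticket", input: "ticket C", output: "label C" },
];

const candidateItems = itemsFromCalls(sampleCalls, "summarize_ticket");
```

The resulting array could then be passed to `client.datasets.items.add` once each `expected_output` has been reviewed.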
Items require an operation_id on their source call. Calls recorded without one cannot be filtered into datasets or used for training.
Schema version scoping
Datasets should target a single schema_version. When an operation’s output_schema changes structurally — a new field, a removed field, a changed type — the new schema_version is passed in API call metadata. Filter datasets to that version so evaluators written for the new shape are never tested against old traffic.
typescript
  // Filter production calls to only the current schema version
  // when building the dataset; never mix v1 and v2 shapes.
  const ds = await client.datasets.create({
    operation_id: "summarize_ticket",
    name: "Tickets v3 — schema v2",
    schema_version: "2", // only calls tagged with schema_version: "2" are eligible
  });
Old datasets remain valid for evaluators scoped to their original version. New evaluators targeting the new schema will not apply to older calls.
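One way to keep shapes from mixing is to bucket calls by schema version before any dataset is built. The sketch below is a minimal, client-side illustration; `VersionedCall` and `groupBySchemaVersion` are hypothetical names, and skipping calls that carry no `schema_version` is an assumption about how untagged traffic should be handled.

```typescript
// Hypothetical shape: a recorded call carrying its schema_version metadata.
interface VersionedCall {
  id: string;
  operation_id: string;
  schema_version?: string;
}

// Bucket calls by schema_version so each dataset draws from exactly one shape.
function groupBySchemaVersion(calls: VersionedCall[]): Map<string, VersionedCall[]> {
  const groups = new Map<string, VersionedCall[]>();
  for (const call of calls) {
    if (call.schema_version === undefined) {
      continue; // assumption: untagged calls cannot be scoped, so they are skipped
    }
    const bucket = groups.get(call.schema_version) ?? [];
    bucket.push(call);
    groups.set(call.schema_version, bucket);
  }
  return groups;
}

const recorded: VersionedCall[] = [
  { id: "c1", operation_id: "summarize_ticket", schema_version: "1" },
  { id: "c2", operation_id: "summarize_ticket", schema_version: "2" },
  { id: "c3", operation_id: "summarize_ticket", schema_version: "2" },
  { id: "c4", operation_id: "summarize_ticket" },
];

const byVersion = groupBySchemaVersion(recorded);
const eligibleForV2 = byVersion.get("2") ?? [];
```

Only the calls in `eligibleForV2` would feed a dataset created with `schema_version: "2"`; the version 1 bucket stays available for evaluators scoped to the older shape.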
Promote to golden
typescript
  try {
    await client.datasets.promote(ds.id);
  } catch (e) {
    if (e instanceof ShipGatesUnmetError) {
      console.error("Gates failed:", e.failedGates);
      // e.failedGates: [{ evaluator_id, score, min_score }]
    }
    throw e;
  }
Promotion is ship-gate enforced. The latest evaluation against the dataset is checked; every gate on the operation must pass. A failed gate raises ShipGatesUnmetError (TS) or ShipGatesUnmet (Python) with .failedGates attached. Gates are intended to be tightened over time, never loosened.
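To make the gate check concrete, here is a small local sketch of the comparison that promotion implies. It is not the service's implementation: `ShipGate`, `findFailedGates`, and `tightenGate` are hypothetical, the `ev_schema_valid` evaluator id is invented for the example, and two details are assumptions (a score equal to `min_score` passes, and a missing score counts as 0).

```typescript
// Hypothetical local types; the failed-gate fields mirror e.failedGates above.
interface ShipGate {
  evaluator_id: string;
  min_score: number;
}

interface FailedGate {
  evaluator_id: string;
  score: number;
  min_score: number;
}

// Every gate must pass; collect the ones that do not.
function findFailedGates(gates: ShipGate[], latestScores: Map<string, number>): FailedGate[] {
  const failed: FailedGate[] = [];
  for (const gate of gates) {
    // Assumption: an evaluator with no score in the latest evaluation scores 0.
    const score = latestScores.get(gate.evaluator_id) ?? 0;
    if (score < gate.min_score) {
      failed.push({ evaluator_id: gate.evaluator_id, score, min_score: gate.min_score });
    }
  }
  return failed;
}

// Gates are meant to be tightened, never loosened, so reject a lower threshold.
function tightenGate(gate: ShipGate, newMinScore: number): ShipGate {
  if (newMinScore < gate.min_score) {
    throw new Error(`refusing to loosen gate ${gate.evaluator_id}`);
  }
  return { ...gate, min_score: newMinScore };
}

const operationGates: ShipGate[] = [
  { evaluator_id: "ev_judge_quality", min_score: 0.8 },
  { evaluator_id: "ev_schema_valid", min_score: 1.0 }, // hypothetical evaluator id
];

const latestScores = new Map<string, number>([
  ["ev_judge_quality", 0.7],
  ["ev_schema_valid", 1.0],
]);

const failedGates = findFailedGates(operationGates, latestScores);
```

Running the same check locally before calling `promote` can save a round trip, but the server-side result remains the authority.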
Requirement changes
When the quality bar changes but the output structure stays the same — a tighter rubric, a new business rule, edge cases that weren’t caught before — do not create a new dataset. The existing items are structurally valid. Instead, update the evaluator rubric and re-run evaluation against the existing dataset.
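The distinction between a structural change and a requirement change can be sketched as a simple schema comparison. This is an illustration only: `FlatSchema` is a deliberately simplified, hypothetical stand-in for an operation's output_schema (field name mapped to a type name), and `isStructuralChange` and `nextStep` are not SDK functions.

```typescript
// Simplified, hypothetical schema representation: field name -> type name.
type FlatSchema = Record<string, string>;

// Structural change: a field was added, removed, renamed, or changed type.
function isStructuralChange(before: FlatSchema, after: FlatSchema): boolean {
  const beforeKeys = Object.keys(before);
  if (beforeKeys.length !== Object.keys(after).length) {
    return true; // a field was added or removed
  }
  // Same field count: any missing key or differing type is structural.
  return beforeKeys.some((key) => before[key] !== after[key]);
}

type NextStep = "new_schema_version_and_dataset" | "update_rubric_and_reevaluate";

// Route the change to the workflow this page describes for it.
function nextStep(before: FlatSchema, after: FlatSchema): NextStep {
  return isStructuralChange(before, after)
    ? "new_schema_version_and_dataset"
    : "update_rubric_and_reevaluate";
}

const currentSchema: FlatSchema = { summary: "string", priority: "string" };
const withNewField: FlatSchema = { summary: "string", priority: "string", category: "string" };
const sameShape: FlatSchema = { summary: "string", priority: "string" };

const afterAddingField = nextStep(currentSchema, withNewField);
const afterRubricOnlyChange = nextStep(currentSchema, sameShape);
```

A tighter rubric leaves the schema untouched, so it falls into the second branch: keep the dataset, update the evaluator, and re-run the evaluation.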
typescript · requirement change workflow
  // Requirement changed: tighter quality bar, same output structure.
  // Update the evaluator rubric, then re-run evaluation against the existing dataset.
  await client.evaluators.update("ev_judge_quality", {
    config: {
      rubric: `Score 0–1. A score of 1 requires:
- Summary is one sentence, ≤ 200 characters
- No PII (names, emails, account numbers)
- Priority matches urgency (high = blocker, medium = degraded, low = cosmetic)
Deduct 0.3 per violation.`,
    },
  });
  const evaluation = await client.evaluations.create({
    dataset_id: ds.id,
    evaluator_ids: ["ev_judge_quality"],
  });
  // If evaluation now passes the tightened bar, promote as normal.
  // If it fails, either raise min_score on the ship gate to match
  // the new bar, or add more items demonstrating the new expected behavior.
Request training
typescript
  await client.datasets.training(ds.id, {
    notes: "Q3 retrain. Tag the new modelVersion for routing on summarize_ticket.",
  });
Once a dataset is golden, you may request training. Training is sales-gated: requests create a row in our queue, a solutions engineer responds within one business day, and the resulting model version is registered against your organization. Training itself is performed offline; you will not be billed for failed pipeline runs.