Evaluators
Evaluators score model output. Atlas ships four kinds at v1 and accepts any combination on an evaluation run.
Evaluator kinds
Every evaluator implements run(input, predicted, expected?, item) → { score: 0..1, details? }. A score of 1 is a pass; 0 is a failure; anything in between is a soft result the evaluator decided to emit. We deliberately reject pass/fail booleans — soft scores let LLM judges and embedding matchers report confidence.
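As a rough illustration of that contract, the sketch below spells out the signature as TypeScript types and implements a toy evaluator against it. The type names (`EvalResult`, `Evaluator`), the `clampScore` helper, the string-typed arguments, and the synchronous `run` are all assumptions made for the example; they are not taken from the Atlas SDK.

```typescript
// Hypothetical types mirroring the run(...) signature described above.
interface EvalResult {
  score: number; // expected to lie in the range 0..1
  details?: Record<string, unknown>;
}

interface Evaluator<Item = unknown> {
  run(input: string, predicted: string, expected: string | undefined, item: Item): EvalResult;
}

// Clamp any number into 0..1 so an evaluator never emits an out-of-range score.
function clampScore(x: number): number {
  if (Number.isNaN(x)) return 0;
  return Math.min(1, Math.max(0, x));
}

// Toy evaluator: 1 on exact match, 0.5 when only case or surrounding
// whitespace differs, 0 otherwise.
const exactMatch: Evaluator = {
  run(_input, predicted, expected) {
    if (expected === undefined) {
      return { score: 0, details: { reason: "no expected value" } };
    }
    if (predicted === expected) return { score: 1 };
    if (predicted.trim().toLowerCase() === expected.trim().toLowerCase()) {
      return { score: clampScore(0.5), details: { reason: "differs only by case or whitespace" } };
    }
    return { score: 0 };
  },
};
```

The middle case is the kind of soft result a boolean could not express: the answer is neither a clean pass nor a clear failure.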
Regex
For binary structural checks. The classic use case is a PII pattern that must not appear in output.
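To make the pass/fail logic concrete, here is a minimal local sketch of how such a check could be scored. It assumes that `must_match: false` means the pattern must be absent, and that patterns use JavaScript regex syntax; `scoreRegex` and `RegexConfig` are hypothetical names, and Atlas's server-side implementation may differ.

```typescript
// Hypothetical local re-implementation, for illustration only.
interface RegexConfig {
  pattern: string;
  must_match: boolean;
}

function scoreRegex(
  predicted: string,
  config: RegexConfig,
): { score: number; details: { matched: boolean } } {
  const matched = new RegExp(config.pattern).test(predicted);
  // Pass (1) when the observed match state equals what the config demands.
  return { score: matched === config.must_match ? 1 : 0, details: { matched } };
}

// Same pattern as the PII example: a 3-2-4 digit group on word boundaries.
const ssnCheck: RegexConfig = {
  pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b",
  must_match: false,
};
```

Under these assumptions, `scoreRegex("Your SSN is 123-45-6789.", ssnCheck)` fails the check, while output without such a number passes.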
```typescript
await client.evaluators.create({
  name: "PII not present",
  kind: "regex",
  config: {
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b",
    must_match: false,
  },
});
```
JSON schema
For structured generation. Output that fails to parse as JSON scores 0. Output that parses but does not validate against the schema scores 0.5 (for example, the model produced JSON but missed a required field). A fully valid object scores 1.
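The three tiers can be sketched locally as follows. This is a deliberately reduced stand-in, not a JSON Schema validator: it checks only `required` keys and primitive `type` on top-level properties, and it ignores `format`. `MiniSchema` and `scoreJson` are hypothetical names introduced for the example.

```typescript
// Reduced schema shape: top-level object, `required`, primitive `type` per property.
interface MiniSchema {
  type: "object";
  required?: string[];
  properties?: Record<string, { type: "string" | "number" | "boolean"; format?: string }>;
}

function scoreJson(predicted: string, schema: MiniSchema): number {
  let value: unknown;
  try {
    value = JSON.parse(predicted);
  } catch {
    return 0; // tier 1: not parseable as JSON
  }
  // Parsed, but not an object at all: counts as parse-but-invalid.
  if (typeof value !== "object" || value === null || Array.isArray(value)) return 0.5;
  const obj = value as Record<string, unknown>;
  const missing = (schema.required ?? []).some((key) => !(key in obj));
  const wrongType = Object.entries(schema.properties ?? {}).some(
    ([key, prop]) => key in obj && typeof obj[key] !== prop.type,
  );
  return missing || wrongType ? 0.5 : 1; // tier 2 vs. tier 3
}

const invoiceSchema: MiniSchema = {
  type: "object",
  required: ["invoice_id", "amount", "due"],
  properties: {
    invoice_id: { type: "string" },
    amount: { type: "number" },
    due: { type: "string", format: "date" }, // `format` is not checked in this sketch
  },
};
```

With this sketch, an invoice that omits `due` lands in the 0.5 tier, as does one whose `amount` is a string rather than a number.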
```typescript
await client.evaluators.create({
  name: "Output is valid invoice JSON",
  kind: "json_schema",
  config: {
    schema: {
      type: "object",
      required: ["invoice_id", "amount", "due"],
      properties: {
        invoice_id: { type: "string" },
        amount: { type: "number" },
        due: { type: "string", format: "date" },
      },
    },
  },
});
```
LLM judge
For qualitative rubrics. The judge model receives input, predicted, and the rubric; we constrain its response with a JSON schema and parse { score, reasoning }. Use sparingly: judges are slower and more expensive than mechanical evaluators.
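The parsing step might look roughly like the sketch below, which accepts a raw judge reply and returns a verdict only when it has the `{ score, reasoning }` shape with a score in 0..1. `JudgeVerdict` and `parseJudgeReply` are hypothetical names, and returning `null` on a malformed reply is an assumption; this page does not say how Atlas handles that case.

```typescript
interface JudgeVerdict {
  score: number;
  reasoning: string;
}

// Returns null when the raw reply does not match the expected shape.
function parseJudgeReply(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // reply was not JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const { score, reasoning } = data as Record<string, unknown>;
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 1) {
    return null; // score missing or outside 0..1
  }
  if (typeof reasoning !== "string") return null;
  return { score, reasoning };
}
```

Validating the range explicitly matters because a schema-constrained reply can still carry a number such as 2 or -1 unless the schema itself bounds it.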
```typescript
await client.evaluators.create({
  name: "Tone matches house style",
  kind: "llm_judge",
  config: {
    rubric: "Score 1 if the response is concise, technical, and avoids exclamation. Score 0 otherwise.",
    judgeModel: "atlas-1",
  },
});
```
Embedding match
For paraphrase tolerance against a gold answer. Cosine similarity is normalized to 0..1 and compared against the threshold. Requires expected_output on each dataset item.
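For intuition, the comparison can be sketched on plain number arrays. The mapping `(cos + 1) / 2` used here is one common way to bring cosine similarity from -1..1 into 0..1; it is an assumption, since this page does not say which normalization Atlas applies or how the threshold result is turned into the emitted score. All function names are hypothetical.

```typescript
// Plain cosine similarity between two equal-length vectors, in -1..1.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumed normalization from -1..1 to 0..1.
function normalizedSimilarity(a: number[], b: number[]): number {
  return (cosine(a, b) + 1) / 2;
}

function meetsThreshold(a: number[], b: number[], threshold: number): boolean {
  return normalizedSimilarity(a, b) >= threshold;
}
```

Under this mapping, orthogonal vectors come out at 0.5, so a threshold such as 0.86 demands a strongly positive raw cosine.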
```typescript
await client.evaluators.create({
  name: "Semantically equivalent to gold",
  kind: "embedding_match",
  config: {
    threshold: 0.86,
    embeddingModel: "atlas-embed-1",
  },
});
```
Running an evaluation
```typescript
// `ds` is a previously created dataset.
const run = await client.evaluations.create({
  dataset_id: ds.id,
  evaluator_ids: ["ev_pii", "ev_schema", "ev_judge"],
});
// → run.status: "running"
// Once the run has finished:
const final = await client.evaluations.retrieve(run.id);
// → final.summaryScores: { overall: 0.91, per_evaluator: {...} }
```
Small datasets (fewer than 200 items) run synchronously. Larger runs are queued; poll /api/v1/evaluations/{id} for status. Results persist per item and are linked back to the originating dataset item.
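For queued runs, the polling step could be wrapped in a small helper like the one sketched here. It takes the retrieve call as a parameter so it does not depend on any particular client. The helper name, the default interval and attempt limit, and the set of non-terminal statuses are assumptions: only "running" appears on this page, and "queued" is a guess.

```typescript
interface EvaluationRun {
  id: string;
  status: string;
}

// Assumed non-terminal statuses; adjust to whatever the API actually returns.
const PENDING_STATUSES = new Set(["running", "queued"]);

async function waitForEvaluation<T extends EvaluationRun>(
  retrieve: (id: string) => Promise<T>,
  id: string,
  opts: { intervalMs?: number; maxAttempts?: number } = {},
): Promise<T> {
  const { intervalMs = 2000, maxAttempts = 150 } = opts;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = await retrieve(id);
    if (!PENDING_STATUSES.has(run.status)) return run; // finished one way or another
    // Wait before the next poll to avoid hammering the endpoint.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`evaluation ${id} still pending after ${maxAttempts} polls`);
}
```

With the client from the example above, a call might read `await waitForEvaluation((id) => client.evaluations.retrieve(id), run.id)`. The attempt cap keeps a stuck run from blocking the caller indefinitely.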