Reliability Loop (Pro)

Evaluators

Advanced, opt-in. Evaluators score model output. Atlas ships four kinds at v1 and accepts any combination on an evaluation run.

Note: Evaluators are part of the opt-in Reliability Loop (Pro), not the default. Pay-as-you-go Atlas mode (cheaper than direct from call #1, thumbs-down refunds) needs none of this. Add evaluators only when you want eval-gated auto-refund and opt-in tuning on an operation.

Evaluator kinds

Every evaluator implements run(input, predicted, expected?, item) → { score: 0..1, details? }. A score of 1 is a pass; 0 is a failure; anything in between is a soft result the evaluator decided to emit. We deliberately reject pass/fail booleans — soft scores let LLM judges and embedding matchers report confidence.
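The contract above can be sketched as a TypeScript interface. This is an illustration only: the type names `EvalResult` and `Evaluator`, the helper `clampScore`, the parameter types, and the synchronous return are all assumptions, not exports of the Atlas SDK.

```typescript
// Hypothetical types mirroring run(input, predicted, expected?, item).
// None of these names come from the Atlas SDK.
interface EvalResult {
  score: number; // always within 0..1
  details?: Record<string, unknown>;
}

interface Evaluator<Item = unknown> {
  run(
    input: unknown,
    predicted: string,
    expected: string | undefined,
    item: Item,
  ): EvalResult;
}

// Clamp any raw number into the 0..1 range the contract requires.
function clampScore(raw: number): number {
  if (Number.isNaN(raw)) return 0;
  return Math.min(1, Math.max(0, raw));
}

// Toy evaluator: 1 when the trimmed output equals the trimmed reference.
const exactMatch: Evaluator = {
  run(_input, predicted, expected) {
    if (expected === undefined) {
      return { score: 0, details: { reason: "no expected value" } };
    }
    const same = predicted.trim() === expected.trim();
    return { score: clampScore(same ? 1 : 0) };
  },
};
```

Returning a number rather than a boolean keeps this mechanical evaluator interchangeable with ones that emit soft scores.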

Regex

For binary structural checks. The classic use case is a PII pattern that must not appear in output.

```typescript
await client.evaluators.create({
  name: "PII not present",
  kind: "regex",
  config: {
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b",
    must_match: false,
  },
});
```
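To make the `must_match: false` semantics concrete, here is a minimal local sketch of how such a check could score an output. The function `scoreRegex` is hypothetical, and it assumes that with `must_match: false` any match scores 0 and no match scores 1. Atlas's server-side implementation may differ.

```typescript
// Hypothetical local equivalent of a regex evaluator; not Atlas code.
interface RegexConfig {
  pattern: string;
  must_match: boolean;
}

function scoreRegex(
  predicted: string,
  config: RegexConfig,
): { score: number; details: { matched: boolean } } {
  const matched = new RegExp(config.pattern).test(predicted);
  // Pass when the observed match state equals the required one.
  return { score: matched === config.must_match ? 1 : 0, details: { matched } };
}

const ssnCheck: RegexConfig = {
  pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b",
  must_match: false,
};

const leaked = scoreRegex("Your SSN is 123-45-6789.", ssnCheck);
const clean = scoreRegex("No identifiers here.", ssnCheck);
console.log(leaked.score, clean.score); // prints "0 1"
```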

JSON schema

For structured generation. Parse failure scores 0, parse-but-invalid scores 0.5 (the model produced JSON but missed a field), and a fully valid object scores 1.

```typescript
await client.evaluators.create({
  name: "Output is valid invoice JSON",
  kind: "json_schema",
  config: {
    schema: {
      type: "object",
      required: ["invoice_id", "amount", "due"],
      properties: {
        invoice_id: { type: "string" },
        amount: { type: "number" },
        due: { type: "string", format: "date" },
      },
    },
  },
});
```
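The three-tier scoring (0, 0.5, 1) can be illustrated with a small local sketch. It is deliberately not a full JSON Schema validator: the hypothetical `scoreJson` checks only `required` keys and primitive `type`s and ignores `format`. A real implementation would presumably delegate to a complete validator.

```typescript
// Simplified stand-in for JSON Schema validation; illustration only.
type FieldType = "string" | "number";

interface MiniSchema {
  required: string[];
  properties: Record<string, { type: FieldType }>;
}

function scoreJson(predicted: string, schema: MiniSchema): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(predicted);
  } catch {
    return 0; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return 0.5; // valid JSON, but not an object
  }
  const obj = parsed as Record<string, unknown>;
  const hasRequired = schema.required.every((key) => key in obj);
  const typesOk = Object.entries(schema.properties).every(
    ([key, spec]) => !(key in obj) || typeof obj[key] === spec.type,
  );
  return hasRequired && typesOk ? 1 : 0.5;
}

const invoiceSchema: MiniSchema = {
  required: ["invoice_id", "amount", "due"],
  properties: {
    invoice_id: { type: "string" },
    amount: { type: "number" },
    due: { type: "string" },
  },
};

console.log(scoreJson("not json", invoiceSchema)); // prints 0
console.log(scoreJson('{"invoice_id":"A1"}', invoiceSchema)); // prints 0.5
```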

LLM judge

For qualitative rubrics. The judge model receives input, predicted, and the rubric; we constrain its response with a JSON schema and parse { score, reasoning }. Use sparingly: judges are slower and more expensive than mechanical evaluators.

```typescript
await client.evaluators.create({
  name: "Tone matches house style",
  kind: "llm_judge",
  config: {
    rubric: "Score 1 if the response is concise, technical, and avoids exclamation marks. Score 0 otherwise.",
    judgeModel: "atlas-1",
  },
});
```
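As a rough illustration of the parse step, the sketch below reads a judge response into { score, reasoning } and rejects anything malformed. The names `JudgeVerdict` and `parseJudgeResponse` are hypothetical, and returning `null` on a malformed response is an assumption. How Atlas handles judge failures is not specified here.

```typescript
// Hypothetical parser for a judge response; not part of the Atlas SDK.
interface JudgeVerdict {
  score: number; // clamped to 0..1
  reasoning: string;
}

function parseJudgeResponse(raw: string): JudgeVerdict | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // judge did not return JSON
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const { score, reasoning } = parsed as Record<string, unknown>;
  if (typeof score !== "number" || typeof reasoning !== "string") return null;
  // Keep the score inside the range every evaluator must emit.
  return { score: Math.min(1, Math.max(0, score)), reasoning };
}

const verdict = parseJudgeResponse('{"score": 1, "reasoning": "Concise and technical."}');
console.log(verdict?.score); // prints 1
```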

Embedding match

For paraphrase tolerance against a reference answer. Cosine similarity is normalized to 0..1 and compared against the threshold. Requires expected_output on each dataset item.

```typescript
await client.evaluators.create({
  name: "Semantically equivalent to reference",
  kind: "embedding_match",
  config: {
    threshold: 0.86,
    embeddingModel: "atlas-embed-1",
  },
});
```
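The comparison can be sketched locally as follows. Two assumptions are baked in: raw cosine similarity (range -1..1) is mapped to 0..1 with (cos + 1) / 2, and a similarity at or above the threshold counts as a match. The exact normalization Atlas uses is not stated here, and the function names are hypothetical.

```typescript
// Cosine similarity between two equal-length vectors, in -1..1.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumed normalization: (cos + 1) / 2 maps -1..1 onto 0..1.
function embeddingMatch(
  predicted: number[],
  expected: number[],
  threshold: number,
): { similarity: number; passed: boolean } {
  const similarity = (cosine(predicted, expected) + 1) / 2;
  return { similarity, passed: similarity >= threshold };
}

const orthogonal = embeddingMatch([1, 0], [0, 1], 0.86);
console.log(orthogonal.similarity, orthogonal.passed); // prints "0.5 false"
```

Under this normalization, unrelated (orthogonal) vectors land at 0.5 rather than 0, which is worth keeping in mind when choosing a threshold.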

Running an evaluation

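```typescript
// `ds` is a dataset created earlier; the IDs are evaluators you have already created.
const run = await client.evaluations.create({
  dataset_id: ds.id,
  evaluator_ids: ["ev_pii", "ev_schema", "ev_judge"],
});
// → run.status: "running"

// Fetch the run again once it has finished to read the aggregated scores.
const final = await client.evaluations.retrieve(run.id);
// → final.summaryScores: { overall: 0.91, per_evaluator: {...} }
```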

Small datasets (fewer than 200 items) run synchronously. Larger runs are queued, and you poll /api/v1/evaluations/{id} for status. Results persist per item and are linked back to the originating dataset item.
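For queued runs, a polling loop might look like the sketch below. It takes the retrieve call as a parameter so it stays independent of the client. The helper `pollUntilDone`, the `EvaluationRun` shape, the "queued" status value, and the interval and attempt defaults are all assumptions. Only "running" appears in the example above, so the loop treats any other status as terminal.

```typescript
// Hypothetical minimal shape of an evaluation run.
interface EvaluationRun {
  id: string;
  status: string;
}

// Poll until the run leaves an in-progress state or attempts run out.
async function pollUntilDone(
  retrieve: (id: string) => Promise<EvaluationRun>,
  id: string,
  opts: { intervalMs: number; maxAttempts: number } = { intervalMs: 2000, maxAttempts: 150 },
): Promise<EvaluationRun> {
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    const run = await retrieve(id);
    // "queued" is an assumed status name; "running" comes from the example above.
    if (run.status !== "running" && run.status !== "queued") return run;
    await new Promise((resolve) => setTimeout(resolve, opts.intervalMs));
  }
  throw new Error(`evaluation ${id} still in progress after ${opts.maxAttempts} polls`);
}
```

With the client from the earlier examples, usage would be along the lines of `await pollUntilDone((id) => client.evaluations.retrieve(id), run.id)`.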