Evaluators

Evaluators score model output. Atlas ships four kinds at v1 and accepts any combination on an evaluation run.

Evaluator kinds

Every evaluator implements run(input, predicted, expected?, item) → { score: 0..1, details? }. A score of 1 is a pass; 0 is a failure; anything in between is a soft result the evaluator decided to emit. We deliberately reject pass/fail booleans — soft scores let LLM judges and embedding matchers report confidence.
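The contract above can be sketched as a TypeScript interface. This is a minimal illustration, not Atlas's actual type definitions: the names `EvalResult`, `DatasetItem`, `Evaluator`, `clampScore`, and `tokenOverlap` are hypothetical, `run` is assumed to be synchronous, and `expected` is typed as `string | undefined` because TypeScript does not allow an optional parameter before a required one.

```typescript
// Hypothetical types mirroring run(input, predicted, expected?, item).
interface EvalResult {
  score: number; // expected to lie in the 0..1 range
  details?: Record<string, unknown>;
}

interface DatasetItem {
  id: string;
}

interface Evaluator {
  run(
    input: string,
    predicted: string,
    expected: string | undefined,
    item: DatasetItem,
  ): EvalResult;
}

// Clamp any raw value into the 0..1 range the contract requires.
function clampScore(raw: number): number {
  if (isNaN(raw)) return 0;
  return Math.min(1, Math.max(0, raw));
}

// Lowercase, split on whitespace, drop empty strings and duplicates.
function uniqueTokens(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((t, i, all) => t.length > 0 && all.indexOf(t) === i);
}

// Illustrative soft-scoring evaluator: the fraction of expected tokens
// that also appear in the predicted output.
const tokenOverlap: Evaluator = {
  run(_input, predicted, expected, item) {
    if (expected === undefined) {
      return { score: 0, details: { item_id: item.id, reason: "no expected output" } };
    }
    const want = uniqueTokens(expected);
    const got = uniqueTokens(predicted);
    if (want.length === 0) {
      return { score: 0, details: { item_id: item.id, reason: "empty expected output" } };
    }
    const hits = want.filter((t) => got.indexOf(t) !== -1).length;
    return { score: clampScore(hits / want.length), details: { item_id: item.id, hits } };
  },
};

const result = tokenOverlap.run("q", "the cat sat", "the cat slept", { id: "item_1" });
console.log(result.score); // two of three expected tokens matched
```

The token-overlap scorer exists only to show what a soft result looks like; it is not one of the four shipped kinds.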

Regex

For binary structural checks. The classic use case is a PII pattern that must not appear in output.

await client.evaluators.create({
  name: "PII not present",
  kind: "regex",
  config: {
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b",
    must_match: false,
  },
});
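To make the effect of `must_match` concrete, here is a local sketch of how a regex evaluator of this shape might score output. It assumes JavaScript `RegExp` semantics and a strictly binary score; the regex dialect Atlas uses is not stated here, and `RegexConfig` and `scoreRegex` are hypothetical names.

```typescript
// Hypothetical shape of the regex evaluator config shown above.
interface RegexConfig {
  pattern: string;
  must_match: boolean;
}

// Score 1 when the presence (or absence) of a match agrees with must_match,
// otherwise 0. Assumes JavaScript regular-expression semantics.
function scoreRegex(
  predicted: string,
  config: RegexConfig,
): { score: number; details: { matched: boolean } } {
  const matched = new RegExp(config.pattern).test(predicted);
  return { score: matched === config.must_match ? 1 : 0, details: { matched } };
}

const ssnGuard: RegexConfig = {
  pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b",
  must_match: false, // the pattern must NOT appear in the output
};

const clean = scoreRegex("Your order has shipped.", ssnGuard);
const leaky = scoreRegex("SSN on file: 123-45-6789", ssnGuard);
console.log(clean.score, leaky.score); // 1 0
```

With `must_match: true` the same function would instead require the pattern to be present, which suits checks such as "the output must contain an order number".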

JSON schema

For structured generation. Parse failure scores 0, parse-but-invalid scores 0.5 (the model produced JSON but missed a field), and a fully valid object scores 1.

await client.evaluators.create({
  name: "Output is valid invoice JSON",
  kind: "json_schema",
  config: {
    schema: {
      type: "object",
      required: ["invoice_id", "amount", "due"],
      properties: {
        invoice_id: { type: "string" },
        amount: { type: "number" },
        due: { type: "string", format: "date" },
      },
    },
  },
});
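The three-tier scoring can be sketched locally as follows. This is a deliberately simplified stand-in for a real JSON Schema validator: it checks only `required` fields and primitive `type` values and ignores `format`. `MiniSchema` and `scoreJson` are hypothetical names, and treating a non-object top-level value as "parsed but invalid" is an assumption.

```typescript
type PrimitiveType = "string" | "number" | "boolean";

// Hypothetical, reduced subset of JSON Schema used only for this sketch.
interface MiniSchema {
  type: "object";
  required: string[];
  properties: Record<string, { type: PrimitiveType; format?: string }>;
}

function scoreJson(
  predicted: string,
  schema: MiniSchema,
): { score: number; details: { errors: string[] } } {
  let value: unknown;
  try {
    value = JSON.parse(predicted);
  } catch {
    // Tier 1: the output is not JSON at all.
    return { score: 0, details: { errors: ["output is not parseable JSON"] } };
  }

  const errors: string[] = [];
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    // Assumption: parsed, but not an object, counts as "parse-but-invalid".
    return { score: 0.5, details: { errors: ["top-level value is not an object"] } };
  }

  const obj = value as Record<string, unknown>;
  for (const key of schema.required) {
    if (!(key in obj)) errors.push(`missing required field: ${key}`);
  }
  for (const key of Object.keys(schema.properties)) {
    const spec = schema.properties[key];
    if (spec && key in obj && typeof obj[key] !== spec.type) {
      errors.push(`wrong type for field: ${key}`);
    }
  }

  // Tier 2: parsed but invalid (0.5). Tier 3: fully valid (1).
  return { score: errors.length === 0 ? 1 : 0.5, details: { errors } };
}

const invoiceSchema: MiniSchema = {
  type: "object",
  required: ["invoice_id", "amount", "due"],
  properties: {
    invoice_id: { type: "string" },
    amount: { type: "number" },
    due: { type: "string", format: "date" }, // format is not checked in this sketch
  },
};

console.log(scoreJson("Sure! Here is your invoice", invoiceSchema).score); // 0
console.log(scoreJson('{"invoice_id":"inv_1","amount":12.5}', invoiceSchema).score); // 0.5
console.log(
  scoreJson('{"invoice_id":"inv_1","amount":12.5,"due":"2025-01-31"}', invoiceSchema).score,
); // 1
```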

LLM judge

For qualitative rubrics. The judge model receives input, predicted, and the rubric; we constrain its response with a JSON schema and parse { score, reasoning }. Use sparingly: judges are more expensive and slower than mechanical evaluators.

await client.evaluators.create({
  name: "Tone matches house style",
  kind: "llm_judge",
  config: {
    rubric: "Score 1 if the response is concise, technical, and avoids exclamation. Score 0 otherwise.",
    judgeModel: "atlas-1",
  },
});
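The parsing step described above might look roughly like this. It is a sketch under assumptions: `JudgeVerdict` and `parseJudgeResponse` are hypothetical names, clamping out-of-range scores into 0..1 is a choice made for illustration, and what Atlas does when a judge response fails to parse is not stated here, so the function simply returns `null` and leaves that decision to the caller.

```typescript
// Hypothetical shape of the parsed judge response.
interface JudgeVerdict {
  score: number;
  reasoning: string;
}

// Parse the judge model's raw text into { score, reasoning }.
// Returns null when the text is not JSON or does not match the expected shape.
function parseJudgeResponse(raw: string): JudgeVerdict | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const { score, reasoning } = parsed as Record<string, unknown>;
  if (typeof score !== "number" || typeof reasoning !== "string") return null;

  // Assumption: scores outside 0..1 are clamped rather than rejected.
  return { score: Math.min(1, Math.max(0, score)), reasoning };
}

const verdict = parseJudgeResponse(
  '{"score": 1, "reasoning": "Concise, technical, no exclamation marks."}',
);
console.log(verdict);
```

Even with a schema constraining the judge, validating the parsed object keeps a malformed response from silently turning into a score.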

Embedding match

For paraphrase tolerance against a gold answer. Cosine similarity is normalized to 0..1 and compared against the threshold. Requires expected_output on each dataset item.

await client.evaluators.create({
  name: "Semantically equivalent to gold",
  kind: "embedding_match",
  config: {
    threshold: 0.86,
    embeddingModel: "atlas-embed-1",
  },
});
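As a worked illustration of the normalization and threshold check, the sketch below assumes the common linear mapping from cosine similarity in [-1, 1] to [0, 1], that is `(cos + 1) / 2`. The exact normalization Atlas applies, and whether the emitted score is the similarity itself or a thresholded value, are not specified here, so the function returns both. `cosineSimilarity` and `embeddingMatch` are hypothetical names.

```typescript
// Plain cosine similarity between two equal-length vectors, in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumption: normalize with (cos + 1) / 2, then compare to the threshold.
function embeddingMatch(
  predicted: number[],
  expected: number[],
  threshold: number,
): { similarity: number; passed: boolean } {
  const similarity = (cosineSimilarity(predicted, expected) + 1) / 2;
  return { similarity, passed: similarity >= threshold };
}

// Toy two-dimensional "embeddings"; real vectors come from the embedding model.
console.log(embeddingMatch([3, 4], [3, 4], 0.86)); // identical direction
console.log(embeddingMatch([1, 0], [0, 1], 0.86)); // orthogonal
```

Under this mapping, orthogonal vectors land at 0.5, so a threshold such as 0.86 corresponds to a raw cosine of 0.72.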

Running an evaluation

// `ds` is assumed to be a dataset created earlier.
const run = await client.evaluations.create({
  dataset_id: ds.id,
  evaluator_ids: ["ev_pii", "ev_schema", "ev_judge"],
});
// → run.status: "running"

const final = await client.evaluations.retrieve(run.id);
// → final.summaryScores: { overall: 0.91, per_evaluator: {...} }

Small datasets (fewer than 200 items) run synchronously. Larger runs are queued, and you poll /api/v1/evaluations/{id} for status. Results persist per item and link back to the originating dataset item.
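For queued runs, a polling helper along the following lines may be useful. It is a sketch under stated assumptions: the `EvaluationsApi` interface only mirrors the `retrieve` call used above, `"running"` is the one status value shown on this page, `"queued"` is a guessed status string, and any other status is treated as terminal. The interval and attempt limits are arbitrary defaults.

```typescript
// Hypothetical minimal view of the client surface used for polling.
interface EvaluationRun {
  id: string;
  status: string;
}

interface EvaluationsApi {
  retrieve(id: string): Promise<EvaluationRun>;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Poll until the run leaves a pending state or the attempt budget is spent.
// Assumption: "queued" and "running" are the only non-terminal statuses.
async function waitForEvaluation(
  api: EvaluationsApi,
  id: string,
  options: { intervalMs: number; maxAttempts: number } = { intervalMs: 2000, maxAttempts: 30 },
): Promise<EvaluationRun> {
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    const run = await api.retrieve(id);
    if (run.status !== "queued" && run.status !== "running") {
      return run;
    }
    await sleep(options.intervalMs);
  }
  throw new Error(`evaluation ${id} did not finish after ${options.maxAttempts} attempts`);
}
```

With the client from the examples above, this would be called as `await waitForEvaluation(client.evaluations, run.id)`, provided `client.evaluations` satisfies the assumed interface. A fixed interval keeps the sketch short; a production loop would more likely back off between attempts.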