Benchmarks

Same task. 67% less.

Five representative public benchmarks: summarization, classification, structured extraction, code review, RAG-grounded QA. Atlas-1 Standard tier vs each task’s frontier baseline. Methodology, harness, and raw outputs are published — clone the repo and reproduce them yourself.

See the leaderboard →

How we evaluate

Per-task results

Cost savings, not leaderboard scores.

Each row pits the frontier baseline against Atlas-1 on a representative public dataset. The evaluator decides whether quality held; the cost column shows what you'd pay for the same workload.

Total savings

$26.75

71.7% on 3,580 prompts

Baseline spend

$37.31

Frontier baselines, direct

Atlas spend

$10.56

avg eval 0.756 vs baseline 0.786 (-3.8%)

Task	Dataset and metric	Baseline	Atlas-1 routed to	Baseline cost	Atlas-1 cost	Baseline score	Atlas-1 score	Savings
Summarization · XSum	200-article subset of the XSum corpus. Score is ROUGE-1 F1 against gold summaries.	openai/gpt-5.5	meta-llama/llama-3.1-70b-instruct (q8 · Hyperbolic)	$11.42	$1.36	0.437	0.405	88%
Classification · Banking77	Full Banking77 test set: 3,080 customer-service intents across 77 classes. Score is accuracy@1.	anthropic/claude-sonnet-4.6	qwen/qwen-2.5-72b-instruct (q4 · DeepInfra)	$12.38	$3.94	0.902	0.886	68%
Structured extraction · Invoices	50 invoice fixtures, json-schema validator measures pass rate at exact field match.	openai/gpt-5.5	deepseek/deepseek-chat-v3 (q8 · Together)	$1.95	$0.21	0.980	0.940	89%
Code review · PR comments	100 redacted GitHub PRs with expected reviewer comments. LLM-judge score against gold rubric.	anthropic/claude-opus-4.7	qwen/qwen-2.5-coder-32b-instruct (q8 · Hyperbolic)	$5.14	$4.10	0.853	0.821	20%
RAG-grounded QA · Natural Questions	150 NQ open-domain questions with retrieved-passage context. EM + F1 token overlap.	openai/gpt-5.5	meta-llama/llama-3.1-70b-instruct (q8 · Hyperbolic)	$6.42	$0.95	0.756	0.728	85%
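
The headline figures can be re-derived from the per-task rows. The following is a minimal sketch with the values copied by hand from the table above; the function name and tuple layout are illustrative and not part of the published harness. It assumes the average eval scores are an unweighted mean over the five tasks, which is consistent with the published 0.756 and 0.786.

```python
# Rows copied from the per-task table:
# (task, prompts, baseline cost $, Atlas-1 cost $, baseline score, Atlas-1 score)
ROWS = [
    ("Summarization", 200, 11.42, 1.36, 0.437, 0.405),
    ("Classification", 3080, 12.38, 3.94, 0.902, 0.886),
    ("Structured extraction", 50, 1.95, 0.21, 0.980, 0.940),
    ("Code review", 100, 5.14, 4.10, 0.853, 0.821),
    ("RAG-grounded QA", 150, 6.42, 0.95, 0.756, 0.728),
]


def summarize(rows):
    """Recompute the summary cards from the per-task rows."""
    baseline_spend = sum(r[2] for r in rows)
    atlas_spend = sum(r[3] for r in rows)
    baseline_avg = sum(r[4] for r in rows) / len(rows)  # unweighted mean (assumption)
    atlas_avg = sum(r[5] for r in rows) / len(rows)
    return {
        "prompts": sum(r[1] for r in rows),
        "baseline_spend": round(baseline_spend, 2),
        "atlas_spend": round(atlas_spend, 2),
        "savings": round(baseline_spend - atlas_spend, 2),
        "savings_pct": round(100 * (baseline_spend - atlas_spend) / baseline_spend, 1),
        "baseline_avg_eval": round(baseline_avg, 3),
        "atlas_avg_eval": round(atlas_avg, 3),
        "eval_delta_pct": round(100 * (atlas_avg - baseline_avg) / baseline_avg, 1),
        # Per-task savings, rounded to whole percent as in the table.
        "per_task_savings_pct": [round(100 * (1 - r[3] / r[2])) for r in rows],
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

Run against the rows above, this reproduces $37.31 baseline spend, $10.56 Atlas spend, $26.75 (71.7%) savings on 3,580 prompts, and the per-task savings column.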

Methodology

Each row is produced by pnpm benchmarks:run --task <slug>. The harness runs at temperature=0 against the public dataset, scores with the listed evaluator, and emits a JSONL file of inputs, outputs, and scores. Atlas-1 is run with tier: standard and no eval-history priming, so these are first-call results.
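
A JSONL file of this kind can be aggregated in a few lines. The sketch below is illustrative only: the field name "score" is an assumption, since the harness's actual record schema is not described here. Check the emitted file for the real key names before using it.

```python
import json
from statistics import mean


def mean_score(jsonl_lines, score_field="score"):
    """Average the per-example score over JSONL records.

    `score_field` is a hypothetical key name; adjust it to match
    whatever the harness actually writes.
    """
    scores = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        scores.append(float(record[score_field]))
    return mean(scores) if scores else None


if __name__ == "__main__":
    sample = [
        '{"input": "a", "output": "b", "score": 1.0}',
        '{"input": "c", "output": "d", "score": 0.5}',
    ]
    print(mean_score(sample))
```

To read a real file, pass an open file handle instead of a list, since iterating a text file yields one line at a time.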

Reproduce it

Clone newmen/benchmarks and run pnpm benchmarks:run --all with your own Newmen API key. Numbers should match within token-count noise. Drift is a bug; please file an issue and we’ll re-snap the table.
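
One way to check a reproduction against the published table is a relative-tolerance comparison per task. This is a sketch, not part of the repo: the task slugs are made up for illustration, and the 2% tolerance is an assumed stand-in for "token-count noise", which the page does not quantify.

```python
import math


def find_drift(published, reproduced, rel_tol=0.02):
    """Return the tasks whose reproduced value differs from the published one.

    `rel_tol` is an assumed threshold, not a documented one. A task that is
    missing from `reproduced` is also reported as drifted.
    """
    drifted = []
    for task, expected in published.items():
        actual = reproduced.get(task)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            drifted.append(task)
    return drifted


if __name__ == "__main__":
    # Hypothetical slugs; published values are the Atlas-1 costs from the table.
    published = {"xsum": 1.36, "banking77": 3.94}
    reproduced = {"xsum": 1.37, "banking77": 4.50}
    print(find_drift(published, reproduced))
```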

Benchmarks run on the live Newmen API, and the snapshot is refreshed weekly. Per-call costs vary with prompt length and time of day, so these figures are aggregates over the listed sample sizes. Real per-workload savings are best measured with the comparison runner on your own prompts.

Routing evaluation

What Atlas measures per call

Before routing to any provider, Atlas has a prior on these signals. As your operations accumulate history, the prior becomes personalized to your traffic.

Signal	Kind	Description
Operation accuracy	Pass rate	Per-operation pass rate on the customer's golden dataset, measured before and after each routing or adapter change.
Latency p50 / p95	Latency	Time-to-first-token measured at the Newmen API layer, per operation and per provider. Used to weight routing decisions.
Calibrated refusal rate	Safety	Fraction of under-specified prompts that return a calibrated refusal rather than a fabricated answer. Internal target: 95%.
Schema compliance	Structure	JSON-schema pass rate on structured output operations. Tracked separately from semantic accuracy because schema violations often surface before accuracy regressions.
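
For readers who want to compute comparable signals on their own logs, here is a minimal sketch of the latency percentiles and a generic pass rate. The function names and the choice of the "inclusive" quantile method are assumptions for illustration; the page does not say how Atlas computes its percentiles.

```python
from statistics import quantiles


def latency_percentiles(ttft_ms):
    """Return (p50, p95) for time-to-first-token samples in milliseconds."""
    if len(ttft_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is the 50th percentile,
    # index 94 is the 95th.
    cuts = quantiles(ttft_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def pass_rate(results):
    """Fraction of passing calls, given per-call pass/fail booleans."""
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    samples = [120, 135, 150, 142, 500, 128, 131, 160, 145, 138]
    p50, p95 = latency_percentiles(samples)
    print(f"p50={p50:.1f} ms, p95={p95:.1f} ms")
    print(f"schema pass rate={pass_rate([True, True, False, True]):.2f}")
```

The same pass_rate helper applies to operation accuracy, refusal rate, and schema compliance, since each is a fraction of calls meeting a yes/no criterion.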

Adapter evaluation

What gates a per-operation adapter release

Every adapter goes through the same ship-gate system customers use for dataset promotion. An adapter does not deploy until every gate passes.

Check	Kind	Description
Ship gate pass rate	Promotion gate	The evaluator score on the customer's golden dataset must meet or exceed every configured minimum score before an adapter is promoted.
Regression check	Regression	The adapter is run against the full golden dataset from all prior versions. Any operation that was passing must continue to pass.
Latency delta	Latency	Adapters are substantially smaller than the base model. We measure the change in TTFT and verify that the adapter's latency does not exceed the base model's p95.
Cost per 1M tokens	Cost	Because adapters are smaller, they cost less per call. We track the cost curve per operation over time as a secondary measure of loop health.
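
The first two checks combine into an all-or-nothing decision. The sketch below shows one way to express that logic under stated assumptions: the function name, argument shapes, and gate names are hypothetical and do not describe the actual ship-gate implementation.

```python
def ship_gate(scores, minimums, previously_passing, now_passing):
    """Decide whether an adapter may be promoted.

    scores / minimums: evaluator name -> score and configured floor.
    previously_passing / now_passing: IDs of golden-dataset operations that
    passed under prior versions and under the candidate adapter.
    """
    # A gate fails if its score is missing or below the configured minimum.
    failed_gates = [
        name for name, floor in minimums.items()
        if scores.get(name, float("-inf")) < floor
    ]
    # Anything that used to pass must still pass.
    regressions = sorted(set(previously_passing) - set(now_passing))
    return {
        "promote": not failed_gates and not regressions,
        "failed_gates": failed_gates,
        "regressions": regressions,
    }


if __name__ == "__main__":
    decision = ship_gate(
        scores={"accuracy": 0.91, "schema": 0.99},
        minimums={"accuracy": 0.90, "schema": 0.98},
        previously_passing={"op-a", "op-b"},
        now_passing={"op-a"},
    )
    print(decision)
```

In this example both score gates pass, but op-b regressed, so the adapter is not promoted.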

Your improvement curve is visible in the console under each operation. Every eval run is versioned and auditable. The methodology used for any evaluation is the evaluator config you wrote, not something we define for you.

Methodology

What we measure inside the Atlas router

Per-call signals Atlas conditions on when picking the cheapest passing path for an operation. Customer-bound evaluators run on top of these in the reliability loop.
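
"Cheapest passing path" can be read as: among the candidate routes expected to meet the quality bar, choose the one with the lowest cost. The sketch below illustrates that reading only. The Candidate fields, the single score threshold, and the route names are assumptions, and the real router conditions on more signals than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical route label, e.g. model plus provider
    cost_per_call: float   # expected cost in dollars
    expected_score: float  # prior estimate of the evaluator score


def cheapest_passing(candidates, min_score):
    """Return the lowest-cost candidate whose expected score meets min_score."""
    passing = [c for c in candidates if c.expected_score >= min_score]
    if not passing:
        return None  # no route is expected to pass; the caller must fall back
    return min(passing, key=lambda c: c.cost_per_call)


if __name__ == "__main__":
    routes = [
        Candidate("frontier-baseline", 0.0570, 0.98),
        Candidate("open-model-a", 0.0068, 0.94),
        Candidate("open-model-b", 0.0031, 0.81),
    ]
    print(cheapest_passing(routes, min_score=0.90))
```

With a 0.90 threshold, open-model-b is cheaper but is excluded for falling below the bar, so open-model-a is selected.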

Want the technical detail?

Read the training overview.

The research note on Atlas-1 training covers the base model architecture, pre-training data composition, and the adapter training methodology in technical detail.

Atlas-1 training overview →

See it in action

Your eval curve, in the console.

Sign up, tag your first operation, run an evaluator, and see your pass rate over time. No demo data — your actual production traffic.

Get started →