Benchmarks
Same task. 67% less.
Five representative public benchmarks: summarization, classification, structured extraction, code review, RAG-grounded QA. Atlas-1 Standard tier vs each task’s frontier baseline. Methodology, harness, and raw outputs are published — clone the repo and reproduce them yourself.
Per-task results
Cost-savings, not leaderboard scores.
Each row pits the frontier baseline against Atlas-1 on a representative public dataset. The evaluator decides whether quality held; the cost column shows what you'd pay for the same workload.
Total savings: $26.75 (71.7% on 3,580 prompts)
Baseline spend: $37.31 (frontier baselines, direct)
Atlas spend: $10.56 (avg eval 0.756 vs baseline 0.786, -3.8%)
| Task | Baseline | Atlas-1 routed to | Cost (baseline → Atlas-1) | Eval score (baseline → Atlas-1) | Savings |
|---|---|---|---|---|---|
| Summarization · XSum: 200-article subset of the XSum corpus. Score is ROUGE-1 F1 against gold summaries. | openai/gpt-5.5 | meta-llama/llama-3.1-70b-instruct (q8) · Hyperbolic | $11.42 → $1.36 | 0.437 → 0.405 | 88% |
| Classification · Banking77: full Banking77 test set, 3,080 customer-service intents across 77 classes. Score is accuracy@1. | anthropic/claude-sonnet-4.6 | qwen/qwen-2.5-72b-instruct (q4) · DeepInfra | $12.38 → $3.94 | 0.902 → 0.886 | 68% |
| Structured extraction · Invoices: 50 invoice fixtures; a JSON Schema validator measures pass rate at exact field match. | openai/gpt-5.5 | deepseek/deepseek-chat-v3 (q8) · Together | $1.95 → $0.21 | 0.980 → 0.940 | 89% |
| Code review · PR comments: 100 redacted GitHub PRs with expected reviewer comments. LLM-judge score against gold rubric. | anthropic/claude-opus-4.7 | qwen/qwen-2.5-coder-32b-instruct (q8) · Hyperbolic | $5.14 → $4.10 | 0.853 → 0.821 | 20% |
| RAG-grounded QA · Natural Questions: 150 NQ open-domain questions with retrieved-passage context. EM + F1 token overlap. | openai/gpt-5.5 | meta-llama/llama-3.1-70b-instruct (q8) · Hyperbolic | $6.42 → $0.95 | 0.756 → 0.728 | 85% |
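The totals above can be rechecked from the per-task rows. The following is a minimal Python sketch: the figures are copied from the table, while the variable and function names are chosen here for illustration and are not part of the benchmark harness. It assumes the average eval score is an unweighted mean across the five tasks, which is what reproduces the published averages.

```python
# (task, baseline cost $, Atlas-1 cost $, baseline score, Atlas-1 score, prompts)
# Figures are copied from the per-task table; names are illustrative only.
ROWS = [
    ("Summarization", 11.42, 1.36, 0.437, 0.405, 200),
    ("Classification", 12.38, 3.94, 0.902, 0.886, 3080),
    ("Structured extraction", 1.95, 0.21, 0.980, 0.940, 50),
    ("Code review", 5.14, 4.10, 0.853, 0.821, 100),
    ("RAG-grounded QA", 6.42, 0.95, 0.756, 0.728, 150),
]


def savings_pct(baseline_cost, atlas_cost):
    """Share of the baseline spend that was avoided, in percent."""
    return 100.0 * (baseline_cost - atlas_cost) / baseline_cost


# Aggregate spend across all five tasks.
baseline_total = round(sum(r[1] for r in ROWS), 2)
atlas_total = round(sum(r[2] for r in ROWS), 2)
total_savings = round(baseline_total - atlas_total, 2)
total_savings_pct = round(savings_pct(baseline_total, atlas_total), 1)
prompts = sum(r[5] for r in ROWS)

# Unweighted (per-task) mean of the eval scores.
baseline_avg = sum(r[3] for r in ROWS) / len(ROWS)
atlas_avg = sum(r[4] for r in ROWS) / len(ROWS)
eval_delta_pct = round(100.0 * (atlas_avg - baseline_avg) / baseline_avg, 1)

for task, base, atlas, *_ in ROWS:
    print(f"{task}: {savings_pct(base, atlas):.0f}% saved")
print(f"Total: ${total_savings} saved ({total_savings_pct}%) on {prompts} prompts")
print(f"Avg eval: {atlas_avg:.3f} vs {baseline_avg:.3f} ({eval_delta_pct}%)")
```

Note that the 71.7% figure is cost-weighted (total savings over total baseline spend), so it is dominated by the larger workloads rather than being an average of the per-task percentages.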
Methodology
Each row is produced by running pnpm benchmarks:run --task <slug>. The harness runs at temperature=0 against the public dataset, scores with the listed evaluator, and emits a JSONL file of inputs, outputs, and scores. Atlas-1 is run with tier: standard and no eval-history priming, so these are first-call results.
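To make the "listed evaluator" concrete, here is a rough sketch of the exact-match and token-overlap F1 scoring named for the RAG-grounded QA row. The harness's actual normalization and tokenization are not described on this page, so this version assumes lowercasing and whitespace splitting; the function names are hypothetical. ROUGE-1 F1 has the same unigram-overlap form, although real ROUGE implementations apply their own tokenization.

```python
from collections import Counter


def tokens(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(tokens(prediction) == tokens(gold))


def token_f1(prediction, gold):
    """Harmonic mean of unigram precision and recall against the gold answer."""
    pred, ref = tokens(prediction), tokens(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(token_f1("the cat sat", "the cat"))
```

A per-task score would then be the mean of such per-example scores over the dataset.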
Reproduce it
Clone newmen/benchmarks and run pnpm benchmarks:run --all with your own Newmen API key. Numbers should match within token-count noise. Drift is a bug; please file an issue and we’ll re-snap the table.
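Checking a local run against the published table could look roughly like the sketch below. It is illustrative only: the JSONL field names (`cost_usd`, `score`) and the 2% relative tolerance are assumptions made here, since the page does not specify the output schema or how much "token-count noise" is acceptable.

```python
import json


def summarize(jsonl_lines):
    """Aggregate one task's JSONL output into total cost and mean score.

    Assumes each record carries `cost_usd` and `score` (hypothetical names).
    """
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        raise ValueError("no records to summarize")
    return {
        "cost": sum(r["cost_usd"] for r in records),
        "score": sum(r["score"] for r in records) / len(records),
    }


def drifted(local, published, rel_tol=0.02):
    """Return metrics whose local value differs from the published one
    by more than rel_tol (relative). The tolerance is an assumed value."""
    return {
        key: (local[key], published[key])
        for key in published
        if abs(local[key] - published[key]) > rel_tol * abs(published[key])
    }


# Toy records for illustration, not real benchmark output.
sample = [
    '{"cost_usd": 0.01, "score": 1.0}',
    '{"cost_usd": 0.02, "score": 0.5}',
]
local = summarize(sample)
print(drifted(local, {"cost": 0.03, "score": 0.75}))  # nothing outside tolerance
print(drifted(local, {"cost": 0.03, "score": 0.90}))  # score is flagged
```

Any metric returned by `drifted` would be a candidate for the issue report mentioned above.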
Run on the live Newmen API; snapshot refreshed weekly. Per-call costs vary with prompt length and time of day, so these figures are aggregates over the listed sample sizes. Real per-workload savings are best measured with the comparison runner on your own prompts.
Routing evaluation
What Atlas measures per call
Before routing to any provider, Atlas has a prior on these signals. As your operations accumulate history, the prior becomes personalized to your traffic.
| Signal |
|---|
| Operation accuracy |
| Latency p50 / p95 |
| Calibrated refusal rate |
| Schema compliance |
Adapter evaluation
What gates a per-operation adapter release
Every adapter goes through the same ship-gate system customers use for dataset promotion. An adapter does not deploy until every gate passes.
| Check |
|---|
| Ship gate pass rate |
| Regression check |
| Latency delta |
| Cost per 1M tokens |
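The "every gate must pass" rule can be pictured with a short sketch. The check names come from the table above; how each check is computed, the metric keys, and the thresholds are all assumptions made for illustration and do not describe the actual ship-gate system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool


def evaluate_gates(candidate, current, limits):
    """Build one result per check. Metric keys and limits are hypothetical."""
    latency_delta = candidate["latency_p95_ms"] - current["latency_p95_ms"]
    return [
        GateResult("ship gate pass rate",
                   candidate["pass_rate"] >= limits["min_pass_rate"]),
        GateResult("regression check",
                   candidate["pass_rate"] >= current["pass_rate"]),
        GateResult("latency delta",
                   latency_delta <= limits["max_latency_delta_ms"]),
        GateResult("cost per 1M tokens",
                   candidate["cost_per_1m_usd"] <= limits["max_cost_per_1m_usd"]),
    ]


def failing(results):
    """Names of the checks that did not pass."""
    return [r.name for r in results if not r.passed]


def can_deploy(results):
    """Deploy only if at least one gate ran and none failed."""
    return bool(results) and all(r.passed for r in results)


# Toy numbers for illustration only.
current = {"pass_rate": 0.90, "latency_p95_ms": 800.0}
limits = {"min_pass_rate": 0.95, "max_latency_delta_ms": 50.0,
          "max_cost_per_1m_usd": 1.0}
good = {"pass_rate": 0.97, "latency_p95_ms": 820.0, "cost_per_1m_usd": 0.8}
slow = {"pass_rate": 0.97, "latency_p95_ms": 900.0, "cost_per_1m_usd": 0.8}

print(can_deploy(evaluate_gates(good, current, limits)))
print(failing(evaluate_gates(slow, current, limits)))
```

Treating an empty result list as a failure is a deliberate choice in this sketch: an adapter that was never evaluated should not ship by default.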
Your improvement curve is visible in the console under each operation. Every eval run is versioned and auditable. The methodology used for any evaluation is the evaluator config you wrote, not something we define for you.
Methodology
What we measure inside the Atlas router
Per-call signals Atlas conditions on when picking the cheapest passing path for an operation. Customer-bound evaluators run on top of these in the reliability loop.
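One way to read "cheapest passing path" is as a filter followed by a minimum: discard every candidate whose estimated signals miss the operation's requirements, then take the lowest-cost survivor. The sketch below illustrates that reading only. The field names, thresholds, and model labels are hypothetical, and the real router's decision logic is not described on this page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Path:
    model: str                      # hypothetical label for a model/provider pair
    est_cost_usd: float             # estimated cost per call
    est_accuracy: float             # prior on operation accuracy
    est_latency_p95_ms: float       # prior on p95 latency
    est_schema_compliance: float    # prior on schema compliance


@dataclass(frozen=True)
class Requirements:
    min_accuracy: float
    max_latency_p95_ms: float
    min_schema_compliance: float


def passes(path: Path, req: Requirements) -> bool:
    """True when every estimated signal meets the operation's requirement."""
    return (
        path.est_accuracy >= req.min_accuracy
        and path.est_latency_p95_ms <= req.max_latency_p95_ms
        and path.est_schema_compliance >= req.min_schema_compliance
    )


def cheapest_passing(paths: Sequence[Path], req: Requirements) -> Optional[Path]:
    """Lowest-cost path that passes, or None if nothing qualifies."""
    candidates = [p for p in paths if passes(p, req)]
    return min(candidates, key=lambda p: p.est_cost_usd, default=None)


# Toy candidates for illustration only.
paths = [
    Path("model-a", 0.0100, 0.95, 900.0, 0.99),
    Path("model-b", 0.0020, 0.91, 700.0, 0.98),
    Path("model-c", 0.0005, 0.80, 400.0, 0.90),
]
req = Requirements(min_accuracy=0.90, max_latency_p95_ms=1000.0,
                   min_schema_compliance=0.95)

choice = cheapest_passing(paths, req)
print(choice.model if choice else "no passing path")
```

Returning `None` when nothing qualifies leaves the fallback policy (for example, routing to the baseline model) to the caller.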
Want the technical detail?
Read the training overview.
The research note on Atlas-1 training covers the base model architecture, pre-training data composition, and the adapter training methodology in technical detail.
Atlas-1 training overview →
See it in action
Your eval curve, in the console.
Sign up, tag your first operation, run an evaluator, and see your pass rate over time. No demo data — your actual production traffic.
Get started →