NEWMEN

Benchmarks

Same task. 71.7% less.

Five representative public benchmarks: summarization, classification, structured extraction, code review, RAG-grounded QA. Atlas-1 Standard tier vs each task’s frontier baseline. Methodology, harness, and raw outputs are published — clone the repo and reproduce them yourself.

Per-task results

Cost savings, not leaderboard scores.

Each row pits the frontier baseline against Atlas-1 on a representative public dataset. The evaluator decides whether quality held; the cost column shows what you'd pay for the same workload.

Total savings

$26.75

71.7% on 3,580 prompts

Baseline spend

$37.31

Frontier baselines, direct

Atlas spend

$10.56

avg eval 0.756 vs baseline 0.786 (-3.8%)

Task | Baseline | Atlas-1 routed to | Cost (baseline, Atlas-1) | Eval score (baseline, Atlas-1) | Savings

Summarization · XSum

200-article subset of the XSum corpus. Score is ROUGE-1 F1 against gold summaries.

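Baseline: openai/gpt-5.5
Atlas-1 routed to: meta-llama/llama-3.1-70b-instruct (q8 · Hyperbolic)
Cost: $11.42 baseline, $1.36 Atlas-1
Eval score: 0.437 baseline, 0.405 Atlas-1
Savings: 88%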

Classification · Banking77

Full Banking77 test set: 3,080 customer-service intents across 77 classes. Score is accuracy@1.

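Baseline: anthropic/claude-sonnet-4.6
Atlas-1 routed to: qwen/qwen-2.5-72b-instruct (q4 · DeepInfra)
Cost: $12.38 baseline, $3.94 Atlas-1
Eval score: 0.902 baseline, 0.886 Atlas-1
Savings: 68%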

Structured extraction · Invoices

50 invoice fixtures. Score is the pass rate from a JSON Schema validator at exact field match.

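Baseline: openai/gpt-5.5
Atlas-1 routed to: deepseek/deepseek-chat-v3 (q8 · Together)
Cost: $1.95 baseline, $0.21 Atlas-1
Eval score: 0.980 baseline, 0.940 Atlas-1
Savings: 89%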

Code review · PR comments

100 redacted GitHub PRs with expected reviewer comments. LLM-judge score against gold rubric.

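Baseline: anthropic/claude-opus-4.7
Atlas-1 routed to: qwen/qwen-2.5-coder-32b-instruct (q8 · Hyperbolic)
Cost: $5.14 baseline, $4.10 Atlas-1
Eval score: 0.853 baseline, 0.821 Atlas-1
Savings: 20%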

RAG-grounded QA · Natural Questions

150 NQ open-domain questions with retrieved-passage context. Scored with exact match (EM) and token-overlap F1.

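Baseline: openai/gpt-5.5
Atlas-1 routed to: meta-llama/llama-3.1-70b-instruct (q8 · Hyperbolic)
Cost: $6.42 baseline, $0.95 Atlas-1
Eval score: 0.756 baseline, 0.728 Atlas-1
Savings: 85%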
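The headline totals can be re-derived from the per-task rows. The sketch below copies the figures from the table and recomputes spend, savings, and average eval scores. It assumes the averages are an unweighted mean over the five tasks, which is consistent with the published 0.756 and 0.786 but is not stated on this page.

```python
# Figures copied from the per-task rows above:
# (task, prompts, baseline cost $, Atlas-1 cost $, baseline eval, Atlas-1 eval)
ROWS = [
    ("Summarization · XSum", 200, 11.42, 1.36, 0.437, 0.405),
    ("Classification · Banking77", 3080, 12.38, 3.94, 0.902, 0.886),
    ("Structured extraction · Invoices", 50, 1.95, 0.21, 0.980, 0.940),
    ("Code review · PR comments", 100, 5.14, 4.10, 0.853, 0.821),
    ("RAG-grounded QA · Natural Questions", 150, 6.42, 0.95, 0.756, 0.728),
]


def savings_pct(baseline_cost, atlas_cost):
    """Percentage saved relative to the baseline cost."""
    return 100.0 * (baseline_cost - atlas_cost) / baseline_cost


def summarize(rows):
    """Aggregate spend, savings, and unweighted mean eval scores."""
    baseline = sum(r[2] for r in rows)
    atlas = sum(r[3] for r in rows)
    base_eval = sum(r[4] for r in rows) / len(rows)
    atlas_eval = sum(r[5] for r in rows) / len(rows)
    return {
        "prompts": sum(r[1] for r in rows),
        "baseline_spend": round(baseline, 2),
        "atlas_spend": round(atlas, 2),
        "savings": round(baseline - atlas, 2),
        "savings_pct": round(savings_pct(baseline, atlas), 1),
        "baseline_eval": round(base_eval, 3),
        "atlas_eval": round(atlas_eval, 3),
        "eval_delta_pct": round(100.0 * (atlas_eval - base_eval) / base_eval, 1),
    }


summary = summarize(ROWS)
per_task = {r[0]: round(savings_pct(r[2], r[3])) for r in ROWS}
print(summary)
print(per_task)
```

Total savings are spend-weighted, so the 71.7% aggregate is dominated by the larger workloads and differs from a simple average of the per-task percentages.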

Methodology

Each row is the output of pnpm benchmarks:run --task <slug>. The harness runs at temperature=0 against the public dataset, scores with the listed evaluator, and emits a JSONL file of inputs, outputs, and scores. Atlas-1 is run with tier: standard and no eval-history priming; these are first-call results.
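As a rough illustration of working with the harness output, the sketch below parses JSONL records and averages their scores. The field names (input, output, score) and the sample records are assumptions made for this example; the actual schema is whatever the harness in the repo writes.

```python
import json
from statistics import mean


def load_records(lines):
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def mean_score(records):
    """Average the per-example score (hypothetical 'score' field)."""
    return mean(r["score"] for r in records)


# Hypothetical sample; real files come from the benchmark harness.
SAMPLE = [
    '{"input": "q1", "output": "a1", "score": 1.0}',
    '{"input": "q2", "output": "a2", "score": 0.0}',
    "",
    '{"input": "q3", "output": "a3", "score": 0.5}',
]

records = load_records(SAMPLE)
print(len(records), mean_score(records))
```

To read a real file, pass an open file handle instead of the list, since iterating a text file yields one line at a time.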

Reproduce it

Clone newmen/benchmarks and run pnpm benchmarks:run --all with your own Newmen API key. Numbers should match within token-count noise. Drift is a bug; please file an issue and we’ll re-snap the table.
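One way to check a re-run against the published snapshot is a relative-tolerance comparison per task. In this sketch the task keys, the re-run figures, and the 2% tolerance are all assumptions; "token-count noise" is not quantified on this page, so pick a tolerance that suits your run.

```python
import math

# Atlas-1 costs from the table above, keyed by hypothetical task names.
SNAPSHOT_COST = {
    "summarization": 1.36,
    "classification": 3.94,
    "extraction": 0.21,
}


def find_drift(snapshot, rerun, rel_tol=0.02):
    """Return tasks whose re-run value differs from the snapshot by more than rel_tol."""
    return sorted(
        task
        for task, expected in snapshot.items()
        if task in rerun and not math.isclose(rerun[task], expected, rel_tol=rel_tol)
    )


# Hypothetical re-run results for illustration only.
rerun_cost = {"summarization": 1.37, "classification": 4.50, "extraction": 0.21}
drifted = find_drift(SNAPSHOT_COST, rerun_cost)
print(drifted)
```

Any task the function reports is a candidate for the issue tracker.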

Run on the live Newmen API; snapshot refreshed weekly. Per-call costs vary with prompt length and time of day, so these figures are aggregates over the listed sample sizes. Real per-workload savings are best measured with the comparison runner on your own prompts.

Routing evaluation

What Atlas measures per call

Before routing to any provider, Atlas has a prior on these signals. As your operations accumulate history, the prior becomes personalized to your traffic.

Signal
Operation accuracy
Latency p50 / p95
Calibrated refusal rate
Schema compliance
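To make the signals above concrete, here is a minimal sketch that derives them from a call log. The record fields and sample values are hypothetical, percentiles use the simple nearest-rank method, and the refusal figure is a raw rate; how Atlas calibrates it is not described here.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def rate(calls, field):
    """Fraction of calls where the boolean field is true."""
    return sum(1 for c in calls if c[field]) / len(calls)


def signals(calls):
    """Summarize one operation's call history into the four signals."""
    latencies = [c["latency_ms"] for c in calls]
    return {
        "accuracy": rate(calls, "correct"),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "refusal_rate": rate(calls, "refused"),  # raw, not calibrated
        "schema_compliance": rate(calls, "schema_ok"),
    }


# Hypothetical call log for a single operation.
CALLS = [
    {"correct": True, "latency_ms": 120, "refused": False, "schema_ok": True},
    {"correct": True, "latency_ms": 80, "refused": False, "schema_ok": True},
    {"correct": False, "latency_ms": 200, "refused": True, "schema_ok": False},
    {"correct": True, "latency_ms": 100, "refused": False, "schema_ok": True},
]

op_signals = signals(CALLS)
print(op_signals)
```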

Adapter evaluation

What gates a per-operation adapter release

Every adapter goes through the same ship-gate system customers use for dataset promotion. An adapter does not deploy until every gate passes.

Check
Ship gate pass rate
Regression check
Latency delta
Cost per 1M tokens
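The all-gates-must-pass rule can be sketched as a simple conjunction over gate results. The GateResult type and the sample outcomes are hypothetical, and this is an illustration of the rule rather than the actual ship-gate implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool


def can_deploy(results):
    """An adapter deploys only if there is at least one gate and every gate passed."""
    return len(results) > 0 and all(r.passed for r in results)


def failing_gates(results):
    """Names of the gates blocking the release."""
    return [r.name for r in results if not r.passed]


# Hypothetical outcomes for the four checks listed above.
results = [
    GateResult("Ship gate pass rate", True),
    GateResult("Regression check", True),
    GateResult("Latency delta", False),
    GateResult("Cost per 1M tokens", True),
]

print(can_deploy(results), failing_gates(results))
```

Treating an empty result list as "do not deploy" is a design choice in this sketch, so a misconfigured adapter with no gates cannot pass by default.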

Your improvement curve is visible in the console under each operation. Every eval run is versioned and auditable. The methodology used for any evaluation is the evaluator config you wrote, not something we define for you.

Methodology

What we measure inside the Atlas router

Per-call signals Atlas conditions on when picking the cheapest passing path for an operation. Customer-bound evaluators run on top of these in the reliability loop.
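Picking the "cheapest passing path" can be illustrated as a filter followed by a minimum. The candidate names, costs, scores, and threshold below are invented for the example, and the real router presumably conditions on more signals than a single expected score.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    model: str
    cost_per_call: float   # dollars, illustrative
    expected_score: float  # prior eval score for this operation, illustrative


def cheapest_passing(candidates: Sequence[Candidate], min_score: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose expected score meets the bar, or None."""
    passing = [c for c in candidates if c.expected_score >= min_score]
    return min(passing, key=lambda c: c.cost_per_call, default=None)


# Hypothetical candidates for one operation.
CANDIDATES = [
    Candidate("frontier-model", 0.057, 0.44),
    Candidate("open-large", 0.007, 0.41),
    Candidate("open-small", 0.002, 0.30),
]

choice = cheapest_passing(CANDIDATES, min_score=0.40)
print(choice.model if choice else "no passing path")
```

When no candidate clears the bar the function returns None, leaving the caller to decide whether to fall back to the baseline or fail the call.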

Want the technical detail?

Read the training overview.

The research note on Atlas-1 training covers the base model architecture, pre-training data composition, and the adapter training methodology in technical detail.

Atlas-1 training overview →

See it in action

Your eval curve, in the console.

Sign up, tag your first operation, run an evaluator, and see your pass rate over time. No demo data — your actual production traffic.

Get started →