NEWMEN

Blog

We cut our inference bill 60% by changing one string

2026-05-26 · 3 min read

Atlas-1 is now Newmen's smart-routing default. Drop in the model name, and Atlas picks the cheapest path that passes your eval gates per call. If quality drops, the call isn't billed.

We just shipped a pivot that’s been a long time coming. The headline:

- model: "chatgpt-5.5",
+ model: "atlas-1",

One string. Same SDK. Same response shape. 30–70% off inference, depending on the workload. And if a call ever scores below the threshold of an evaluator you bound to that operation, it isn’t metered — period.

That last sentence is the contract the cost story rests on. Every other inference broker says some flavour of “trust us, the cheaper variant is fine.” That trust never survives the first production regression. Atlas inverts the contract: we only route cheaper when your evals say it’s safe, and we put our bill on the line if we’re wrong.

How it actually works

When you pass model: "atlas-1", the router looks at three things for that specific call:

  1. The operation you tagged. metadata.operation_id: "summarize_ticket" tells Atlas which workload this call belongs to. If you haven’t bound an evaluator to that operation, Atlas defaults to the conservative path; if you have, the eval-gate history is the signal.
  2. The eval-gate history. Atlas tracks per-operation, per-quantization, per-provider pass rates. If q8 on Hyperbolic has stayed green on summarize_ticket for two weeks, that’s the path. If it slipped last Tuesday, Atlas falls back.
  3. The tier you asked for. realtime is direct-passthrough full-precision (the safe default). standard is “cheapest passing variant.” batch is “largest discount, async-friendly.” Default is standard.

The response carries a delivery block telling you exactly what Atlas picked:

{
  "id": "chatcmpl-…",
  "delivery": {
    "tier": "standard",
    "served_by": "provider",
    "provider": "Hyperbolic",
    "quantization": "q8",
    "upgraded": false
  }
}

No magic. The decision is visible per call.

The worked example

A customer running 30M tokens / month of support-ticket summarisation on chatgpt-5.5 was paying about $210/mo. They flipped one string and added a regex evaluator with min_score: 0.95 to the summarize_ticket operation. After two weeks of green eval gates on Standard tier, Atlas routed 80% of calls to meta-llama/llama-3.1-70b-instruct Q8 on Hyperbolic and 20% to ChatGPT-5.5 on calls the loop said still needed it.

New monthly cost: ~$84. That’s a 60% reduction. Calls scoring below the regex threshold weren’t metered. Their bill matches the contract.

What we shipped

  • tier: "realtime" | "standard" | "batch" on every chat completion.
  • tier_strict: true if you want the router to fail rather than silently upgrade.
  • forbid_atlas_network: true if your compliance posture forbids consumer-GPU partners on this call.
  • A delivery block on every response.
  • Eval-gated quality refund (calls below your bound evaluator’s min_score aren’t metered).
  • A new public benchmarks page with reproducible per-task cost-savings on five datasets (XSum, Banking77, JSON invoice extraction, code review, NQ-RAG). Methodology and harness open-source so you can clone, run, and verify.
  • An in-product comparison runner at /console/compare — paste 20 prompts, see what your bill becomes.

Atlas Network — open beta applications

The longer-range cost story is Atlas Network: a pool of consumer GPUs run by approved partners, hosting open-weight models with eval-verified quality. We’re currently in invite-only beta and accepting applications at /atlas-network. When the desktop app ships, the same atlas-1 calls will transparently route to partners for open-weight workloads where customer policy allows. Closed-weight models (OpenAI, Anthropic, Gemini, Grok) never leave managed providers regardless.

What this is not

It’s not 10× faster. We had that as a third claim in early drafts and cut it. Atlas’s avoidance of saturated providers does drop p95 latency 10–30% at peak hours, but it’s a side effect, not a headline. Speed is real on some workloads and not others; we’ll talk about it where it’s defensible.

It’s also not a magic compression scheme. We don’t fine-tune away your distribution shift; we just don’t bill you when the quantized variant gets it wrong. The mechanism is honest, the receipts are public, and the eval loop you already have is the only thing standing between cheaper and worse.

Try it on your prompts

The comparison runner at /console/compare takes 60 seconds: paste a sample of your existing prompts, pick your baseline model, run, and see the cost / latency / output deltas side by side. The result page is shareable — forward it to your team to sign off on the switch.

If you’d like a hand setting it up for your top-volume operation, drop me a line at matthew@newmen.ai. The first ten teams I talk to this week get a 30-minute walkthrough.