Blog

Why we built Newmen

2026-05-12 · 4 min read

A founding note on model drift, the annotation gap, and the three things we decided you should never have to do yourself.

In 2024 and 2025 we worked across enough enterprise AI deployments to see the same failure mode at every one. It went like this. A vendor wins a bake-off on a curated evaluation set. The pilot ships to a contained slice of production. Three weeks in, the AI team starts seeing customer-facing artifacts that did not show up in eval — hallucinated invoice numbers, summaries that bury the lead, polite refusals where a clean answer was available. Engineers triage, write a notebook, propose a fix. Now the question becomes: how do we get this correction into the next model version?

The answer was always the same. Export. Relabel. Re-upload. Wait three weeks for the next training cycle. Discover during validation that the curated set you trained on did not reflect last month's traffic shape. Repeat.

The drift was real. The providers were not seeing it because the providers were not running the systems. The customers were seeing it but could not act on it. The gap between pilot and production was made entirely of stale signal.

The thing nobody was building

There were good model providers. There were good observability tools. There were good annotation platforms. What nobody was building was a platform where the production call itself was the labelable surface — where an engineer could see a problematic completion, correct it in place, and have that correction shape the next version without ever exporting a file. And crucially: where this workflow was model-agnostic. You should not have to re-instrument your stack every time you switch providers.

We started Newmen because that gap was the most expensive one we kept watching. Customers were paying for state-of-the-art capability and then paying again for the friction of keeping it accurate on their own workload.

What we believe

A few things, written down so we can hold ourselves to them.

Model choice is solved. The bill and the proof aren't. The major providers all ship capable models. Our axis is different: serve the same workload the cheapest way that still holds quality, and prove it on your own traffic. That is the deliverable. We are not competing with the labs. We are the optimization layer that makes the models you already use cheaper — verifiably.

Engineers — not annotators — are the right labelers. The people closest to the workload have the most context. Newmen is built around the assumption that an AI platform engineer reviewing yesterday's production calls is the highest-quality signal source in the system. Annotators have their place; they are not the place.

We optimize routing, not your model. Atlas is an optimization engine — it learns which path holds quality at the lowest cost on your workload. We never train a language model on your data by default; your prompts and code stay yours. Per-tenant tuning exists only as an explicit opt-in (the Reliability Loop), with a human in the loop on every retrain — a feature, not a bug.

Transparency is non-negotiable. Every call recorded. The served model, provider, and price on every call. Every dataset versioned. Every evaluator with its rubric on file. Your verified savings shown line by line. If we will not explain a decision, we will not ship it.

The shape of the team

We are small and we plan to stay small for a while. Six people built the v1 you are looking at today. The roster is on /about if you are curious. We are hiring a Founding Research Engineer, an ML Platform Engineer, and an Enterprise Solutions Engineer. Each role is described at /careers.

The bet behind Newmen is that the next decade of enterprise AI is won on reliability infrastructure, not capability ceilings. We will be wrong about plenty of things. We do not think we will be wrong about that.

What you no longer have to do

The platform was designed around three eliminations.

Prompt engineering your way to reliability. Prompt engineering is a local maximum. It optimizes for the eval set you have, not the traffic you will see. It re-breaks with every model update. The correct intervention is not a better prompt — it is an optimization layer with verification attached.

Babysitting your inference bill. Picking the cheapest model by hand, re-checking it every release, hoping it held quality. Atlas serves each call the cheapest way that clears your bar and shows you the savings line by line. You point at us with one env var; the optimization is ours.

Operating inference at scale. One env var. You keep the models you use. We handle sourcing each call the cheapest way that holds quality, and you see exactly what you saved. The operational overhead is ours.

If this resonates, talk to us. We mean that.

← All posts