Why we built Newmen

2026-05-12 · 3 min read

A founding note on model drift, the annotation gap, and the three things we decided you should never have to do yourself.

In 2024 and 2025 we worked across enough enterprise AI deployments to see the same failure mode in every one. It went like this. A vendor wins a bake-off on a curated evaluation set. The pilot ships to a contained slice of production. Three weeks in, the AI team starts seeing customer-facing artifacts that did not show up in eval: hallucinated invoice numbers, summaries that bury the lede, polite refusals where a clean answer was available. Engineers triage, write a notebook, propose a fix. Now the question becomes: how do we get this correction into the next model version?

The answer was always the same. Export. Relabel. Re-upload. Wait three weeks for the next training cycle. Discover during validation that the curated set you trained on did not reflect last month's traffic shape. Repeat.

The drift was real. The providers were not seeing it because the providers were not running the systems. The customers were seeing it but could not act on it. The gap between pilot and production was made entirely of stale signal.

The thing nobody was building

There were good model providers. There were good observability tools. There were good annotation platforms. What nobody was building was a platform where the production call itself was the labelable surface — where an engineer could see a problematic completion, correct it in place, and have that correction shape the next version without ever exporting a file. And crucially: where this workflow was model-agnostic. You should not have to re-instrument your stack every time you switch providers.

We started Newmen because that gap was the most expensive one we kept watching. Customers were paying for state-of-the-art capability and then paying again for the friction of keeping it accurate on their own workload.

What we believe

A few things, written down so we can hold ourselves to them.

Model choice is solved. Reliability isn't. The major providers all ship capable models. Our axis is different: how does the model perform on your operations, measured continuously, with the next release gated on those measurements? That is the deliverable. We are not competing with the labs. We are building the layer that makes any of them reliable for a specific customer.

Engineers — not annotators — are the right labelers. The people closest to the workload have the most context. Newmen is built around the assumption that an AI platform engineer reviewing yesterday's production calls is the highest-quality signal source in the system. Annotators have their place; they are not the place.

Training is too important to be self-serve. We thought hard about this. We could have built a self-serve training button. We chose not to. Per-tenant training shapes a model that will then talk to your customers. A human in the loop on every retrain is a feature, not a bug. The loop is fast — request to first eval is days, not weeks — but it is not push-a-button.

Transparency is non-negotiable. Every call recorded. Every dataset versioned. Every evaluator with its rubric on file. Every provider markup shown clearly. If we will not explain a decision, we will not ship it. This applies to model behavior and to product behavior equally.
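To make the first belief concrete, here is a minimal sketch of what "the next release gated on those measurements" could look like. It is an illustration only, not Newmen's implementation: `EvalResult`, `gate_release`, and the two thresholds (`min_score`, `max_regression`) are hypothetical names and values chosen for the example, and evaluator scores are assumed to lie in [0, 1].

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence, Tuple


@dataclass(frozen=True)
class EvalResult:
    """One evaluator score for one recorded production call (hypothetical shape)."""
    call_id: str
    score: float  # assumed to be in [0, 1], higher is better


def gate_release(
    baseline: Sequence[EvalResult],
    candidate: Sequence[EvalResult],
    min_score: float = 0.8,        # assumed absolute quality floor
    max_regression: float = 0.02,  # assumed tolerated drop versus the baseline
) -> Tuple[bool, str]:
    """Decide whether a candidate model version may ship.

    The candidate passes only if its mean score clears the floor and it does
    not fall more than `max_regression` below the current baseline.
    """
    if not baseline or not candidate:
        return False, "missing evaluation results"

    base_mean = mean(r.score for r in baseline)
    cand_mean = mean(r.score for r in candidate)

    if cand_mean < min_score:
        return False, f"candidate mean {cand_mean:.3f} is below floor {min_score:.3f}"
    if base_mean - cand_mean > max_regression:
        return False, f"regression of {base_mean - cand_mean:.3f} exceeds {max_regression:.3f}"
    return True, f"candidate mean {cand_mean:.3f} versus baseline {base_mean:.3f}"
```

In this sketch the gate is a check a human reviewer reads before approving a retrain, not a replacement for that review. A real gate would likely compare per-evaluator scores on matched calls, not two plain means.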

The shape of the team

We are small and we plan to stay small for a while. Six people built the v1 you are looking at today. The roster is on /about if you are curious. We are hiring a Founding Research Engineer, an ML Platform Engineer, and an Enterprise Solutions Engineer. Each role is described at /careers.

The bet behind Newmen is that the next decade of enterprise AI is won on reliability infrastructure, not capability ceilings. We will be wrong about plenty of things. We do not think we will be wrong about that.

What you no longer have to do

The platform was designed around three eliminations.

Prompt engineering your way to reliability. Prompt engineering is a local maximum. It optimizes for the eval set you have, not the traffic you will see. It re-breaks with every model update. The correct intervention is not a better prompt — it is a routing layer with a training signal attached.

Running training infrastructure. The engineering cost of fine-tuning a model on your own data is high enough that most teams decide not to do it at all. We handle the infrastructure. You tag corrections in the console. The loop runs from there, with a human reviewing every retrain.

Operating inference at scale. One API key. We handle the model serving, scaling, and cost optimization. You get the API surface. The operational overhead is ours.
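The first elimination mentions "a routing layer with a training signal attached". As a rough sketch of that idea, and not a description of Newmen's API, the example below routes a prompt to any registered provider, records every call, and lets an engineer attach a correction that later becomes a training example. `Router`, `CallRecord`, and the provider callables are all hypothetical names invented for the illustration; a provider here is simply any function from prompt to completion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A provider is modeled as any callable from prompt to completion, so the
# router does not depend on a particular vendor SDK (assumption for the sketch).
Provider = Callable[[str], str]


@dataclass
class CallRecord:
    """One recorded production call, optionally carrying an engineer's correction."""
    call_id: str
    provider: str
    prompt: str
    completion: str
    correction: Optional[str] = None


class Router:
    def __init__(self, providers: Dict[str, Provider], default: str) -> None:
        if default not in providers:
            raise ValueError(f"unknown default provider: {default}")
        self._providers = providers
        self._default = default
        self._log: Dict[str, CallRecord] = {}

    def complete(self, prompt: str, provider: Optional[str] = None) -> CallRecord:
        """Send the prompt to the chosen provider and record the call."""
        name = provider or self._default
        completion = self._providers[name](prompt)
        record = CallRecord(f"call-{len(self._log) + 1}", name, prompt, completion)
        self._log[record.call_id] = record
        return record

    def correct(self, call_id: str, corrected: str) -> None:
        """Attach a correction to a recorded call; raises KeyError if the id is unknown."""
        self._log[call_id].correction = corrected

    def training_examples(self) -> List[Tuple[str, str]]:
        """Return (prompt, corrected completion) pairs for every corrected call."""
        return [
            (r.prompt, r.correction)
            for r in self._log.values()
            if r.correction is not None
        ]
```

Because corrections are keyed to the recorded call and not to a vendor, switching the default provider in this sketch leaves the accumulated training signal untouched.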

If this resonates, talk to us. We mean that.