Why we didn't build a separate trace pipeline

2026-03-22 · 4 min read

The observability design decision that eliminated the export-label-upload cycle — and why it means corrections feed the optimization loop without a pipeline.

The default way to add observability to an LLM API is to run a trace pipeline alongside it. You instrument your application, emit spans to a collector, and ship them to a store. The trace is a copy of the call, structured differently, living somewhere else. We looked at this design and decided it was wrong for what we were trying to do.

The problem with copies

When the trace is separate from the call, you have two records of the same event. They drift. Not immediately, not dramatically, but over time in ways that are annoying to debug. More importantly, any correction you make to the trace — any annotation, any rating, any improved expected output — lives in trace storage, not next to the call. Getting that signal back into your evaluators and the optimization loop requires another pipeline.

Every team we talked to during design partner interviews had some version of this problem. A spreadsheet of annotated examples. A data pipeline that exported from the trace store, transformed the data, and shipped it somewhere else where it could be acted on. Someone on the team who maintained that pipeline and knew its quirks. The correction signal existed; it was just several steps removed from being usable.

This problem shows up regardless of which AI provider you use. The gap is not in the model. It is in the instrumentation layer around it.

Calls are the trace

We built it the other way. When you send a request through Newmen — to Atlas, to GPT-4o, to Gemini, to any supported model — the request payload and the response are recorded. The record is the call. There is no separate trace. Feedback you leave on a call — a rating, a correction, a set of tags — is stored on the call record, not in a side system. When you build a dataset, you query calls by operation and filter by feedback. The dataset items point back to the original call records. Everything is in one place.
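
To make the data model concrete, here is a minimal sketch in Python. It is illustrative only: the class names, field names, and the build_dataset helper are assumptions made up for this example, not Newmen's actual schema or API. It shows feedback stored on the call record itself, and a dataset built by filtering calls by operation and feedback, with each item pointing back to its source call.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feedback:
    """Feedback stored directly on the call record (hypothetical shape)."""
    rating: Optional[int] = None
    correction: Optional[str] = None
    tags: list[str] = field(default_factory=list)


@dataclass
class CallRecord:
    """One recorded call: request, response, and any feedback (hypothetical)."""
    call_id: str
    model: str
    operation: str
    request: dict
    response: str
    feedback: Optional[Feedback] = None


@dataclass
class DatasetItem:
    """A dataset item that points back to the call it came from."""
    call_id: str
    input: dict
    expected_output: str


def build_dataset(calls: list[CallRecord], operation: str) -> list[DatasetItem]:
    """Select calls for one operation that carry a correction."""
    return [
        DatasetItem(c.call_id, c.request, c.feedback.correction)
        for c in calls
        if c.operation == operation
        and c.feedback is not None
        and c.feedback.correction is not None
    ]


calls = [
    CallRecord("c1", "atlas", "summarize", {"prompt": "Summarize A"}, "draft A"),
    CallRecord("c2", "openai/chatgpt-5.5", "summarize", {"prompt": "Summarize B"}, "draft B"),
    CallRecord("c3", "atlas", "classify", {"prompt": "Label C"}, "spam"),
]

# Leaving feedback updates the call record in place; nothing is exported.
calls[1].feedback = Feedback(rating=2, correction="better B", tags=["too-long"])

dataset = build_dataset(calls, "summarize")
```

In this sketch the only step between a correction and a dataset item is a filter over records that already exist, which is the point of keeping the record and the annotation together.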

This sounds simple and it is, but the design decision has downstream consequences that took a while to work out.

What we had to give up

The main thing we gave up is the rich span structure that a proper tracing system gives you. If you have a multi-step chain — retrieve, rerank, generate — Newmen only observes the generate step. The retrieval and reranking are opaque. For teams that want to attribute failures to specific chain steps, that is a real limitation. We are thinking about it.

The second thing is that our call records are structurally similar to provider call logs, not to Datadog traces or Jaeger spans. Teams that have invested in distributed tracing infrastructure cannot just pipe Newmen calls into that system and have it work. The metadata.operation_id field gives you a logical grouping, but it is not a trace ID in the W3C Trace Context sense.
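
The difference can be sketched in a few lines of Python. The traceparent format below follows the W3C Trace Context header layout (version, trace ID, parent span ID, flags). The call record shape, a dict with an "id" and a "metadata" mapping, is an assumption for illustration; only the metadata.operation_id name comes from the platform. A traceparent places a span inside a tree, while a grouping key is just a label that calls share.

```python
import re
from collections import defaultdict

# W3C Trace Context traceparent header: version-traceid-parentid-flags.
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"not a valid traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }


def group_by_operation(calls: list[dict]) -> dict[str, list[str]]:
    """Group call IDs by metadata.operation_id (record shape is hypothetical)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for call in calls:
        groups[call["metadata"]["operation_id"]].append(call["id"])
    return dict(groups)


span = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

calls = [
    {"id": "c1", "metadata": {"operation_id": "support-reply"}},
    {"id": "c2", "metadata": {"operation_id": "support-reply"}},
    {"id": "c3", "metadata": {"operation_id": "ticket-triage"}},
]
groups = group_by_operation(calls)
```

The parsed span carries a parent ID that a tracing backend uses to rebuild the call tree. The grouped calls carry no such link, so in this sketch the grouping says which calls belong together but not how they relate to each other.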

For the customers we were designing for — teams where the LLM call is the unit of work, not one step in a larger instrumented pipeline — the tradeoffs landed in the right place.

The tagging interface as a consequence

The decision also shaped the console. Because corrections live on call records, the tagging interface is just a view into call records. You pull up a call, you see the request, you see the response, you write a correction in the same place. There is no "export to label" step. The correction is immediately queryable as a candidate dataset item.

This sounds like a product feature. It is. But it flows directly from the data model decision. When the record and the annotation live together, the annotation is cheap to write and cheap to use.

Importantly, this applies to every model on the platform. A call to openai/chatgpt-5.5 is recorded, correctable, and promotable into a dataset just like a call to atlas. The loop runs on your own traffic, never on a model we trained. The model field is a detail; the correction is the signal.

We have not regretted this design, but I want to be honest that it is a constraint. If your observability requirements go significantly beyond LLM call inspection, Newmen's built-in observability will not be enough on its own.

The bigger consequence, though, is for the export-label-upload cycle. When the correction lives on the call record, there is no pipeline to write before that correction can be put to work. It immediately becomes signal your evaluators can score and the optimization layer can condition on, and it is there if you ever opt into per-tenant tuning. The correction is already in the right place. This is the design decision behind "no pipeline" in the platform pitch: we did not eliminate the concept of labeling, we eliminated the infrastructure gap between a labeled example and a usable dataset item.