NEWMEN

Prompt engineering is a local maximum

2026-04-28 · 4 min read

Why iterating on prompts solves the wrong problem, and what the production-safe alternative looks like.

Prompt engineering works. That is worth saying clearly before arguing its limits. If you need a model to produce structured output in a specific format, or to follow a particular reasoning style, or to stay within a defined persona, a well-crafted system prompt is the right tool. It is fast to write, easy to test, and does not require any infrastructure. For a scoped task with a stable input distribution, it solves the problem.

The question is what happens when the input distribution is not stable, or when the task is not scoped.

The local maximum problem

When you prompt-engineer an operation, you are optimizing for the examples in front of you. Usually those are the examples you collected for an eval set — the 50 or 200 or 1000 inputs that represent the task well enough to test against. You iterate on the prompt until it handles those inputs at an acceptable rate. You ship it.

Production is not your eval set. It is the full distribution of inputs your users will actually send — including the long tail you did not think to include, the edge cases that only appear at volume, the adversarial phrasings, the multi-language inputs, the inputs that combine two patterns you tested separately but never together.

A prompt optimized for the eval set is a local maximum. It is the best prompt for the inputs you curated. It may or may not generalize.

This is not a knock on the people who write the prompts. It is a property of the optimization problem. You cannot prompt-engineer your way to coverage you do not have examples for.

The re-breaking problem

The second issue is model updates. Providers ship new model versions regularly. A prompt that was calibrated for claude-3-sonnet may behave differently on claude-3-5-sonnet. The change is not always for the worse, and is sometimes better on average, but the tail behavior shifts. Edge cases that were handled correctly start failing. Edge cases that were failing start passing. The eval set gives you a noisy read on which direction the net effect went; it does not tell you what broke specifically in production.
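
To make the "net effect versus what actually moved" distinction concrete, here is a minimal sketch that compares two eval runs example by example. The EvalResult type, the diff_runs helper, and the sample data are all hypothetical, invented for illustration; they are not part of any particular eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    example_id: str
    passed: bool


def pass_rate(results):
    """Fraction of examples that passed in one eval run."""
    return sum(r.passed for r in results) / len(results)


def diff_runs(old, new):
    """Return (regressions, fixes): example ids whose outcome flipped."""
    old_by_id = {r.example_id: r.passed for r in old}
    new_by_id = {r.example_id: r.passed for r in new}
    shared = old_by_id.keys() & new_by_id.keys()
    regressions = sorted(i for i in shared if old_by_id[i] and not new_by_id[i])
    fixes = sorted(i for i in shared if not old_by_id[i] and new_by_id[i])
    return regressions, fixes


# Illustrative data only: same prompt, two hypothetical model versions.
old_run = [EvalResult("a", True), EvalResult("b", True),
           EvalResult("c", False), EvalResult("d", True)]
new_run = [EvalResult("a", True), EvalResult("b", False),
           EvalResult("c", True), EvalResult("d", True)]

regressions, fixes = diff_runs(old_run, new_run)
print(pass_rate(old_run), pass_rate(new_run))  # the aggregate is unchanged
print(regressions, fixes)                      # but individual cases flipped
```

In this toy run the pass rate is identical before and after, while one case regressed and another was fixed. An aggregate score alone would report "no change".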

Every model update requires a prompt re-tuning cycle. The cost is real. Teams that have invested heavily in prompt engineering find themselves running this cycle every few months, and the maintenance burden compounds as the number of operations grows.

A concrete example

Consider a ticket summarization operation. You tune the prompt against your 20 most common ticket types: billing issues, password resets, shipping delays, account questions, and so on. The prompt does well on all 20. You ship it.

Three weeks into production, ticket type 21 starts appearing at meaningful volume. It covers a new product feature that launched last week, a category that did not exist when you built the eval set. The prompt handles some of these tickets reasonably and fails on others in a pattern that takes two days to notice because the overall failure rate stays below the threshold that triggers an alert.
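
A rough sketch of why the alert stays quiet: a new category can fail often while the overall rate remains under the threshold. The counts, the category names, and the 5% threshold below are made-up numbers for illustration, not measurements.

```python
from collections import Counter

ALERT_THRESHOLD = 0.05  # hypothetical: alert when more than 5% of calls fail


def failure_rates(calls):
    """calls: iterable of (category, failed) pairs.

    Returns (overall_rate, per_category_rates).
    """
    totals, failures = Counter(), Counter()
    for category, failed in calls:
        totals[category] += 1
        if failed:
            failures[category] += 1
    n = sum(totals.values())
    overall = sum(failures.values()) / n if n else 0.0
    per_category = {c: failures[c] / totals[c] for c in totals}
    return overall, per_category


# Illustrative traffic: 950 calls in a known category, 50 in a new one.
calls = (
    [("billing", False)] * 940 + [("billing", True)] * 10
    + [("new_feature", False)] * 30 + [("new_feature", True)] * 20
)

overall, per_category = failure_rates(calls)
flagged = [c for c, rate in per_category.items() if rate > ALERT_THRESHOLD]

print(overall > ALERT_THRESHOLD)  # the aggregate alert does not fire
print(flagged)                    # the per-category view shows the problem
```

Splitting the failure rate by category only helps once the category has a label, which is the same identification step the prompt revision cycle depends on.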

You debug it. You update the prompt. You re-run evals. You ship a fix. This is fine for ticket type 21. Ticket type 22 arrives six weeks later.

The prompt revision cycle is bounded by how fast you can identify and codify new task categories. Production traffic is not.

The alternative

The intervention that scales differently is not a better prompt. It is a combination of three things.

Routing. Instead of asking one model to handle all ticket types with a single prompt, route by operation type. Ticket summarization is an operation. A structured extraction for billing disputes is a different operation. Each operation routes to the model that performs best on its specific traffic. A routing layer that tracks per-operation accuracy can make this decision automatically as signal accumulates.
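
As a minimal sketch of a routing layer that tracks per-operation accuracy, assume each call is recorded with its operation, the model that served it, and whether the result was judged correct. The Router class, the model names, and the min_samples cutoff are hypothetical choices for illustration.

```python
from collections import defaultdict


class Router:
    """Picks, per operation, the model with the best observed accuracy."""

    def __init__(self, default_model, min_samples=20):
        self.default_model = default_model
        self.min_samples = min_samples  # ignore models with too little signal
        # (operation, model) -> [correct_count, total_count]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, operation, model, correct):
        stats = self._stats[(operation, model)]
        stats[0] += int(correct)
        stats[1] += 1

    def choose(self, operation):
        best_model, best_accuracy = self.default_model, -1.0
        for (op, model), (correct, total) in self._stats.items():
            if op != operation or total < self.min_samples:
                continue
            accuracy = correct / total
            if accuracy > best_accuracy:
                best_model, best_accuracy = model, accuracy
        return best_model


router = Router(default_model="model-a", min_samples=2)
router.record("ticket_summary", "model-a", True)
router.record("ticket_summary", "model-a", False)
router.record("ticket_summary", "model-b", True)
router.record("ticket_summary", "model-b", True)

print(router.choose("ticket_summary"))      # model-b
print(router.choose("billing_extraction"))  # model-a (no signal yet, default)
```

This version is purely greedy. A real router would probably also need some exploration so that a model with little traffic still accumulates signal, and a confidence measure more careful than a raw ratio.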

Tagging. When a call fails — when the summary misses the lead, when the extraction halves the dollar figure — that failure is a labeled example. You tag the correction on the call record. You do not need to build a separate annotation pipeline to capture it. The failure and the correction live together, queryable as training data.
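
One way to picture "the failure and the correction live together" is a call record with an optional correction field. The CallRecord and CallLog names, the fields, and the sample tickets are assumptions made for this sketch, not a description of any specific product's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRecord:
    call_id: str
    operation: str
    input_text: str
    output_text: str
    correction: Optional[str] = None  # set only when someone tags a failure


class CallLog:
    """In-memory stand-in for wherever call records are actually stored."""

    def __init__(self):
        self._records = {}

    def add(self, record):
        self._records[record.call_id] = record

    def tag_correction(self, call_id, corrected_output):
        # The correction is attached to the original call, not stored elsewhere.
        self._records[call_id].correction = corrected_output

    def corrected(self, operation):
        """All calls for an operation that carry a correction."""
        return [
            r for r in self._records.values()
            if r.operation == operation and r.correction is not None
        ]


log = CallLog()
log.add(CallRecord("c1", "ticket_summary", "Refund of $40 never arrived",
                   "Customer asks about shipping."))
log.add(CallRecord("c2", "ticket_summary", "Cannot reset password",
                   "Customer cannot reset their password."))
log.tag_correction("c1", "Customer reports a missing $40 refund.")

training_candidates = log.corrected("ticket_summary")
print([r.call_id for r in training_candidates])  # ['c1']
```

Because the correction sits on the record itself, "give me every corrected call for this operation" is a single query rather than a join against a separate annotation store.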

Training. Once you have a corpus of corrections on a specific operation, you can train a task-specific adapter. That adapter is smaller than the general model, cheaper to serve, and more accurate on your specific task because it was trained on your specific traffic — not on a curated eval set, but on the actual production distribution with corrections attached.
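
Before any adapter can be trained, the corrections have to become training examples. The sketch below turns corrected call records into JSON Lines. The record fields and the prompt/completion keys are assumptions for illustration; the actual schema depends on whichever fine-tuning stack you use.

```python
import json


def to_training_jsonl(records):
    """Convert corrected call records into JSONL training examples.

    Records without a correction are skipped: the model's own output
    is not treated as a label here.
    """
    lines = []
    for record in records:
        correction = record.get("correction")
        if not correction:
            continue
        example = {
            "prompt": record["input_text"],  # what production actually sent
            "completion": correction,        # what the output should have been
        }
        lines.append(json.dumps(example))
    return "\n".join(lines)


# Illustrative records only.
records = [
    {"input_text": "Refund of $40 never arrived",
     "output_text": "Customer asks about shipping.",
     "correction": "Customer reports a missing $40 refund."},
    {"input_text": "Cannot reset password",
     "output_text": "Customer cannot reset their password.",
     "correction": None},
]

jsonl = to_training_jsonl(records)
print(jsonl)
```

The point of the sketch is the source of the data: the inputs come from production traffic and the targets come from tagged corrections, so the training set follows the real distribution rather than a curated one.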

The accuracy improvement from this loop does not come from a prompt revision. It comes from a training artifact. It generalizes to inputs you have not seen before because the underlying model has been adapted toward your distribution, not toward your eval set.

The conclusion

Prompt engineering solves a specific problem well: getting a model to behave in a particular way on a known set of inputs. It is the right tool for that problem.

It is not the right tool for reliability at scale. Reliability at scale requires observing what actually fails in production, capturing those failures as training signal, and letting the model narrow toward your specific task over time. The prompt is upstream of that loop. It is not a substitute for it.

The transition is not dramatic. You do not throw away the prompt. You add the loop.