Atlas-1 training overview

2026-05-15 · 4 min read

A high-level summary of the base model, adapter training procedure, and evaluation pipeline behind Atlas-1.

This note summarizes the training procedure behind Atlas-1 at a level of detail intended for technical evaluators and customers. Atlas-1 has two training surfaces: the base model that backs the routing layer, and the per-operation LoRA adapters that train continuously on customer golden datasets. This document covers both. We deliberately omit specific weight initializations, learning-rate schedules, and proprietary data mixtures. Customers with strategic agreements receive a more detailed methodology document under NDA.

Base model architecture

The Atlas-1 base model is a dense decoder used as the foundation for per-operation adapter training. The architecture is intentionally conservative: standard pre-normalization, rotary position embeddings, grouped-query attention, and SwiGLU activations in the feed-forward layers. We chose a dense decoder for two reasons.

First, dense models are easier to serve under latency SLOs. Our time-to-first-token targets are 210 ms at p50 and 640 ms at p95, and meeting both was substantially simpler with a single forward path. Second, per-operation adapter training is the central product feature, and dense models are far more amenable to small-data LoRA fine-tuning than mixture-of-experts architectures, whose experts have already specialized during pre-training.
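To make the latency targets concrete, here is a minimal sketch of checking a batch of time-to-first-token measurements against the two figures quoted above. The nearest-rank percentile method, the function names, and the sample values are assumptions made for illustration; the note does not describe how Atlas measures or aggregates latency.

```python
import math

# Targets quoted in the note, in milliseconds.
P50_TARGET_MS = 210.0
P95_TARGET_MS = 640.0


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(ttft_ms):
    """Return (ok, p50, p95) for a list of time-to-first-token samples."""
    p50 = percentile(ttft_ms, 50)
    p95 = percentile(ttft_ms, 95)
    ok = p50 <= P50_TARGET_MS and p95 <= P95_TARGET_MS
    return ok, p50, p95


# Hypothetical measurements, for illustration only.
samples = [150, 180, 190, 200, 205, 220, 260, 310, 480, 600]
ok, p50, p95 = meets_slo(samples)
```

A single slow tail sample is enough to fail the p95 check even when the median is comfortable, which is why both targets have to be tracked together.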

The context window is 1,000,000 tokens. We trained with extended position interpolation on a curated long-document corpus during the late pre-training phase, then refined long-context behavior during supervised fine-tuning on multi-hop retrieval and long-document synthesis tasks.
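The note does not say which interpolation variant was used. As a rough illustration of the general idea, the sketch below shows plain linear position interpolation for rotary embeddings: positions are scaled down so that the longest target position maps back into the range seen during earlier training. The original training length, head dimension, and rotary base are hypothetical values, not Atlas-1 settings.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotary angles for one position.

    scale < 1 compresses positions, which is linear position interpolation.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    scaled_pos = position * scale
    # One angle per pair of dimensions: pos * base^(-2i / head_dim).
    return [scaled_pos * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


TRAINED_CONTEXT = 8_192      # hypothetical original training length
TARGET_CONTEXT = 1_000_000   # context window quoted in the note
scale = TRAINED_CONTEXT / TARGET_CONTEXT

# Angles for the last position of the extended window.
last = rope_angles(TARGET_CONTEXT - 1, head_dim=64, scale=scale)
```

After scaling, the highest-frequency angle for the last target position stays below the hypothetical trained length, so the model sees rotary phases within the range it already learned.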

Per-operation adapter training

When a customer's operation accumulates enough golden signal — corrected outputs that have passed evaluators and been promoted to a dataset — Atlas trains a LoRA adapter specific to that operation. The adapter is substantially smaller than the base model and targets only the tasks that operation requires. This is the mechanism behind the accuracy improvement curve: as corrections accumulate, the adapter narrows to your specific task, and the general model's failure modes on your traffic become less relevant.
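To illustrate why an adapter can be much smaller than the weights it modifies, here is a minimal sketch of the standard LoRA formulation, in which a frozen weight matrix W is augmented by a scaled low-rank product B·A. The layer size, rank, and scaling value are hypothetical; the note does not disclose Atlas-1's adapter configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix W (d_out x d_in)."""
    return d_in * d_out


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    s = alpha / rank
    return [y + s * d for y, d in zip(base, delta)]


# Hypothetical layer: 4096 x 4096 projection with a rank-16 adapter.
ratio = lora_param_count(4096, 4096, 16) / dense_param_count(4096, 4096)
```

For this hypothetical layer the adapter holds 1/128 of the parameters of the matrix it adapts, which is the kind of size gap that makes training one adapter per operation practical.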

Adapters are versioned and gated by the same ship-gate evaluators the customer uses for dataset promotion. An adapter does not deploy until it clears the customer's minimum pass rate on their golden dataset. If it fails, the routing layer continues using the base model (or the previously promoted adapter) for that operation.
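The gating rule above can be sketched as a small decision function. This is an illustration under stated assumptions, not the production routing code: the `Adapter` type, the function names, and the convention that `None` means "serve the base model" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Adapter:
    operation: str
    version: int


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of golden-dataset examples on which the evaluators passed."""
    if not results:
        raise ValueError("golden dataset produced no evaluator results")
    return sum(results) / len(results)


def select_serving_target(
    candidate: Adapter,
    candidate_results: Sequence[bool],
    min_pass_rate: float,
    previous: Optional[Adapter] = None,
) -> Optional[Adapter]:
    """Return the adapter to serve, or None to fall back to the base model."""
    if pass_rate(candidate_results) >= min_pass_rate:
        return candidate
    # Candidate failed the gate: keep whatever was serving before.
    return previous


v1 = Adapter("invoice-extraction", 1)
v2 = Adapter("invoice-extraction", 2)
served = select_serving_target(v2, [True] * 7 + [False] * 3, 0.85, previous=v1)
```

In this hypothetical run the candidate passes 70 percent of the golden dataset against an 85 percent gate, so version 1 keeps serving.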

We do not train shared base models on customer-submitted data. This is a hard line. Per-operation adapter training is performed only on datasets explicitly marked as golden by the customer.

Pre-training data

Pre-training data for the base model was curated in three passes. The first was a broad web crawl filtered for language identification, deduplication, and low-quality boilerplate. The second pass scored documents on a learned quality classifier trained against editorial labels collected internally. The third pass applied per-domain weighting tuned against a held-out perplexity suite representative of customer workloads.

Approximate composition of the pre-training corpus, by token weight:

Code: twenty percent, heavily weighted toward Python, TypeScript, Go, and Rust
Technical writing (documentation, RFCs, papers, internal-style prose): twenty percent
Reference and reasoning material (math, science, structured tasks): fifteen percent
General web text: twenty percent
Books and long-form non-fiction: ten percent
Conversation and dialogue data: ten percent
A held-out reserve of synthetic data targeting under-represented operations: five percent
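As a quick arithmetic check, the shares above sum to one hundred percent. The sketch below turns them into per-domain token budgets; the dictionary keys and the total token count are hypothetical, since the note does not state the corpus size.

```python
# Shares by token weight, as listed in the note. Key names are illustrative.
MIXTURE_PERCENT = {
    "code": 20,
    "technical_writing": 20,
    "reference_and_reasoning": 15,
    "general_web": 20,
    "books_long_form": 10,
    "conversation_dialogue": 10,
    "synthetic_reserve": 5,
}


def token_budget(total_tokens):
    """Split a total token count across domains according to MIXTURE_PERCENT."""
    if sum(MIXTURE_PERCENT.values()) != 100:
        raise ValueError("mixture must sum to 100 percent")
    # Integer floor division; exact when total_tokens is a multiple of 100.
    return {name: total_tokens * pct // 100 for name, pct in MIXTURE_PERCENT.items()}


# Hypothetical corpus size, for illustration only.
budget = token_budget(1_000_000_000_000)
```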

Post-training

The post-training pipeline runs three stages. The first is supervised fine-tuning on a curated set of high-quality completions across structured generation, function-calling, retrieval-grounded answering, and tone-controlled writing. The second is direct preference optimization (DPO) against a preference dataset assembled from internal annotators and a small bootstrap pool of early customer feedback. The third is a calibration pass on refusal behavior: we explicitly trained Atlas to prefer an honest "I cannot answer this from the provided context" over a confident hallucination.
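For readers unfamiliar with direct preference optimization, here is a minimal sketch of the standard DPO loss for a single preference pair, computed from sequence log-probabilities under the policy being trained and under a frozen reference model. The `beta` value and the example log-probabilities are hypothetical; the note does not disclose the hyperparameters used for Atlas-1.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: zero margin, loss is log(2).
neutral = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Policy favors the chosen completion more than the reference does: lower loss.
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

The loss falls as the policy widens the gap between chosen and rejected completions relative to the reference, which is what pushes the model toward annotator-preferred outputs without a separate reward model.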

The calibration pass is the most distinctive piece. We treat refusal as a first-class output. Production evaluators routinely check that Atlas refuses cleanly on under-specified prompts. Our internal target is that ninety-five percent of intentionally under-specified prompts return a calibrated refusal rather than a fabricated answer.
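The ninety-five percent target can be checked with simple counting once each response to an under-specified prompt has been labeled as a refusal or an answer. The sketch below assumes those labels already exist; how an evaluator decides that a response is a calibrated refusal is not described in this note, and the function names are hypothetical.

```python
REFUSAL_TARGET_PERCENT = 95  # internal target quoted in the note


def refusal_summary(refused_flags):
    """Summarize refusals over responses to intentionally under-specified prompts.

    refused_flags: one boolean per prompt, True if the model refused cleanly.
    Returns (refused, total, meets_target).
    """
    flags = list(refused_flags)
    if not flags:
        raise ValueError("need at least one evaluated prompt")
    refused = sum(flags)
    total = len(flags)
    # Integer comparison avoids floating-point edge cases at exactly 95 percent.
    meets_target = 100 * refused >= REFUSAL_TARGET_PERCENT * total
    return refused, total, meets_target


# Hypothetical evaluation run: 19 clean refusals out of 20 prompts.
refused, total, meets_target = refusal_summary([True] * 19 + [False])
```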

Evaluation throughout training

We ran three classes of evaluation during training. The first was standard benchmarks for sanity (MMLU-Pro, HumanEval+, MBPP-Pro, MATH-500). The second was internal operation evaluations representative of customer workloads, such as ticket summarization, invoice extraction, contract clause classification, and code refactoring. The third was reliability evaluations specifically targeting the failure modes we have observed in production: under-specification, schema drift, and miscalibration.

The reliability evaluations are the ones we weighted most heavily. They are described in a separate writeup at /research/reliability-evaluations.

We do not publish headline benchmark scores. The scores exist and they are in the range you would expect from a capable dense decoder. We chose not to lead with them because they tell you less about Atlas than your per-operation eval curve does.

What is next

Atlas-2 base model work begins this quarter. We expect a modest architectural change combined with substantially more training compute on a refined data mixture. The biggest delta will be on the adapter training side, where we plan to fold customer golden datasets back into the base post-training pipeline as a recurring signal source, while continuing to run per-operation adapters in parallel with, and downstream of, the shared base.

Customer-facing rollout will follow our standard pattern: opt-in evaluation, comparative results published per-operation in the console, and migration only when a customer's evaluators clear the new version under its ship gates.

This is a high-level note. We are happy to talk in more depth with strategic customers under NDA. Reach us at research@newmen.ai.
