Researchers Say They Trained A Foundation Model From Scratch For About $1,500

Training a foundation LLM from scratch costs millions and requires internet-scale data, which is why most enterprises don’t bother. Sapient Intelligence thinks it has a cheaper way out.

To challenge this brute-force scaling dogma, Sapient researchers developed HRM-Text, which replaces the standard Transformer with a highly sample-efficient Hierarchical Reasoning Model (HRM), an architecture they first introduced last year.

HRM divides the computation into slow-evolving strategic and fast-evolving execution layers. Instead of brute-force autoregressive prediction on raw text, HRM-Text is trained exclusively on instruction-response pairs. This is closer to real-world enterprise settings, where users typically expect a targeted response to a specific task.

The researchers trained a 1B-parameter HRM-Text model from scratch at a fraction of the cost and token count of a typical LLM. The model performed competitively with much larger open models on major industry benchmarks.

For real-world AI applications, this means that foundational pre-training is no longer limited to high-resource institutions. With HRM-Text, organizations can cost-effectively train their own highly capable reasoning models from scratch and connect them to external knowledge repositories.

The Training Obstacle

When we train an LLM, we don’t really care whether it remembers the exact sequence of words in a random 2014 Reddit thread. What we want is for the model to develop a deep, implicit understanding of human language, logic, and facts.

The current approach is brute force: scrape the Internet, run next-token prediction trillions of times, and assume the model has developed a working internal model of the world.

In effect, we spend millions of dollars of compute forcing models to memorize everything scraped from the Internet so that they learn to reason as a side effect. For example, standard decoder-only models spend valuable computation calculating the loss for reconstructing the prompt, even though the user’s prompt is already known and provided at inference time.

Rather than viewing this simply as a computational bottleneck, the industry should recognize it as a serious business limitation. In comments provided to VentureBeat, Sapient Intelligence CEO Guan Wang framed this as an issue of "the economics of repetition."

"Enterprises today face three complex problems: training is expensive, infrastructure is heavy, and experimentation cycles are too slow," Wang said. "The industry’s scaling addiction says: ‘When the model fails, make it bigger. Add more data. Add more GPUs.’ It’s worked, but it’s reaching the point of diminishing returns. More scale often means more memory, more latency, more infrastructure, and more vendor dependency. This does not necessarily give an enterprise a better logic engine."

This architectural and computational inefficiency is why retrofitting existing dense transformers is not always a silver bullet for enterprises. Fine-tuning a model while preserving its general capabilities often requires mixing substantial general-purpose data into the process, which makes it computationally heavy and difficult to control.

"Imagine a hedge fund, insurer or bank that has immense proprietary data: internal research notes, transaction logic, compliance rules, analyst memos, risk models, portfolio constraints," Wang said. "They may not want to send that data to an external frontier model, and they may not need a huge general-purpose model that remembers the Internet. They need a compact reasoning core that can learn their task structure and the logic between rules and numbers, and run in a controlled environment."

Because HRM-Text focuses its computation strictly on task completion and latent logic, it allows enterprises to start with a small, smart model and adapt it to a proprietary domain with very little infrastructure.

Rethinking Architecture with HRM-Text

HRM, which was introduced in 2025, represents a fundamental departure from the traditional transformer model. To create a more sample-efficient engine, HRM divides the computation into slow-evolving strategic and fast-evolving execution layers. The faster L-module performs local iterative refinement, while the slower H-module maintains stable semantic context throughout the cycle. The processing consists of two high-level cycles, where each cycle executes three fast L-module updates followed by one slow H-module update.
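The update schedule described above (two high-level cycles, each running three fast L-module updates followed by one slow H-module update) can be sketched as a nested loop. This is only an illustration of the schedule: the function names, the scalar states, and the toy averaging updates are assumptions made for the example, not Sapient's implementation, in which both modules are learned networks operating on hidden-state tensors.

```python
from typing import Callable, List, Tuple


def hrm_schedule(
    x: float,
    l_step: Callable[[float, float, float], float],
    h_step: Callable[[float, float], float],
    n_cycles: int = 2,           # high-level cycles, as stated in the article
    l_steps_per_cycle: int = 3,  # fast updates per cycle, as stated in the article
) -> Tuple[float, List[str]]:
    """Run the nested fast/slow loop and record the order of updates."""
    z_l, z_h = 0.0, 0.0  # hypothetical initial states (scalars for illustration)
    trace: List[str] = []
    for _ in range(n_cycles):
        for _ in range(l_steps_per_cycle):
            # Fast module: refines its state using the current slow context.
            z_l = l_step(z_l, z_h, x)
            trace.append("L")
        # Slow module: updates once per cycle from the refined fast state.
        z_h = h_step(z_h, z_l)
        trace.append("H")
    return z_h, trace


# Toy stand-ins for the learned modules (assumptions, not the real updates).
def toy_l_step(z_l: float, z_h: float, x: float) -> float:
    return 0.5 * (z_l + z_h + x)


def toy_h_step(z_h: float, z_l: float) -> float:
    return 0.5 * (z_h + z_l)


final_state, trace = hrm_schedule(1.0, toy_l_step, toy_h_step)
print("".join(trace))  # LLLHLLLH
```

The trace makes the timescale separation visible: the slow state changes only twice per forward pass, while the fast state changes six times.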

Standard parameter-shared recurrent architectures (such as Samsung’s TRM) can sometimes handle small logic puzzles, but Sapient researchers found that they become highly unstable when scaled up to 1 billion parameters for language tasks. The separation between the slow H-module and the fast L-module in HRM is mathematically necessary, not just an aesthetic choice. As Wang said: "For logic grids, you can sometimes get away with a small recursive mechanism because the world is clean and finite. Language is not like that. Language requires both fast local refinement and slow semantic stability."

The original HRM proved highly effective for controlled, symbolic logic problems, but researchers hit a hurdle when applying it to the vast, open-ended complexity of general language modeling. HRM’s loops make it an efficient thinker, but those same loops make it numerically unstable to train on the diverse chaos of human language. Running recurrent loops on language produces large-scale instabilities, in particular exploding or vanishing gradients.

To prevent this runaway feedback, the researchers introduced two major architectural innovations in HRM-Text. First, they developed MagicNorm, a normalization technique designed to keep internal signals stable no matter how many times the model loops through its thought process.

Second, they devised a warm-up method to stabilize training. Early in training, the model is run only on small, shallow logic cycles. As training progresses, the system warms up, gradually feeding the model deeper and longer logic sequences.
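The article does not give the warm-up schedule itself, so the following is only a sketch of one plausible shape for it: a step-wise linear ramp of recurrence depth over a fixed number of training steps. The function name `depth_at_step`, the ramp shape, and the example values (1,000 warm-up steps, depths 1 through 4) are all assumptions chosen for illustration.

```python
def depth_at_step(step: int, warmup_steps: int, min_depth: int, max_depth: int) -> int:
    """Return the recurrence depth to use at a given training step.

    Hypothetical schedule: depth grows in equal-width stages from
    min_depth to max_depth during warm-up, then stays at max_depth.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return max_depth
    frac = max(step, 0) / warmup_steps  # progress through warm-up, in [0, 1)
    n_stages = max_depth - min_depth + 1
    return min_depth + int(frac * n_stages)


# Illustrative values only; the real schedule is not described in the article.
schedule = [depth_at_step(s, 1000, 1, 4) for s in (0, 250, 500, 750, 1000)]
print(schedule)  # [1, 2, 3, 4, 4]
```

Starting shallow limits how many times gradients pass through the same weights early on, which is when exploding or vanishing gradients would do the most damage.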

They also switched the training objective from next-token prediction to task completion, where the model is rewarded only on the completed response, not the individual tokens it generates. To support this objective, they changed HRM-Text’s training data from raw text to instruction-response pairs only.
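The article does not spell out the loss function. One reading, consistent with the earlier point that decoder-only models waste compute reconstructing a prompt that is already known, is that the loss is computed over response positions only. The sketch below compares that with a full-sequence loss. The function names and the example probabilities are made up for illustration, and this should not be taken as Sapient's actual objective.

```python
import math
from typing import List


def full_sequence_nll(correct_token_probs: List[float]) -> float:
    """Mean negative log-likelihood over every position (prompt + response)."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


def response_only_nll(correct_token_probs: List[float], prompt_len: int) -> float:
    """Mean negative log-likelihood over response positions only.

    Prompt positions are masked out, so no gradient signal is spent
    on reconstructing text the user supplies at inference time.
    """
    response = correct_token_probs[prompt_len:]
    if not response:
        raise ValueError("sequence contains no response tokens")
    return -sum(math.log(p) for p in response) / len(response)


# Probability the model assigns to the correct token at each position
# (hypothetical numbers): two prompt tokens, then two response tokens.
probs = [0.1, 0.1, 0.5, 0.5]
full = full_sequence_nll(probs)
masked = response_only_nll(probs, prompt_len=2)
print(f"full-sequence loss: {full:.4f}")
print(f"response-only loss: {masked:.4f}")
```

In this toy case the full-sequence loss is dominated by the hard-to-predict prompt tokens, while the masked loss reflects only how well the response was produced.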

HRM-Text in Action

The researchers created a highly compact 1-billion-parameter HRM-Text model. Instead of using a standard multi-stage pipeline, which requires churning through trillions of tokens of raw Internet text, they trained it from scratch on a tightly curated dataset of just 40 billion tokens. The training data consisted entirely of instruction-response pairs spanning general instructions, mathematics, symbolic reasoning, textbook exercises, and restated knowledge.

They trained the model using the task-completion objective. To force the model to rely on its internal hierarchical architecture rather than copying the logic step by step, they explicitly removed "thinking" tokens from the training data.

The model was evaluated across a diverse suite of standard foundation-model benchmarks, weighted heavily toward knowledge, logic, reasoning, mathematics, and comprehension. The researchers tested HRM-Text against both small models and high-resource open-weight and fully open models.

The results show a significant shift in the compute-to-performance threshold. The 1B-parameter HRM-Text achieved 60.7% on MMLU, 84.5% on GSM8K, and 56.2% on MATH. This performance is highly competitive with (and in many cases even surpasses) foundation models in the 2B to 7B parameter range.

The most important measures for the enterprise audience are the efficiency statistics and their practical implications. Pre-training a foundation model from scratch is typically a multi-million dollar endeavor for tech giants. HRM-Text was trained in just 1.9 days on a cluster of 16 GPUs. The total estimated computation cost was approximately $1,500. It achieved its competitive scores using 100 to 900 times fewer training tokens and 96 to 432 times fewer inference calculations than models such as Qwen, Gemma, and Llama.
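As a back-of-the-envelope check, the reported figures can be combined to see what they imply. The inputs below (1.9 days, 16 GPUs, about $1,500, a 40-billion-token dataset, and the 100x to 900x token ratio) come from the article; the per-GPU-hour rate and the comparison-model token budgets are derived here and are not numbers Sapient reported.

```python
# Figures reported in the article.
TRAINING_DAYS = 1.9
NUM_GPUS = 16
TOTAL_COST_USD = 1500
TRAINING_TOKENS = 40_000_000_000      # 40 billion
TOKEN_RATIO_LOW, TOKEN_RATIO_HIGH = 100, 900

# Derived quantities (not reported by Sapient).
gpu_hours = TRAINING_DAYS * 24 * NUM_GPUS
cost_per_gpu_hour = TOTAL_COST_USD / gpu_hours
implied_tokens_low = TRAINING_TOKENS * TOKEN_RATIO_LOW
implied_tokens_high = TRAINING_TOKENS * TOKEN_RATIO_HIGH

print(f"GPU-hours: {gpu_hours:.1f}")                          # 729.6
print(f"Implied cost per GPU-hour: ${cost_per_gpu_hour:.2f}")  # $2.06
print(f"Implied comparison budgets: "
      f"{implied_tokens_low / 1e12:.0f}T to {implied_tokens_high / 1e12:.0f}T tokens")
```

The implied rate of roughly $2 per GPU-hour and comparison budgets of 4 trillion to 36 trillion tokens are in line with the "trillions of tokens" scale the article attributes to conventional pre-training.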

Another important takeaway is the separation of reasoning from knowledge recall. From a practical perspective, HRM-Text’s success on logic-heavy tasks despite its small 40B-token training regimen indicates that a model does not need to memorize the entire Internet to become a smart logic engine.

For enterprise applications, this behavior is a feature, not a bug. The researchers envision a future where businesses deploy highly compact, incredibly cheap recurrent models that work as a "logic root" specialized for business logic. Instead of being forced to memorize the company database during pre-training, the model acts as a reasoning engine and relies on external retrieval systems for factual knowledge.

Critics have pointed out that comparing a model trained on instruction-response pairs with models trained on raw text is an "apples-to-oranges" comparison. Wang rejects this framing, explaining that every serious modern LLM sees instruction-response data during training or alignment. "So the comparison is not apples to oranges. It is closer to an apple core and an apple. We started directly with the main task format because that’s how people actually use models: they give an instruction and expect a useful response," he said.

The researchers also ran rigorous contamination tests to ensure that the model was not simply memorizing benchmark answers. On DROP, a benchmark that shows marginal contamination signals under one specific setting, HRM-Text still scored 81.1% on the strictly clean, 0% contamination subset.

Ultimately, Wang argues that for enterprises, "the true evaluation is not common knowledge. It is a workflow evaluation… Give HRM-Text a task such as multi-step financial logic, compliance logic, scientific workflow automation, or structured extraction followed by logic."

Practical Implementation and the Future of Enterprise AI

While the benchmark scores and cost efficiency are striking, Sapient is clear about the current limitations of the model. The initial release is best viewed as a proof of concept, similar to the initial GPT release, designed to demonstrate the unique benefits of the architecture.

"To be honest, HRM-Text is not a plug-and-play ChatGPT replacement yet," Wang said. "It is a compact foundation language logic model. For an enterprise engineering team, operational work primarily revolves around templates, mode selection, attention masking, and alignment."

For AI engineering teams looking to experiment, getting started requires some specific, but standard, text-generation discipline. The model lists native support in the Hugging Face Transformers library (requires transformers >= 5.9.0), and usage paths for vLLM and SGLang are in active development. The primary engineering work involves managing the PrefixLM design: production multi-turn chat applications will require careful KV-cache logic to ensure that user prompts receive full bidirectional attention while the assistant’s outputs remain causal.
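To make the PrefixLM requirement concrete, here is a minimal sketch of an attention mask in which tokens inside a user turn attend to the whole of that turn, while assistant tokens attend only to earlier positions. The function name `prefix_lm_mask`, the `(role, length)` segment format, and the choice to make each user turn bidirectional only within itself are assumptions for illustration; HRM-Text's actual masking rules may differ.

```python
from typing import List, Tuple


def prefix_lm_mask(segments: List[Tuple[str, int]]) -> List[List[bool]]:
    """Build a [query][key] boolean mask; True means the query may attend to the key.

    segments: ordered (role, token_count) pairs, role is "user" or "assistant".
    """
    roles: List[str] = []
    turn_ids: List[int] = []
    for turn_id, (role, length) in enumerate(segments):
        roles.extend([role] * length)
        turn_ids.extend([turn_id] * length)

    total = len(roles)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            causal = k <= q  # every token sees itself and the past
            # User tokens additionally see later tokens in their own turn.
            same_user_turn = roles[q] == "user" and turn_ids[k] == turn_ids[q]
            mask[q][k] = causal or same_user_turn
    return mask


# One user turn of 2 tokens followed by an assistant turn of 2 tokens.
mask = prefix_lm_mask([("user", 2), ("assistant", 2)])
rows = ["".join("1" if allowed else "0" for allowed in row) for row in mask]
for row in rows:
    print(row)
# 1100
# 1100
# 1110
# 1111
```

The KV-cache implication follows from the first two rows: a user token's representation depends on later tokens in the same turn, so a user turn has to be encoded as a complete block before its keys and values can be cached, whereas assistant tokens can be appended to the cache one at a time.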

"When the cost of training a capable reasoning model drops to about $1,500, AI is no longer just an infrastructure question and becomes a strategy question," Wang said. "A Fortune 500 company no longer needs to ask, ‘Can we buy the foundation model?’ It will ask, ‘What should our model know about our business, and what type of logic should it be optimized for?’"
