
Nous Research, a San Francisco-based artificial intelligence startup, on Tuesday released an open-source mathematical reasoning system called Nomos 1 that achieved near-elite human performance on this year's William Lowell Putnam Mathematical Competition, one of the world's most prestigious and notoriously difficult undergraduate mathematics competitions.
The Putnam is famous for its difficulty: a perfect score is 120, but the 2024 top score was 90 and the average was just 2. Nomos 1 scored 87 points – a result that would have placed second among the 3,988 participants in the 2024 competition, according to the company.
This release marks a turning point in the rapidly intensifying race to build AI systems capable of sophisticated mathematical reasoning. Unlike the huge, compute-intensive models deployed by major technology companies, Nomos 1 achieves its results with a relatively compact architecture: 30 billion parameters with about 3 billion active at any time, using a mixture-of-experts design based on Alibaba's Qwen3 model.
"This score will rank #2/3988 in 2024 and is our first step with Hillclimb AI towards building SOTA AI mathematicians," Nous Research announced on social media on Tuesday.
Without Nous Research's specialized training, the same base model scored just 24 points
Perhaps most striking is the gap between Nomos 1 and its base model. When Nous Research ran the same Qwen3-30B-A3B-Thinking-2507 model through the same testing harness, it scored only 24 points out of 120 – a result that underscores how much post-training optimization and specialized reasoning techniques matter relative to raw model scale.
"Nomos 1 achieved 87/120 with 8 perfect scores," the company said, noting the performance gap. "The main reason for this is post-training and data quality rather than the harness."
The results were verified through blind grading by a human expert who had previously finished in the top 200 on the Putnam. Nous Research provided anonymized submissions to the grader, then published the complete set of de-anonymized files and the runbooks used to generate them on GitHub.
Why is the Putnam Competition considered the ultimate test of mathematical reasoning?
The William Lowell Putnam Mathematical Competition is an annual mathematics competition for undergraduate students enrolled in institutions of higher education in the United States and Canada. It is widely considered the most prestigious university-level mathematics competition in the world.
The notoriously brutal exam is more of a mathematical sport than an academic test. It consists of two three-hour sessions separated by a two-hour break, with six problems per session. Each of the 12 problems is worth 10 points, for a maximum score of 120.
Putnam problems are not the kind found on regular exams or in textbooks. They are closer to puzzles than to computations, often requiring students to reframe a problem entirely before a solution becomes visible.
Last year, nearly 4,000 students across the continent sat the Putnam. Sixty-one percent scored three points or fewer, according to the Mathematical Association of America, which organizes the competition. The highest score was 90 out of 120.
Many Putnam Fellows have become distinguished researchers in mathematics and other fields, including three Fields Medalists – John Milnor, David Mumford and Daniel Quillen – and two Nobel Prize winners in physics – Richard Feynman and Kenneth Wilson.
Inside the two-phase reasoning harness that powers Nomos 1's mathematical breakthrough
Nomos 1 is a specialization of Qwen's Qwen3-30B-A3B-Thinking model, optimized for mathematical problem-solving and proof-writing in natural language. The system was developed in collaboration with Hillclimb AI.
What differentiates Nomos 1 from simple model inference is its sophisticated reasoning harness – an open-source framework that governs how the model approaches and solves problems. The harness operates in two separate phases within a three-hour time limit, mirroring the actual Putnam competition structure.
In the solution phase, parallel workers tackle problems simultaneously using a priority-based system. Each worker selects a problem, drafts a submission, then grades its own work on a scale of 1 to 7. Problems with the lowest self-assessed scores receive priority, ensuring that the system concentrates its compute on the hardest challenges. The process continues until every problem has reached a target number of self-graded perfect scores or time has expired.
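The worker loop described above can be sketched in a few lines. This is a minimal illustration, not Nous Research's actual code: the `attempt` function is a stand-in for a model call, and the priority queue keys on each problem's best self-assessed score so the least-solved problem is always attempted next.

```python
import heapq
import random

# Illustrative sketch of the solution phase: repeatedly pick the problem
# with the lowest best self-score so far, draft a submission, and grade
# it from 1 to 7. All names and the scoring model are hypothetical.

TARGET_PERFECTS = 2   # stop once each problem has this many self-graded 7s
MAX_ATTEMPTS = 500    # stand-in for the wall-clock time limit

def attempt(problem_id, rng):
    """Stand-in for a model call: returns (submission_text, self_score 1-7)."""
    score = rng.randint(1, 7)
    return f"solution draft for {problem_id}", score

def solution_phase(problem_ids, seed=0):
    rng = random.Random(seed)
    perfects = {p: 0 for p in problem_ids}
    submissions = {p: [] for p in problem_ids}
    # Priority queue keyed by best self-score so far (lowest first).
    heap = [(0, p) for p in problem_ids]
    heapq.heapify(heap)
    for _ in range(MAX_ATTEMPTS):          # "until time expires"
        if all(n >= TARGET_PERFECTS for n in perfects.values()):
            break                           # every problem hit the target
        best_score, p = heapq.heappop(heap)
        text, score = attempt(p, rng)
        submissions[p].append((text, score))
        if score == 7:
            perfects[p] += 1
        # Re-queue with the (possibly improved) best score for this problem.
        heapq.heappush(heap, (max(best_score, score), p))
    return submissions

subs = solution_phase([f"A{i}" for i in range(1, 7)])
```

Because the queue always surfaces the lowest-scoring problem, compute naturally flows to whichever problem is furthest from a perfect self-grade.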
The finalization phase begins 15 minutes before the deadline (or at the 50% mark for shorter runs) and follows a two-step selection process. First, an aggregation step groups the submissions by the answers they reach and attempts to identify the correct group – importantly, not necessarily the largest one. Then a single-elimination pairwise tournament determines the final submission for each problem.
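The two finalization steps can be illustrated with a short sketch. The judging callables here (`answer_of`, `group_score`, `pairwise_winner`) are hypothetical placeholders for model-based judgments; only the group-then-tournament control flow mirrors the description above.

```python
from collections import defaultdict

# Illustrative sketch of the finalization phase: bucket submissions by
# final answer, pick the group judged most likely correct (not necessarily
# the largest), then run a single-elimination tournament inside it.

def finalize(submissions, answer_of, group_score, pairwise_winner):
    # 1. Aggregation: group submissions by the answer they reach.
    groups = defaultdict(list)
    for sub in submissions:
        groups[answer_of(sub)].append(sub)
    # Choose the group with the highest judged score -- note this need
    # not be a simple majority vote over group sizes.
    best_answer = max(groups, key=lambda a: group_score(a, groups[a]))
    pool = groups[best_answer]
    # 2. Single-elimination pairwise tournament within the chosen group.
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            winners.append(pairwise_winner(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:          # odd one out gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]

# Toy usage with stand-in judges: score groups by size, prefer longer write-ups.
subs = ["x=2 (short)", "x=2 (detailed derivation)", "x=3 (guess)"]
final = finalize(
    subs,
    answer_of=lambda s: s.split()[0],
    group_score=lambda a, g: len(g),
    pairwise_winner=lambda a, b: max(a, b, key=len),
)
```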
"Our open-source reasoning system consists of a solution phase, where workers attempt the least-solved problem and self-evaluate, followed by a finalization phase, which consolidates the submissions to select a final submission for each problem," Nous Research explained.
How Nomos 1 compares to mathematical AI systems from DeepSeek, Google, and OpenAI
Nomos 1's results arrive amid a flurry of advances in mathematical reasoning AI. DeepSeek's model, DeepSeekMath-V2, scored 118 out of 120 on problems from the 2024 William Lowell Putnam Mathematical Competition, beating the top human score of 90. The model also performed at the level of gold medalists at the International Mathematical Olympiad.
Earlier this year, an advanced version of Google's Gemini Deep Think worked end-to-end in natural language at the International Mathematical Olympiad, producing rigorous mathematical proofs directly from the official problem statements – all within the 4.5-hour competition time limit.
What makes Nomos 1's achievement remarkable is not raw performance – it trails DeepSeek's 118/120 – but accessibility and efficiency. At 30 billion parameters with only 3 billion active, the model can run on consumer-grade hardware, a sharp contrast to the huge computing clusters required by OpenAI's and Google's frontier models.
Hermes 4.3 arrived just six days ago, trained on a decentralized blockchain network
The announcement of Nomos 1 follows Nous Research’s December 3 release of Hermes 4.3, a general-purpose language model that marked another important milestone for the company.
Hermes 4.3, based on ByteDance's Seed-OSS-36B-Base model, is the first production model Nous Research has trained entirely on its Psyche network – a distributed training infrastructure that uses a novel optimizer called DisTrO to coordinate training across nodes in separate data centers over the open internet, secured by consensus on the Solana blockchain.
The company trained Hermes 4.3 through both traditional centralized methods and the Psyche network, specifically to verify that distributed training could match or exceed centralized performance for production workloads. The company reported that the Psyche-trained version outperformed the centralized version across a set of downstream tasks.
"The training run proved stable throughout, averaging 144k tokens/sec spread across 24 Psyche nodes," Nous Research said. "Using DisTrO's overlapped collective strategy, the entirety of the P2P communication is hidden behind training time, effectively achieving throughput equivalent to traditional, centralized training."
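The overlap idea in that quote follows a standard pattern: while step t's gradients are being computed, step t-1's update travels over the network in the background, so communication adds little to wall-clock time. A generic sketch of that pattern – not DisTrO's actual algorithm – with stand-in compute and communication functions:

```python
import threading
import time

# Generic compute/communication overlap: exchange the previous step's
# update in a background thread while the current step computes.

def compute_step(step):
    time.sleep(0.01)            # stand-in for a forward/backward pass
    return f"update-{step}"

def exchange(update, log):
    time.sleep(0.01)            # stand-in for P2P communication
    log.append(update)

def train(num_steps):
    applied = []
    comm_thread = None
    for step in range(num_steps):
        update = compute_step(step)       # compute overlaps with ...
        if comm_thread is not None:
            comm_thread.join()            # ... the last step's exchange
        comm_thread = threading.Thread(target=exchange, args=(update, applied))
        comm_thread.start()
    if comm_thread is not None:
        comm_thread.join()                # flush the final exchange
    return applied

result = train(4)
```

Because each exchange is joined before the next one starts, updates arrive in order while their network cost is hidden behind the following step's compute.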
Hermes 4.3 also achieved state-of-the-art results on RefusalBench, a new benchmark that measures a model's willingness to be helpful in scenarios that other models typically refuse. The model answered 74.60% of RefusalBench queries in non-reasoning mode, surpassing its predecessor Hermes 4 70B (59.50%) and outperforming proprietary models including Grok 4 (51.30%) and Gemini 2.5 Pro (24.23%).
Small models with smart training are closing the gap with trillion-parameter giants
Together, the two releases in the same week signal Nous Research's strategic bet: smaller, more efficient models with sophisticated training techniques and reasoning harnesses can compete with – and in some cases outperform – larger models developed by better-funded competitors.
For enterprise decision makers, the implications are significant. The application of mathematical reasoning abilities goes far beyond academic competitions: they are essential for formal verification, theorem proving, scientific modeling, cryptographic analysis, and any domain requiring rigorous logical deduction.
The open-source nature of both releases – Nomos 1 is available under the Apache 2.0 license on Hugging Face, with the full reasoning harness on GitHub – means organizations can deploy these capabilities on their own infrastructure without relying on API calls to major cloud providers.
"For the first time, anyone can run or access a cutting-edge AI mathematician," one observer noted on social media. "It lowers the barrier to serious mathematics research, proof verification, modeling of complex systems, and advanced logic work."
Key contributors to Nomos 1 include Roger Zinn, who led training; Jeffrey Quesnel and Dakota Mahan, who built the infrastructure; Chen Guang, who advised; and Ryan Teknium and Jeffrey Quesnel, who provided leadership. The model was developed with contributions from Hillclimb AI and a team of mathematics experts including Samuel Kim, Miron Yurkevich and others.
The race to create AI mathematicians is happening faster than anyone anticipated
The 86th Putnam Competition took place on Saturday, December 6, 2025 – just three days before Nomos 1 was released by Nous Research. The timing underscores how rapidly the field is advancing: Companies are now releasing mathematical AI systems that are capable of achieving nearly elite human performance within days of the competitions they are designed to solve.
The competition in mathematical AI has intensified dramatically in recent months. In July, an improved version of Google DeepMind's Gemini model and OpenAI's experimental reasoning model both achieved gold-medal status at the 2025 International Mathematical Olympiad. DeepSeek's new model matched their performance, solving 5 of 6 problems.
But the resource requirements of those frontier systems remain prohibitive for most organizations. OpenAI's o1-pro is estimated at more than 1.8 trillion parameters; Google's Gemini 2.5 Pro likely exceeds 400 billion. In contrast, Nomos 1 achieves competitive results with a fraction of that footprint.
The gap between the massive frontier model and efficient open-source alternatives is narrowing. And for organizations that need mathematical reasoning capabilities without the budget for hyperscale compute, that gap may now have closed substantially.
As one observer put it on social media: "This is a significant leap forward for AI math models that are small enough to run on your laptop."
A laptop that can now outperform nearly 4,000 of the continent’s best undergraduate mathematicians.