Nvidia debuts Nemotron 3 with hybrid MoE and Mamba-Transformer to drive efficient agentic AI

Nvidia launched a new version of its frontier model family, Nemotron 3, built on a hybrid Mamba-Transformer mixture-of-experts architecture that the world’s most valuable company said provides greater accuracy and reliability for agents.

Nemotron 3 will be available in three sizes: Nemotron 3 Nano, a 30B-parameter model primarily for targeted, highly efficient tasks; Nemotron 3 Super, a 100B-parameter model for multi-agent applications with higher-accuracy reasoning; and Nemotron 3 Ultra, with a larger reasoning engine and approximately 500B parameters for more complex applications.

To create the Nemotron 3 models, Nvidia said it turned to a hybrid mixture-of-experts (MoE) architecture to improve scalability and efficiency. By using this architecture, Nvidia said in a press release, its new models provide enterprises with greater openness and performance when building multi-agent autonomous systems.

Kari Briski, Nvidia’s vice president for generative AI software, told reporters at a briefing that the company wanted to demonstrate its commitment to learning and improving from previous iterations of its models.

“We believe we are uniquely positioned to serve a wide range of developers who want full flexibility to customize models to build specialized AI by combining the new hybrid mixture-of-experts architecture with a 1-million-token context length,” Briski said.

Nvidia said early adopters of the Nemotron 3 model include Accenture, CrowdStrike, Cursor, Deloitte, EY, Oracle Cloud Infrastructure, Palantir, Perplexity, ServiceNow, Siemens and Zoom.

Breakthrough Architecture

Nvidia has been using a hybrid Mamba-Transformer mixture-of-experts architecture for many of its models, including Nemotron-Nano-9B-V2.

The architecture is based on research from Carnegie Mellon University and Princeton, weaving in selective state-space models that process long sequences while maintaining a running state, which keeps computation costs down even over long contexts.
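
To make that concrete, below is a minimal sketch of the kind of selective state-space recurrence popularized by the Mamba line of work. Everything here is illustrative: the random projections stand in for parameters a real model would learn, and none of the names or shapes reflect Nvidia’s actual implementation. The point is the mechanism, a fixed-size state carried forward token by token instead of attention over every previous token.

import numpy as np

def selective_ssm_scan(x, d_state=16, seed=0):
    """Minimal selective state-space scan (Mamba-style), per channel.

    Illustrative only: random stand-ins for learned parameters. Memory
    stays O(d_model * d_state) no matter how long the sequence is.
    """
    T, d_model = x.shape
    rng = np.random.default_rng(seed)
    A = -np.exp(rng.standard_normal((d_model, d_state)))  # negative => stable decay
    W_delta = 0.1 * rng.standard_normal(d_model)
    W_B = 0.1 * rng.standard_normal((d_model, d_state))
    W_C = 0.1 * rng.standard_normal((d_model, d_state))

    h = np.zeros((d_model, d_state))   # fixed-size state, unlike a growing KV cache
    y = np.empty_like(x)
    for t in range(T):
        delta = np.logaddexp(0.0, x[t] * W_delta)  # softplus: input-dependent step size
        B_t = x[t] @ W_B                           # selective input projection
        C_t = x[t] @ W_C                           # selective readout projection
        A_bar = np.exp(delta[:, None] * A)         # discretized state transition
        h = A_bar * h + delta[:, None] * np.outer(x[t], B_t)
        y[t] = h @ C_t
    return y

# Example: a 4,096-token sequence of 64-dim features; the state never grows.
out = selective_ssm_scan(np.random.default_rng(1).standard_normal((4096, 64)))

Because the state stays the same size, each new token costs roughly the same to process, which is what makes very long contexts tractable.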

Nvidia said the design “achieves 4x higher token throughput” than the previous-generation Nemotron Nano 2 and can significantly cut inference costs by reducing reasoning token generation by up to 60%.

“We really need to be able to increase that efficiency and reduce the cost per token. And you can do that in a number of ways, but we’re really doing that through innovations to that model architecture,” Briski said. “The hybrid Mamba-Transformer architecture runs many times faster with less memory, because it avoids these huge attention maps and key-value caches for every single token.”
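
A rough back-of-envelope calculation shows why avoiding per-token key-value caches matters. The configuration numbers below are arbitrary assumptions chosen for illustration, not Nemotron 3’s actual dimensions.

# Illustrative memory comparison: growing KV cache vs. fixed SSM state.
n_layers, n_kv_heads, head_dim = 48, 8, 128   # assumed attention config
d_model, d_state = 4096, 16                    # assumed SSM config
bytes_fp16 = 2

def kv_cache_bytes(tokens):
    # Attention stores a key and a value per token, per layer, per KV head.
    return tokens * n_layers * n_kv_heads * head_dim * 2 * bytes_fp16

def ssm_state_bytes():
    # A state-space layer keeps one fixed-size state, independent of tokens.
    return n_layers * d_model * d_state * bytes_fp16

for t in (8_000, 128_000, 1_000_000):
    print(f"{t:>9} tokens: KV cache {kv_cache_bytes(t)/2**30:7.2f} GiB "
          f"vs SSM state {ssm_state_bytes()/2**20:6.2f} MiB")

Under these assumptions the KV cache passes 180 GiB at a million tokens while the state-space memory stays at a few megabytes, which is the asymmetry Briski is pointing at.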

Nvidia also introduced an additional innovation for the Nemotron 3 Super and Ultra models. For these, Briski said, Nvidia has “deployed a breakthrough called latent MoE.”

“You have all these specialists in your model who share a common core and only keep a small portion private. It’s kind of like chefs who share a big kitchen but each keep their own spice rack,” Briski said.
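
Nvidia did not publish the details of latent MoE at the briefing, but Briski’s kitchen analogy suggests something like the sketch below: one large shared projection acts as the common core, and each expert adds only a small private adapter. Treat this as a hedged guess at the concept, with every name and dimension invented for illustration.

import numpy as np

class LatentMoELayer:
    """Hypothetical sketch of the 'shared kitchen, private spice rack' idea.

    Assumption: all experts route through one shared projection and differ
    only in small per-expert adapters, so parameters grow slowly with the
    number of experts. A guess at the concept, not Nvidia's actual design.
    """
    def __init__(self, d_model=512, d_shared=1024, d_private=64,
                 n_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.W_shared = rng.standard_normal((d_model, d_shared)) / np.sqrt(d_model)
        # Small private in/out adapters per expert (the "spice racks").
        self.W_in = rng.standard_normal((n_experts, d_shared, d_private)) / np.sqrt(d_shared)
        self.W_out = rng.standard_normal((n_experts, d_private, d_model)) / np.sqrt(d_private)
        self.router = rng.standard_normal((d_model, n_experts))
        self.top_k = top_k

    def __call__(self, x):                            # x: (d_model,)
        core = np.tanh(x @ self.W_shared)             # computed once, shared by all experts
        logits = x @ self.router
        top = np.argsort(logits)[-self.top_k:]        # route to the top-k experts
        gates = np.exp(logits[top]); gates /= gates.sum()
        y = np.zeros_like(x)
        for g, e in zip(gates, top):
            y += g * (np.maximum(core @ self.W_in[e], 0.0) @ self.W_out[e])
        return y

layer = LatentMoELayer()
print(layer(np.random.default_rng(1).standard_normal(512)).shape)  # (512,)

The payoff of a design like this is that adding an expert costs only the small private portion of weights rather than a full expert’s worth.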

Nvidia is not the only company using this type of architecture. AI21 Labs recently used it in its Jamba Reasoning 3B model.

The Nemotron 3 models also benefited from extended reinforcement learning. The larger models, Super and Ultra, were trained with the company’s 4-bit NVFP4 training format, which allows them to be trained on existing infrastructure without compromising accuracy.
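
For readers unfamiliar with the format, the snippet below sketches block-scaled 4-bit floating-point quantization in the spirit of NVFP4, in simplified form. The actual NVFP4 recipe stores E2M1 elements with an FP8 scale per 16-value block plus a tensor-level scale; this illustration keeps the block scale in full precision to stay short.

import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 magnitudes

def quantize_fp4_block_scaled(x, block=16):
    """Simplified block-scaled FP4 round-trip (quantize then dequantize)."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]  # map block max to 6.0
    scale[scale == 0] = 1.0
    scaled = x / scale
    # Snap each value to the nearest representable signed E2M1 value.
    idx = np.abs(scaled[..., None] - np.sign(scaled)[..., None] * E2M1_GRID).argmin(-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q * scale                                   # dequantized values

x = np.random.default_rng(0).standard_normal(64).astype(np.float32)
err = np.abs(x - quantize_fp4_block_scaled(x).ravel()).mean()
print(f"mean abs quantization error: {err:.4f}")

Scaling per small block rather than per tensor is what lets a 4-bit grid track local value ranges closely enough for training to remain stable.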

Benchmark testing from Artificial Analysis placed the Nemotron 3 models highly among similarly sized models.

New ‘workout’ environment for models

As part of the Nemotron 3 launch, Nvidia will give users access to its research by releasing its papers, offering open datasets where people can access and view pre-training tokens and post-training samples, and, most importantly, launching a new NeMo Gym where customers can let their models and agents “work out.”

NeMo Gym is a reinforcement learning lab where users can run their models in simulated environments to test their post-training performance.

AWS announced a similar tool through its Nova Forge platform, intended for enterprises that want to test their newly created distilled or small models.

Briski said the post-training data samples that Nvidia plans to release are “orders of magnitude larger than any available post-training data set and are also very permissive and open.”

Nvidia said it is releasing more information about how it trains its models because developers are looking for highly intelligent, performant open models they can understand well enough to steer when needed.

“Model developers today are struggling with this difficult trifecta. They need to find models that are highly open, that are highly intelligent, and that are highly efficient,” Briski said. “Most open models force developers to make painful trade-offs between efficiency measures like token cost, latency, and throughput.”

She said developers want to know how a model was trained, where the training data came from, and how they can evaluate it.


