
Reducing the cost of inference typically takes a combination of hardware and software. A new analysis released Thursday by Nvidia details how four major inference providers are reporting 4x to 10x reductions in cost per token.
The dramatic cost reductions were achieved by pairing Nvidia’s Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI, and Together AI shows significant cost improvements in healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users.
The 4x to 10x cost reductions reported by the inference providers require combining Blackwell hardware with two other elements: an optimized software stack and a switch from proprietary to open-source models that now match frontier-level intelligence. According to the analysis, hardware improvements alone delivered 2x gains in some deployments. Reaching the larger cost reductions requires adopting lower-precision formats like NVFP4 and moving away from closed-source APIs that charge premium rates.
The economics run counter to the instinct that cutting costs means cutting infrastructure spending: reducing inference costs requires investment in high-performance infrastructure, because throughput improvements translate directly into lower per-token costs.
"Performance is what minimizes the cost of estimation," Dion Harris, senior director of HPC and AI hyperscaler solutions at Nvidia, told VentureBeat in an exclusive interview. "What we’re seeing in the estimate is that the throughput literally translates into a real dollar value and drives down the cost."
Production deployments show 4x to 10x cost reductions
Nvidia detailed the four customer deployments in a blog post, showing how the combination of Blackwell infrastructure, an optimized software stack, and open-source models delivers cost reductions across a range of industry workloads. The case studies span high-volume applications where inference economics directly determine commercial feasibility.
According to Nvidia, Sully.ai cut healthcare AI inference costs by 90% (a 10x reduction) while improving response times by 65% by switching from a proprietary model to an open-source model running on Baseten’s Blackwell-powered platform. The company returned more than 30 million minutes to physicians by automating medical coding and note-taking tasks that previously required manual data entry.
Nvidia also reported that Latitude reduced gaming inference costs for its AI Dungeon platform by 4x by running large mixture-of-experts (MoE) models on DeepInfra’s Blackwell deployment. The cost per million tokens dropped from 20 cents on Nvidia’s previous Hopper platform, to 10 cents on Blackwell, then to 5 cents after adopting Blackwell’s native NVFP4 low-precision format. Hardware alone provided a 2x improvement, but reaching 4x required the precision format conversion as well.
According to Nvidia, Sentient Foundation achieved 25% to 50% improved cost efficiency for its agentic chat platform by using Fireworks AI’s Blackwell-optimized inference stack. The platform streamlines complex multi-agent workflows and processed 5.6 million queries in a single week during its viral launch while maintaining low latency.
Nvidia said Decagon saw a 6x reduction in cost per query for AI-powered voice customer support by running its multimodal stack on Together AI’s Blackwell infrastructure. Response times remain under 400 milliseconds even when processing thousands of tokens per query, which is important for voice interactions where delays cause users to hang up or lose trust.
Technical factors driving 4x vs 10x improvement
The cost reductions ranging from 4x to 10x across deployments reflect different combinations of technical optimizations rather than just hardware differences. Three factors emerge as primary drivers: precision format adoption, model architecture choice, and software stack integration.
Precision formats show the clearest effect. Latitude’s case demonstrates this directly: moving from Hopper to Blackwell cut costs 2x through hardware improvements alone, and adopting Blackwell’s native NVFP4 low-precision format doubled that to 4x overall. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computations per GPU cycle while maintaining accuracy. The format works particularly well for MoE models, where only a subset of the model is active for each inference request.
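As a rough way to see the memory side of that effect, here is a sketch with a hypothetical model size; real NVFP4 storage adds small per-block scaling factors that this ignores:

```python
# Rough illustration of why lower-precision formats matter: weight memory
# shrinks roughly with bits per parameter, so more of the model (or more
# concurrent requests) fits on each GPU. The model size is a placeholder.

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "NVFP4": 0.5}  # ignores scaling-factor overhead

def weight_memory_gb(num_params: float, fmt: str) -> float:
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

params = 120e9  # hypothetical 120B-parameter MoE model
for fmt in ("FP16", "FP8", "NVFP4"):
    print(f"{fmt}: {weight_memory_gb(params, fmt):.0f} GB")
# FP16: 240 GB, FP8: 120 GB, NVFP4: 60 GB
```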
Model architecture matters. MoE models, which activate different expert sub-models depending on the input, benefit from Blackwell’s NVLink fabric, which enables rapid communication between experts. "Having those experts communicating on that NVLink fabric can help you reason very quickly," Harris said. Dense models, which activate all parameters for every inference, do not take advantage of this architecture as effectively.
Software stack integration creates an additional performance delta. Harris said Nvidia’s co-design approach — where Blackwell hardware, the NVL72 scale-up architecture, and software like Dynamo and TensorRT-LLM are optimized together — also makes a difference. Baseten’s deployment for Sully.ai used this integrated stack, combining NVFP4, TensorRT-LLM, and Dynamo, to achieve the 10x cost reduction. Providers running alternative frameworks like vLLM may see smaller gains.
Workload characteristics matter. Reasoning models show a particular advantage on Blackwell because they generate significantly more tokens to reach a better answer. The platform’s ability to process these extended token sequences efficiently through disaggregated serving, where context prefill and token generation are handled separately, makes reasoning workloads cost-effective.
Teams evaluating potential cost reductions should check their workload profiles against these factors. High token-generation workloads running mixture-of-experts models on the integrated Blackwell software stack are most likely to reach the 10x range. Deployments using dense models on alternative frameworks, with lower token volumes, will land closer to 4x.
What teams should test before migration
While these case studies focus on Nvidia Blackwell deployments, enterprises have several avenues to reduce inference costs. AMD’s MI300 series, Google TPUs, and specialized inference accelerators from Groq and Cerebras provide alternative architectures. Cloud providers also continue to optimize their inference services. The question is not whether Blackwell is the only choice, but whether its specific combination of hardware, software, and models best suits particular workload requirements.
Enterprises considering Blackwell-based inference should start by calculating whether their workload justifies an infrastructure change.
"Enterprises need to step back from their workload and use case and cost constraints," Shruti Koparkar, AI product marketing at Nvidia, told VentureBeat.
The deployments that achieved 6x to 10x improvements included high-volume, latency-sensitive applications processing millions of requests monthly. Teams running applications at low volume or with latency budgets greater than one second should explore software optimization or model switching before considering infrastructure upgrades.
Testing takes precedence over provider specifications. Koparkar emphasizes that providers publish throughput and latency metrics, but these represent ideal conditions.
"If it’s a highly latency-sensitive workload, they may want to test a few providers and see which one meets their minimum requirements while keeping costs down," He said. Teams should run real production workloads on multiple Blackwell providers to measure actual performance under their specific usage patterns and traffic spikes rather than relying on published benchmarks.
Latitude’s stepwise approach provides a model for this kind of assessment. The company first moved to Blackwell hardware and measured a 2x improvement, then adopted the NVFP4 format to reach a 4x overall reduction. Teams currently on Hopper or other infrastructure can test whether precision format changes and software optimizations capture meaningful savings on existing hardware before committing to a full infrastructure migration. Running open-source models on existing infrastructure can potentially cut costs in half without new hardware investment.
Provider selection requires understanding software stack differences. While many providers offer Blackwell infrastructure, their software implementations vary. Some run Nvidia’s integrated stack using Dynamo and TensorRT-LLM, while others use frameworks like vLLM. Harris acknowledges that a performance delta exists between these configurations. Teams should evaluate what each provider actually runs and how it matches their workload requirements, rather than assuming that all Blackwell deployments perform the same.
The economic equation extends beyond cost per token. Specialized inference providers such as Baseten, DeepInfra, Fireworks AI, and Together AI offer customized deployments but require managing additional vendor relationships. Managed services from AWS, Azure, or Google Cloud may have higher per-token costs but lower operational complexity. Teams should calculate the total cost including operational overhead, not just inference pricing, to determine which approach provides better economics for their specific situation.
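A back-of-the-envelope comparison, with placeholder figures standing in for a team’s own estimates, shows how the ranking can flip with volume:

```python
# Sketch: total monthly cost = token spend + operational overhead.
# All figures are placeholders, not quotes from any provider.

def monthly_total_cost(tokens_per_month: float,
                       usd_per_million_tokens: float,
                       ops_hours_per_month: float,
                       loaded_hourly_rate_usd: float) -> float:
    token_cost = tokens_per_month / 1e6 * usd_per_million_tokens
    ops_cost = ops_hours_per_month * loaded_hourly_rate_usd
    return token_cost + ops_cost

for tokens in (2e9, 200e9):  # 2B vs. 200B tokens per month
    specialized = monthly_total_cost(tokens, 0.05, 40, 150)  # cheaper tokens, more vendor management
    managed = monthly_total_cost(tokens, 0.15, 5, 150)       # pricier tokens, less operational work
    print(f"{tokens:.0e} tokens/month -> specialized ${specialized:,.0f}, managed ${managed:,.0f}")
```

At low volume the operational overhead dominates and the managed service wins; at high volume the per-token price dominates and the specialized provider wins, which is why the calculation has to use a team’s own numbers.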