The team behind continuous batching says your idle GPUs should be running inference, not sitting dark

Every GPU cluster has dead time. Training jobs end, workloads shift, and hardware goes dark while power and cooling costs continue. For neocloud operators, those empty cycles are lost margin.

The obvious solution is the spot GPU market – renting out excess capacity to whoever needs it. But spot instances mean the cloud vendor is still renting out raw hardware, and the engineers buying that capacity still have to bring their own inference stack.

FriendliAI’s answer is different: run inference directly on the unused hardware, optimize for token throughput, and split the revenue with the operator. FriendliAI was founded by researcher Byung-gon Chun, whose paper on continuous batching introduced the technique at the heart of vLLM, the open-source inference engine used in most production deployments today.

Chun spent more than a decade as a professor at Seoul National University studying the efficient execution of large-scale machine learning models. That research produced a paper called Orca, which introduced continuous batching: scheduling inference requests dynamically at each iteration rather than waiting for a fixed batch to fill before execution. The approach is now the industry standard and the main mechanism inside vLLM.
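To make the difference concrete, here is a toy Python sketch (illustrative only, not Orca's or vLLM's actual code) comparing static batching, which holds a full batch on the GPU until its longest request finishes, with continuous batching, which retires finished requests and admits waiting ones at every decoding step:

```python
from collections import deque


def static_batching(requests, batch_size):
    """Static batching: wait for a batch, run it to completion.

    Each request is (id, num_tokens); one decoding step emits one token
    per request, and the whole batch occupies the GPU until its longest
    request finishes.
    """
    steps, done = 0, []
    queue = deque(requests)
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        steps += max(tokens for _, tokens in batch)  # batch ends with its longest request
        done.extend(rid for rid, _ in batch)
    return steps, done


def continuous_batching(requests, batch_size):
    """Continuous (iteration-level) batching: finished requests leave the
    batch and waiting requests join it at every decoding step."""
    steps, done = 0, []
    queue = deque(requests)
    running = {}  # request id -> tokens still to generate
    while queue or running:
        # Fill free batch slots the moment they open up.
        while queue and len(running) < batch_size:
            rid, tokens = queue.popleft()
            running[rid] = tokens
        steps += 1  # one iteration: every running request emits one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                done.append(rid)
    return steps, done
```

With four requests of mixed lengths and a batch size of two – say lengths 1, 8, 1, 8 – the static scheduler needs 16 decoding steps while the continuous scheduler needs 10, because short requests no longer wait behind long ones.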

This week, FriendliAI is launching a new platform called InferenceSense. Just as publishers use Google AdSense to monetize unsold ad inventory, neocloud operators can use InferenceSense to fill unused GPU cycles with paid AI inference workloads and collect a share of the token revenue. The operator’s own jobs always take priority – the moment a scheduler reclaims a GPU, the inference workload is evicted.

"What we’re providing is that instead of letting GPUs sit idle, they can monetize those idle GPUs by running inference," Chun told VentureBeat.

How a Seoul National University lab built the engine inside vLLM

Chun founded FriendliAI in 2021, before most of the industry had shifted its focus from training to inference. The company’s primary product is a dedicated inference endpoint service for AI startups and enterprises running open-weight models. FriendliAI also appears as a deployment option on Hugging Face alongside Azure, AWS, and GCP, and currently supports over 500,000 open-weight models from the platform.

InferenceSense now points that inference engine at the idle-capacity problem GPU operators face between workloads.

How it works

InferenceSense runs on top of Kubernetes, which most neocloud operators already use for resource orchestration. An operator allocates a pool of GPUs to a Kubernetes cluster managed by FriendliAI, declaring which nodes are available and under what circumstances they can be reclaimed; preemption itself is handled through Kubernetes.

"We have our own orchestrator that runs on GPUs from these neocloud – or just cloud – vendors," Chun said. "We certainly leverage Kubernetes, but the software running on top is actually a highly optimized inference stack."

When a GPU is idle, InferenceSense spins up separate containers serving paid inference workloads on open-weight models, including DeepSeek, Qwen, Kimi, GLM, and MiniMax. When the operator’s scheduler needs the hardware back, the inference workload is evicted and the GPU is returned. FriendliAI says the handoff happens within seconds.

Demand is aggregated from FriendliAI’s direct customers and through inference aggregators like OpenRouter. The operator supplies the capacity; FriendliAI handles the demand pipeline, model optimization, and serving stack. There are no upfront fees and no minimum commitments. A real-time dashboard shows operators which models are running, how many tokens are being processed, and how much revenue is being earned.
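The preemption contract described above can be modeled in a few lines. The sketch below is a hypothetical illustration of the pool mechanics – the class and method names are invented for this example, not FriendliAI's actual API:

```python
class IdleGpuPool:
    """Toy model of a borrow-and-reclaim GPU pool (names are hypothetical).

    Inference workloads borrow idle GPUs; the operator's own scheduler can
    reclaim any GPU at any time, evicting the inference container.
    """

    def __init__(self, gpu_ids):
        self.idle = set(gpu_ids)  # free, available for inference
        self.inference = set()    # currently serving borrowed inference
        self.operator = set()     # reclaimed by the operator's own jobs

    def borrow_for_inference(self):
        """Move every idle GPU into the inference pool."""
        self.inference |= self.idle
        self.idle.clear()
        return sorted(self.inference)

    def operator_reclaim(self, gpu_id):
        """An operator job needs this GPU: evict inference immediately."""
        self.inference.discard(gpu_id)
        self.idle.discard(gpu_id)
        self.operator.add(gpu_id)

    def operator_release(self, gpu_id):
        """The operator job finished: the GPU returns to the idle pool."""
        self.operator.discard(gpu_id)
        self.idle.add(gpu_id)
```

The key property is that `operator_reclaim` never blocks on the inference side – the operator's jobs always win, which is the priority guarantee the platform advertises.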

Why token throughput beats raw capacity rental

Providers like CoreWeave, Lambda Labs, and RunPod operate spot GPU markets, where the cloud vendor rents its own hardware to third parties. InferenceSense instead runs on hardware the neocloud operator already owns: the operator defines which nodes participate and agrees on scheduling terms with FriendliAI in advance. The difference matters – spot markets monetize capacity, while InferenceSense monetizes tokens.

Token throughput per GPU-hour determines how much InferenceSense can actually earn during idle windows. FriendliAI claims its engine delivers two to three times the throughput of a standard vLLM deployment, though Chun says the figure varies by workload. Most competing inference stacks are built on Python-based open-source frameworks; FriendliAI’s engine is written in C++ and uses custom GPU kernels instead of Nvidia’s cuDNN library. The company has also built its own model representation layer to partition and execute models on the hardware, along with its own implementations of speculative decoding, quantization, and KV-cache management.

Because FriendliAI’s engine processes more tokens per GPU-hour than a standard vLLM stack, the company argues, operators should generate more revenue per idle cycle than they would by standing up their own inference service.
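The revenue argument is simple arithmetic. The sketch below runs the numbers under entirely hypothetical inputs – idle hours, token price, and revenue share are made up for illustration; only the 2–3x throughput multiple comes from FriendliAI's claim:

```python
def operator_revenue(idle_gpu_hours, tokens_per_gpu_hour,
                     price_per_m_tokens, revenue_share):
    """Operator's share of token revenue earned during one idle window."""
    tokens_served = idle_gpu_hours * tokens_per_gpu_hour
    gross = tokens_served / 1_000_000 * price_per_m_tokens
    return gross * revenue_share


# All numbers hypothetical: 100 idle GPU-hours, $0.50 per million tokens,
# a 50/50 revenue split, and a 2.5x engine-throughput multiple.
baseline = operator_revenue(100, 1_000_000, 0.50, 0.5)   # standard stack
optimized = operator_revenue(100, 2_500_000, 0.50, 0.5)  # optimized engine
```

Revenue scales linearly with throughput here, so a 2.5x faster engine yields 2.5x the operator payout for the same idle hours – which is why the engine, not the hardware, is the pitch.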

What AI engineers evaluating inference costs should look for

For AI engineers deciding where to run inference workloads, the neocloud vs. hyperscaler decision typically comes down to price and availability.

InferenceSense adds a new variable: if neoclouds can monetize idle capacity through inference, they have a stronger economic incentive to keep token prices competitive.

That is no reason to change infrastructure decisions today – it is still early. But engineers watching total inference cost should track whether neocloud adoption of platforms like InferenceSense puts downward pressure on API pricing for models like DeepSeek and Qwen over the next 12 months.

"When we have more efficient suppliers, overall costs will reduce," Chun said. "With InferenceSense we can contribute to making those models cheaper."


