
Looking back at that post, the underlying assumption was that the dynamics of token consumption – from the perspective of application builders – are also changing. Those changes are likely what is driving the data center buildout in the first place: they are pushing up the demand side of the equation, which forces supply to keep up. As a result, today’s post is about what is driving demand for tokens and how you, as a token consumer, should think about managing your demand.
What do we mean when we say the demand side of the equation has changed? Simply put, we are all using more tokens to process more data. The increased token volume is partly driven by increased usage, but that is far from the whole story. Yes, usage of AI applications is high, but what is more interesting – and more challenging – is that we are seeing an increasing trend in token consumption per request, not just an increase in the total number of requests processed. The driver behind this is ultimately quality.
As we’ve said many times on this blog, getting the behavior you want from an LLM is all about providing the model with the right information at the right time. If your context is wrong, you will get bad results. That makes the million-dollar question: how do you get the right information?
Search was the first solution we all turned to – first vector search, then a return to more traditional text search mechanisms. Very quickly, however, we all started feeding search results back into an LLM and evaluating how relevant they were to the problem we were solving (“reranking”). As it happens, Vic Singh, who is now a CVP at Microsoft, told us this two years ago: “If LLMs were fast enough… why not use LLMs to do more advanced similarity searches… I think that’s what people really want.”
The LLM-preprocesses-data paradigm is widespread in our systems today. In RunLLM, we pre-process data at ingestion time to organize it correctly, we read the results of text + vector searches to analyze their relevance to a query, we analyze logs and dashboards in real time with LLMs, and so on. Each of those tasks is an isolated model call made to understand whether that data should feed into later decisions. Without LLMs involved at these stages, it would be almost impossible for us to provide high-quality results to our clients. At RunLLM we have an internal joke: the solution to every problem in computer science is another layer of abstraction, and the solution to every problem in AI is another layer of LLM calls.
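To make this concrete, here’s a minimal sketch of what one of these isolated preprocessing calls can look like – a small model judging whether a retrieved chunk is relevant to a query. It assumes the OpenAI Python SDK; the prompt, model choice, and YES/NO framing are illustrative, not our production implementation.

```python
# Minimal sketch of an isolated "relevance check" LLM call, assuming the
# OpenAI Python SDK. Prompt, model choice, and threshold are illustrative.
from openai import OpenAI

client = OpenAI()

def is_relevant(query: str, chunk: str, model: str = "gpt-4.1-mini") -> bool:
    """Ask a small model whether a retrieved chunk helps answer the query."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer with a single word: YES if the document helps "
                        "answer the question, NO otherwise."},
            {"role": "user",
             "content": f"Question: {query}\n\nDocument:\n{chunk}"},
        ],
        temperature=0,
        max_tokens=1,
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")

# Usage: keep only the search results the model judges relevant.
# filtered = [c for c in search_results if is_relevant(user_query, c)]
```

Each call is cheap on its own, but there can be dozens of them per request, which is exactly where the per-request token growth comes from.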
This means that average – and perhaps more importantly, P99 – token consumption per request (and therefore per-request cost) is increasing very quickly. We’re all solving harder problems, which means we’re throwing more data at the LLM and ultimately consuming dramatically more tokens. In our view, this is one of the key drivers of increasing token demand.
Fortunately for data center builders, this trend is not going anywhere. We may get more efficient and cheaper heuristics (although we doubt it, as we discussed last week), but as LLMs become more integrated into every application and workflow, per-request token usage is only going to increase, not decrease. As a single data point, we have plenty of ideas about how we could throw more LLM calls at some of the challenges we face within a single request in RunLLM, but at the moment we are limited primarily by cost, latency, or evaluation.
If you’re inevitably going to use more tokens, it’s worth being as thoughtful as possible about how you spend them.
We are confident that token demand is increasing, and, as we discussed last week, token prices are stabilizing. Depending on how long it takes to build and power these new data centers, that means we all need to think about managing our token usage, especially as models become better and more expensive. We’ve been using many of these techniques at RunLLM for some time now, so we thought we’d share some first-hand lessons.
- Model size is your best friend. Not all models are created equal, nor are all tasks equally hard. Putting your biggest model on every job will probably maximize quality, but it will drain your budget faster than you can imagine. (We accidentally spent $63 on a single request in RunLLM last month. 😱) There are a lot of things we do – gathering queries, filtering documents, synthesizing logs – that aren’t hard but still require processing data efficiently. For simple tasks, there’s really no reason to use a state-of-the-art model – GPT-4.1 mini (one of our current favorites) or a smaller open-source equivalent will get the job done just fine. Unfortunately, we don’t have any cut-and-dried rules for when to use which model. Right now this is more of an art than a science, but task-specific evaluation frameworks will definitely help guide you in the right direction (see the routing sketch after this list).
- Be flexible with your providers. We have long believed that LLM inference is a race to the bottom. As models get better, the main question becomes who can serve you a given model as cheaply as possible – especially for open-weight models. We touched last week on the fact that switching model providers is harder than ever because applications bake in stronger and stronger assumptions about the model they’re built on, but tools like DSPy make re-tuning for a new model quicker than ever, which should alleviate some of that stress. You may not want to switch between every model provider on the market (there are so many!), but being willing to use one of a few different providers when the opportunity arises – or even using features like batch mode within a single provider – may be worth your time (see the provider sketch after this list). The biggest issue here is really security and compliance: more data subprocessors mean more data-exposure risk and harder vendor approvals. But if you’re in a domain where this is less of a concern, keeping your options open is a real lever for reducing costs.
- Do you need reasoning? Reasoning models use far more tokens than regular LLMs, and their output costs are accordingly much harder to control. It’s worth asking whether you need a reasoning model at all. For daily personal use, we default to ChatGPT 5 Thinking, but we are not currently using any reasoning models in production at RunLLM. We’ve had much better luck breaking problems down into finer-grained steps, using regular Python for orchestration and tool calling, and choosing the right model for the right task (see above, and the orchestration sketch after this list). Interestingly, this mirrors some of the reasoning-based task-planning workflows we see in our daily use, but with much stronger guardrails. Of course, we’re not tackling problems as open-ended as a consumer app like ChatGPT, so we get to work with narrow scopes and very strong guardrails. But for many workflow- and task-automation-oriented applications, this approach may be more effective than you think.
- Don’t jump straight into fine-tuning or training. RL is a hot topic at the moment. As we mentioned last week, Cursor’s new custom autocomplete model has driven a resurgence of enthusiasm for fine-tuning models for custom tasks. It’s certainly tempting: you take a small model, feed it lots of data, and voilà – cheap, fast inference. Unfortunately, the reality is not so simple. For one thing, RL and post-training are hard. The promise of the recent wave of RL-environment startups is that they will remove this complexity, but we’re not convinced that’s a complete solution. The hard part isn’t running an algorithm to update the weights – it’s formulating the problem in a way that will actually get the results you want, and gathering enough data that fits that problem formulation. (This is not a new challenge in RL.) What has been lost in the hype around Cursor is that they collected a huge amount of data well suited to RL from natural use of the product – every tab-completion suggestion is either accepted or rejected, which is a very RL-friendly framing (see the feedback sketch after this list). In contrast, we have over 1MM question-answer pairs at RunLLM, but only a small fraction of them have explicit feedback, and only a fraction of those have actionable feedback – for example, we get most of our negative feedback on “I don’t know” answers, which are often the right response because there simply isn’t enough data to answer. If you are in a domain where you have enough data, plus the expertise and resources to make post-training work, it is definitely viable from a unit-margin perspective. But it is not the panacea everyone is claiming.
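To make the model-sizing point above a bit more concrete, here is a minimal routing sketch. The task names, tiers, and model choices are illustrative assumptions – in practice, your task-specific evals should decide where each task lands.

```python
# Minimal sketch of routing tasks to models by difficulty. The tiers, task
# names, and model choices are illustrative assumptions, not fixed rules.
CHEAP_MODEL = "gpt-4.1-mini"    # simple, high-volume preprocessing
FRONTIER_MODEL = "gpt-4.1"      # final, user-facing answer generation

MODEL_BY_TASK = {
    "filter_document": CHEAP_MODEL,
    "summarize_logs": CHEAP_MODEL,
    "rewrite_query": CHEAP_MODEL,
    "generate_answer": FRONTIER_MODEL,
}

def pick_model(task: str) -> str:
    # Default to the cheap model; only explicitly named tasks get the
    # frontier model. An eval harness should validate each routing choice.
    return MODEL_BY_TASK.get(task, CHEAP_MODEL)
```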
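On provider flexibility, one low-effort way to keep your options open is to put provider choice behind a single seam. The sketch below assumes providers that expose OpenAI-compatible endpoints (which many open-weight hosts do); the provider names, URLs, and environment variables are placeholders, not a recommendation of any specific vendor.

```python
# Sketch of keeping provider choice behind a single seam, assuming
# OpenAI-compatible endpoints. Provider names, URLs, and env vars are
# illustrative assumptions.
import os
from openai import OpenAI

PROVIDER_CONFIG = {
    "openai": {"base_url": None, "key_env": "OPENAI_API_KEY"},
    "open_weights_host": {
        "base_url": os.environ.get("ALT_PROVIDER_BASE_URL"),
        "key_env": "ALT_PROVIDER_API_KEY",
    },
}

def complete(prompt: str, provider: str = "openai",
             model: str = "gpt-4.1-mini") -> str:
    # Build the client per call so switching providers is a one-argument change.
    cfg = PROVIDER_CONFIG[provider]
    client = OpenAI(base_url=cfg["base_url"],
                    api_key=os.environ.get(cfg["key_env"]))
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```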
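On skipping reasoning models, here is a rough sketch of what “break the problem into finer-grained steps and orchestrate in plain Python” looks like. Every helper is a hypothetical placeholder that would wrap a small LLM call or a search backend in a real system; this is not RunLLM’s actual pipeline.

```python
# Sketch of orchestrating a request as explicit, plain-Python steps instead of
# one large reasoning-model call. The helpers are hypothetical placeholders.

def rewrite_query(query: str) -> str:
    return query          # placeholder: small model call in practice

def search(query: str) -> list[str]:
    return []             # placeholder: vector + text search in practice

def is_relevant(query: str, chunk: str) -> bool:
    return True           # placeholder: isolated LLM relevance check

def generate_answer(query: str, context: list[str]) -> str:
    return "..."          # placeholder: single frontier-model call

def answer_question(query: str) -> str:
    rewritten = rewrite_query(query)
    candidates = search(rewritten)
    relevant = [c for c in candidates if is_relevant(rewritten, c)]
    if not relevant:
        return "I don't know."    # guardrail: refuse rather than guess
    return generate_answer(query, relevant)
```

The control flow lives in ordinary code, so each step can be evaluated, cached, and priced on its own rather than buried inside one long reasoning trace.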
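Finally, on fine-tuning: what makes accept/reject data like Cursor’s so RL-friendly is that every model output carries an explicit signal. The sketch below shows the kind of record and preference-pair shape that signal enables – the field names are illustrative, not anyone’s actual schema.

```python
# Sketch of the kind of feedback record that makes post-training tractable:
# an explicit accept/reject signal on every output. Field names are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CompletionFeedback:
    prompt: str        # what the model saw
    completion: str    # what the model produced
    accepted: bool     # explicit user signal (e.g., suggestion accepted vs. dismissed)

def to_preference_pairs(records: list[CompletionFeedback]) -> list[dict]:
    """Group accepted/rejected completions for the same prompt into preference
    pairs, the rough shape preference-based post-training typically expects."""
    by_prompt: dict[str, dict[str, list[str]]] = {}
    for r in records:
        bucket = by_prompt.setdefault(r.prompt, {"chosen": [], "rejected": []})
        bucket["chosen" if r.accepted else "rejected"].append(r.completion)
    return [
        {"prompt": p, "chosen": b["chosen"][0], "rejected": b["rejected"][0]}
        for p, b in by_prompt.items()
        if b["chosen"] and b["rejected"]
    ]
```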
What’s interesting about these dynamics at the moment is that we are all focused on costs, but not as much on our pricing power. Of course, any business will always want to reduce its COGS – the more efficient you are, the better your business will scale. This is probably the right place to be, given the fierce competitive dynamics in many AI markets. At the same time, while we are working on technical solutions to reduce COGS, we are also conscious of the fact that as AI applications mature and ROI becomes more apparent, we are likely to see a corresponding increase in pricing power. The best applications will likely command a significant premium. This certainly won’t apply in every market – only in those markets where quality matters most.
Speculation aside, it’s clear that the economics of AI are changing faster than we expected. The sudden flattening of per-token costs coincides with more mature applications requiring more tokens per request – a double whammy for costs. There will be other solutions (technological and otherwise) that shift these dynamics, but for the foreseeable future, we will all be keeping a close eye on our OpenAI bills.