We Upgraded to a Frontier Model and Our Costs Went Down

Last week we wrote about feeding terabytes of CI logs to an LLM. Most of the questions on Hacker News weren't about the logs, though. They were about the agent: which models we use, how they coordinate, and how much it costs.

Today we run Opus 4.6 and pay less than when we ran everything on Sonnet 4.0.

This is mostly because of what Opus doesn't do: 80% of failures never reach it, and when they do, it never reads the log lines itself.

The architecture looks like this:

Pipeline: a Haiku triager handles 80% of failures cheaply, escalating the remaining 20% to an Opus orchestrator that delegates to Haiku workers.

Let a cheap agent decide whether an expensive agent is needed

Last week we analyzed nearly 4,000 CI failures. 818 were new problems. The other 3,187 were known problems recurring: a flaky test, an infrastructure hiccup, a network glitch we'd already identified.

There's no point waking an expensive model when in 80% of cases the answer is "it's a duplicate". Unfortunately, we can't detect duplicates with certainty: the same job can fail multiple times for completely different reasons, so you really do have to look at the logs to know whether you've seen a failure before.

We initially used Sonnet to balance cost and quality. It worked, but it was the worst of both worlds: still expensive, and the results weren't as good as a frontier model's.

We switched to a triage pattern: a Haiku agent with one very specific, narrow job. Is this issue already tracked or not? If it is, stop there. If not, escalate to Opus.

Detecting duplicates with Haiku proved a bit challenging. We needed to make the task as easy as possible, so we indexed the error messages from previous failures and gave Haiku two search tools: exact match for known error snippets, and semantic search (pgvector) for similar-but-not-identical errors. RAG may be dead, but semantic search is pretty neat: `operator does not exist: bigint = character varying` and `migration type mismatch on installation_id` are different strings, but the root cause is the same, and semantic search brings that to the surface.
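
A minimal sketch of those two tools, assuming a Postgres table `known_failures(issue_id, error_snippet, embedding)` and the pgvector Python adapter. The schema, connection string, and top-k search are illustrative, not our production setup:

```python
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=ci")  # hypothetical database
register_vector(conn)  # lets psycopg send numpy arrays as pgvector values

def exact_match(snippet: str):
    # Fast path: this exact error string has been seen before.
    return conn.execute(
        "SELECT issue_id FROM known_failures WHERE error_snippet = %s",
        (snippet,),
    ).fetchall()

def semantic_match(embedding, k: int = 5):
    # embedding: numpy array from whatever embedding model you use.
    # Cosine-distance search catches errors that differ in wording
    # but share a root cause.
    return conn.execute(
        "SELECT issue_id, error_snippet, embedding <=> %s AS dist "
        "FROM known_failures ORDER BY dist LIMIT %s",
        (embedding, k),
    ).fetchall()
```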

The Haiku agent reads the logs, searches for the error messages, tries to match known failures, and makes the call. When in doubt, it escalates. A false positive costs a little money; a false negative means we missed something real.
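
The routing rule itself is deliberately simple; something like this, with made-up field names and threshold:

```python
def route(verdict: str, confidence: float) -> str:
    # Doubt is treated as "new": escalating a duplicate wastes a little
    # money, but burying a real issue as a duplicate is far worse.
    if verdict == "duplicate" and confidence >= 0.9:
        return "attach_to_known_issue"  # the ~80% cheap path
    return "escalate_to_opus"           # the ~20% expensive path
```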

4 out of 5 failures never reach Opus. A triage match costs about 25 times less than a full investigation.

Let the agent pull context, don’t push it

Many people asked how we handle logs with 200K+ lines. We don't push them into the prompt. We give the agent a SQL interface to ClickHouse and let it query for exactly what it needs.

The reason isn't just token cost. If you hand an agent a pre-selected set of log lines, you've decided what's relevant before you know what the problem actually is. The agent anchors on whatever you give it, and if the real cause lies elsewhere, you've made it harder to find. It's the same reason you don't open a debugging session with "I think the problem is in this file": you've committed to a bias before the investigation has begun.

We wrote about the SQL setup in detail last week, but the short version: there's one table with the raw data (`github_logs`, one row per log line) and a set of materialized views with pre-aggregated data: workflow runs, job durations, failure rates by outcome. Most investigations start with a materialized view to narrow down the cause, then drill into the raw logs if needed.
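
To make that concrete, here is a sketch of the narrow-then-drill flow using the clickhouse-connect client. Only `github_logs` comes from the post; `job_runs_daily` and the column names are invented for the example:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Step 1: narrow down with a pre-aggregated view.
failing_jobs = client.query("""
    SELECT job_name,
           countIf(conclusion = 'failure') AS failures,
           count() AS runs
    FROM job_runs_daily
    WHERE day >= today() - 14
    GROUP BY job_name
    ORDER BY failures DESC
    LIMIT 10
""").result_rows

# Step 2: drill into raw log lines only for the top suspect.
error_lines = client.query(
    """
    SELECT line FROM github_logs
    WHERE job_name = %(job)s AND line ILIKE '%%error%%'
    ORDER BY ts DESC
    LIMIT 100
    """,
    parameters={"job": failing_jobs[0][0]},
).result_rows
```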

We do not tell the agent which table to query. Instead, we use the tool responses themselves to guide it progressively. If a query returns too many rows, we truncate the result and suggest a more specific materialized view. If the logs haven't been ingested yet, we point it at the GitHub CLI. The agent figures out what it needs without us having to anticipate every path in advance.
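
A sketch of that response shaping. The row cap and the hint wording are illustrative; the point is that the tool result itself teaches the agent the cheaper route:

```python
MAX_ROWS = 200  # hypothetical cap

def run_query(client, sql: str) -> str:
    rows = client.query(sql).result_rows
    body = "\n".join(str(r) for r in rows[:MAX_ROWS])
    if len(rows) > MAX_ROWS:
        # Steer instead of failing: truncate and suggest a narrower view.
        body += (
            f"\n[truncated: {len(rows)} rows total. Try an aggregate view "
            "such as job_runs_daily before scanning github_logs.]"
        )
    return body
```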

Expensive agents plan, cheap agents act

Opus looks at what failed, forms a hypothesis, and spins up Haiku sub-agents to do the actual digging. Each sub-agent gets a precise brief from Opus: exactly what to look for, where to look, and what to return. Sub-agents are limited to one level of depth; they cannot spawn sub-agents of their own. Unbounded fan-out is how costs explode.
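
A minimal sketch of that one-level fan-out with the Anthropic Python SDK. The model ID is a placeholder matching the post's model choice, and the real orchestrator also handles tools, retries, and summarization:

```python
import anthropic

client = anthropic.Anthropic()

def run_subagent(brief: str) -> str:
    # Each sub-agent is a fresh, single-purpose Haiku conversation.
    # Its context is discarded once the focused summary comes back.
    reply = client.messages.create(
        model="claude-haiku-4-5",  # placeholder model ID
        max_tokens=1024,
        messages=[{"role": "user", "content": brief}],
    )
    return reply.content[0].text

def orchestrate(briefs: list[str]) -> list[str]:
    # Fan-out is capped at one level: only the orchestrator spawns agents,
    # and run_subagent cannot recurse, so cost grows linearly with briefs.
    return [run_subagent(b) for b in briefs]
```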

A few weeks ago, three Storybook CI jobs failed on the same commit, all crashing during `pnpm install`.

The Opus orchestrator worked through the Storybook failures in two rounds: delegating log fetches to Haiku sub-agents, querying the failure history, and verifying what had changed.

Opus started by asking a sub-agent to fetch the error messages from the failed `pnpm install` step. ClickHouse didn't have the logs yet, so the sub-agent fell back to the GitHub CLI.
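
That fallback looks roughly like this; `gh run view --log-failed` is the actual GitHub CLI command, while the wrapper and tail length are our illustration:

```python
import subprocess

def fetch_failed_logs(run_id: str, tail: int = 100) -> str:
    # Pull logs for the failed steps of a run straight from GitHub.
    out = subprocess.run(
        ["gh", "run", "view", run_id, "--log-failed"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The brief asked for the last 50-100 lines, so return only the tail.
    return "\n".join(out.splitlines()[-tail:])
```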

Sub-Agent #1 prompt:

Get the CI logs for this run. Return the exact error message from the pnpm install step, plus the full error output, especially the last 50-100 lines.

Result: `gyp ERR! not found: make`. `re2@1.23.0` couldn't be compiled because `make` wasn't on the runner.

Opus searched the existing insights (no matches), then queried ClickHouse for the failure trend over the last 14 days:

Feb 23: 0.2% failure rate
Feb 24: 1.1%
Feb 25: 8.0%  <- inflection point
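
The trend above comes from a query shaped roughly like this; `job_runs_daily` and its columns are our stand-in names:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Daily failure percentage for the affected jobs over two weeks.
trend = client.query("""
    SELECT day,
           round(100 * countIf(conclusion = 'failure') / count(), 1) AS fail_pct
    FROM job_runs_daily
    WHERE job_name LIKE 'storybook%'
      AND day >= today() - 14
    GROUP BY day
    ORDER BY day
""").result_rows
```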

Something clearly changed on February 25. Opus spawned Sub-Agent #2:

Find out what changed around February 24-25. The failure rate jumped from 0.2% to 8%. The error is `gyp ERR! not found: make`. Run git log on the workflow file and on package.json for that window.

Build dependencies had been removed during an unrelated migration. That was correct for the migration, but `re2` still needs `make` to compile natively. Opus spawned Sub-Agent #3 to verify the current workflow state, then created an insight with the root cause and the fix.

The orchestrator never read the logs, the git history, or a single line of code itself.

Some things to note:

Cost. Haiku handles ~65% of all input tokens but accounts for only ~36% of our LLM spend. The expensive model thinks; the cheap model reads. Without the model hierarchy, the daily bill would more than double.

Adaptive planning. Opus starts with a hypothesis, but each sub-agent's results shape the next step. In this investigation it found the error, searched the history, then asked what had changed; each round informed the next. More than a third of our investigations run multiple rounds, and new issues need nearly twice the investigation depth of known ones.

Context cleanliness. The orchestrator's context stays clean: structured summaries from sub-agents, not raw log output. Each sub-agent starts with a clean slate, and when it finishes, its context is discarded. Tool-call output accumulates fast, and stale context early in a session degrades decisions later in it.

Guided search. "Return the exact error message from the pnpm install step" is a very different prompt from "analyze these logs". Opus decides what to look for; Haiku finds it. Haiku's input-to-output token ratio is 86:1 (it reads a lot and returns focused excerpts), while the orchestrator's is about 50:1 (it synthesizes and decides). Haiku absorbs the data so Opus doesn't have to.

Six months ago this wasn't possible

Six months ago we were on Sonnet 4.0. It struggled to write correct ClickHouse queries: wrong tables, missing filters, reading far too much data. Haiku 4.0 wasn't useful for anything beyond yes/no classification.

Today, Opus 4.6 can plan investigations and write precise sub-agent briefs, and Haiku 4.5 can execute them, because each task's scope is tight enough for a faster, cheaper model to handle.

Upgrading to a frontier model reduced our costs.

The pattern generalizes

We built this for CI logs, but the pattern applies to anything with high event volume: security logs, IoT telemetry, financial data. Most events are noise or repeats, and the expensive model should only look at the ones that aren't.

There is a fourth layer we haven't covered: re-evaluation. The system periodically checks whether the conclusions it has drawn still hold: retiring stale insights, catching false positives, verifying that fixes actually worked. That's a post in itself.

We're still tuning where the sub-agent threshold sits. Sometimes spawning a sub-agent costs more than doing the work inline, because the setup overhead outweighs the savings.

The hardest part wasn't making the agent smarter. It was building the layers that keep it from running when it shouldn't.


