7 November 2025
Recalls cost major automakers millions of dollars per year. To mitigate this, our company created an analytics department focused solely on classifying warranty claims into actionable problems.
For decades, this team has relied on SQL queries to classify warranty data. But vehicles—and the language used to describe them—have evolved. SQL struggles with semantics, negation, and contextual nuances. Here’s a hypothetical example of a claim we might see in the wild:
“Customer reported oil on driveway, thought engine was leaking. Detailed inspection found no engine leaks. Oil spill discovered during last oil change. Oil on subframe dripping onto ground. Subframe cleaned, verified no leaks from engine or drain plug. Customer advised.”
An oversimplified SQL query that might attempt to capture this scenario is:
SELECT
    claim_id,
    claim_text,
    CASE
        WHEN (
            (LOWER(claim_text) LIKE '%leak%'
                OR LOWER(claim_text) LIKE '%leaking%'
                OR LOWER(claim_text) LIKE '%seep%')
            AND
            (LOWER(claim_text) LIKE '%oil%'
                OR LOWER(claim_text) LIKE '%fluid%')
            AND LOWER(claim_text) NOT LIKE '%no leak%'
            AND LOWER(claim_text) NOT LIKE '%not leaking%'
        )
        THEN 1
        ELSE 0
    END AS is_leak
FROM warranty_claims;
What this example shows is that the oil came from a spill during a previous service – not from the vehicle. Yet this query flags it as a leak. In production, these types of queries grow into hundreds—if not thousands—of similar clauses. Over the years, the team created thousands of classification buckets. Many of these legacy buckets still misclassify claims today—creating unnecessary work for analysts and slowing down the identification of new issues.
Classification project
In 2023, the company launched a major initiative to automate warranty classification using supervised models. Here’s how it happened:
Data collection: The first challenge was to establish the ground truth. Each team member had a different mental model of how claims should be classified. After months of discussion, the team finally settled on a set of core “symptoms” to classify warranty claims. Then came the hard part: manually labeling thousands of complex claims per symptom – a task that only domain experts could handle. After several months, we had labeled only half the symptoms.
Preprocessing: The raw warranty text is cluttered—full of acronyms, error codes, and multilingual input.
“Cust report coming up with p0420. Tech found A/C compressor clutch squealing at idle. Map sensor checked, readings normal. Replaced cat converter per TSB. DTC cleared, road test OK.”
Translation: the customer reported a check engine light (P0420). The technician found an unrelated A/C compressor squeal, confirmed the MAP sensor readings were normal, and replaced the catalytic converter per the technical service bulletin. Problem solved.
We created a 9-step preprocessing pipeline: text sanitization, concatenation, tokenization, acronym expansion, stop word removal, spell checking, service bulletin extraction, diagnostic code parsing, and translation. It took another 6 months.
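To make this concrete, here is a minimal sketch of a few of those steps in Python; the acronym map, regex patterns, and helper name are illustrative stand-ins, not our production pipeline.

```python
import re

# Illustrative sketch only: the real pipeline has nine steps, larger acronym
# dictionaries, spell checking, stop word removal, and translation.
ACRONYMS = {"cust": "customer", "tech": "technician", "cat": "catalytic converter"}
DTC_PATTERN = re.compile(r"\b[pbcu]\d{4}\b")        # diagnostic trouble codes, e.g. p0420
TSB_PATTERN = re.compile(r"\btsb[- ]?\d*\b")        # technical service bulletin references

def preprocess(claim_text: str) -> dict:
    """Sanitize one claim and extract structured fields (hypothetical helper)."""
    text = claim_text.lower().strip()                # text sanitization
    dtc_codes = DTC_PATTERN.findall(text)            # diagnostic code parsing
    bulletins = TSB_PATTERN.findall(text)            # service bulletin extraction
    text = TSB_PATTERN.sub(" ", DTC_PATTERN.sub(" ", text))
    tokens = [ACRONYMS.get(tok, tok)                 # tokenization + acronym expansion
              for tok in re.findall(r"[a-z0-9/]+", text)]
    return {"tokens": tokens, "dtc_codes": dtc_codes, "bulletins": bulletins}

print(preprocess("Cust report coming up with p0420. Replaced cat converter per TSB."))
```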
Fun fact: translating French and Spanish claims into German was the first change that improved technical accuracy – an unexpected benefit of Germany’s automotive dominance.
Modeling: We tried several vectorization and classification approaches. Our data was highly imbalanced and skewed toward negative cases. TF-IDF with 1-gram features coupled with XGBoost consistently came out on top. See the PR curves below [1].
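A condensed sketch of that winning setup, assuming a DataFrame with a preprocessed `claim_text` column and a binary `is_leak` label; the file path and hyperparameters are placeholders, not our tuned production values.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hypothetical input: one row per claim with preprocessed text and a 0/1 label.
df = pd.read_parquet("labeled_claims_leak.parquet")          # placeholder path

X_train, X_test, y_train, y_test = train_test_split(
    df["claim_text"], df["is_leak"],
    test_size=0.2, stratify=df["is_leak"], random_state=42,
)

vectorizer = TfidfVectorizer(ngram_range=(1, 1), min_df=5)   # 1-gram TF-IDF features
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# scale_pos_weight compensates for the heavy skew toward negative cases.
model = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
    eval_metric="aucpr",
)
model.fit(X_train_vec, y_train)

scores = model.predict_proba(X_test_vec)[:, 1]
print("PR AUC:", average_precision_score(y_test, scores))
```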
Reaching production was another challenge. Migrating everything to the cloud, building the UI for our analytics team, onboarding vendors, and coordinating with IT – the project spanned several years. Our plan was to deploy the first 10 models, gather real-world feedback, and resume labeling for the remaining symptoms. But once the initial batch of classifiers went live, the project’s priorities changed: the scope expanded to deploying all the classifiers, while the team that helped with the first annotations moved on to new initiatives.
We suddenly faced a data shortage. How do you deploy a model without training data? Even with renewed labeling efforts, it would still have taken several months to label the new datasets. We needed a faster, more flexible solution.
What about large language models?
We had actually tried some one-shot prompting with GPT-3.5 at the beginning of this project – but the results were disappointing: low accuracy, high latency, and prohibitive cost. Fast forward two years, and the landscape had changed fundamentally. Modern models were faster, cheaper, and showed strong few-shot performance across domains. This raised a question: could we get within 5% of our purpose-built classifier?
To find out, we benchmarked 6 frontier models against our baseline on 5 labeled datasets, covering broad symptoms such as Leak and Noise as well as narrow ones like Cut-chip. Given that our data was skewed toward negative cases, we chose PR AUC as the primary metric, supported by the Matthews correlation coefficient (MCC) and F1. Preliminary results? XGBoost was still ahead by ~15% on average, especially on the hardest tasks, although the LLMs showed promise across a wider range of categories. (See chart below.)
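All three metrics are available in scikit-learn; the snippet below is a minimal sketch of how they could be computed from a model's confidence scores (the toy arrays exist only to make it runnable and are not our data).

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, matthews_corrcoef

# y_true: ground-truth labels; y_score: a model's confidence for the positive class.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.15, 0.40, 0.85, 0.25, 0.60, 0.90])
y_pred = (y_score >= 0.5).astype(int)

print("PR AUC:", average_precision_score(y_true, y_score))   # primary metric
print("MCC:   ", matthews_corrcoef(y_true, y_pred))          # supporting metric
print("F1:    ", f1_score(y_true, y_pred))                   # supporting metric
```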

Closing the gap
When we factored in cost, Nova Lite was the clear value pick – third-best PR AUC, yet the second-cheapest model [2]. So we took it forward and started refining our prompts around it.
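As a starting point, a few-shot prompt for a symptom like Leak might look something like the sketch below; the wording and examples are illustrative, not the prompts we actually ran.

```python
# Illustrative few-shot prompt for a "Leak" symptom classifier; the real
# prompts are different and went through several rounds of refinement.
FEW_SHOT_PROMPT = """You are classifying automotive warranty claims.
Decide whether the claim describes a fluid leak originating from the vehicle.
Answer with exactly one word: YES or NO.

Claim: "Oil pan gasket seeping, oil film on lower engine. Resealed pan."
Answer: YES

Claim: "Customer reported oil on driveway. Inspection found no engine leaks;
oil was spilled during the last oil change. Subframe cleaned, customer advised."
Answer: NO

Claim: "{claim_text}"
Answer:"""

print(FEW_SHOT_PROMPT.format(claim_text="Coolant smell, trace of coolant at water pump weep hole."))
```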
Our prompt tuning combined evaluation with reasoning. For each symptom, we ran Nova Lite on a stratified sample of labeled data, capturing two outputs: the prediction and its reasoning. We compared the predictions against the ground truth, analyzed where the prompt failed, and used the reasoning traces to identify the gaps. The failure cases and the current prompt were then passed to a larger LLM to generate refinements. Each new version was re-evaluated several times to confirm accuracy and filter out noise. See the round-by-round progress [3].
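The loop itself is simple enough to sketch, under the caveat that `call_nova_lite` and `call_refiner_llm` below are placeholders for whatever inference client is in use, and the data layout is hypothetical.

```python
def call_nova_lite(prompt: str, claim_text: str) -> tuple[int, str]:
    """Placeholder: send the prompt plus claim to Nova Lite via your inference
    client and return (prediction, reasoning)."""
    raise NotImplementedError

def call_refiner_llm(prompt: str, failures: list[dict]) -> str:
    """Placeholder: ask a larger LLM to rewrite the prompt given the failure cases."""
    raise NotImplementedError

def refine_prompt(prompt: str, labeled_sample: list[dict], rounds: int = 6) -> str:
    """Run the evaluate-analyze-refine loop on a stratified labeled sample."""
    for _ in range(rounds):
        failures = []
        for example in labeled_sample:
            pred, reasoning = call_nova_lite(prompt, example["claim_text"])
            if pred != example["label"]:                 # compare against ground truth
                failures.append({"claim": example["claim_text"],
                                 "expected": example["label"],
                                 "got": pred,
                                 "reasoning": reasoning})
        if not failures:
            break
        # Hand the current prompt and its failure cases to a larger LLM, then
        # re-evaluate the rewritten prompt in the next round. In practice each
        # version is scored several times to average out run-to-run noise.
        prompt = call_refiner_llm(prompt, failures)
    return prompt
```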
After 6 rounds of refinement, Nova Lite closed the performance gap, matching or slightly surpassing our supervised XGBoost model in 4 out of 5 categories (Cut-chip, Distort-misalign, Leak, and Noise). The biggest leap was Cut-chip, which improved by 35 points and pulled ahead of our baseline. Broad categories such as Noise and Leak started strong and saw only modest gains. Surface appearance remains the exception – still 12 points behind, which suggests it may require a different modeling approach altogether.

So what?
Over several years, we built a supervised pipeline that worked. In 6 rounds of prompting, we matched it. That’s the headline, but it’s not the point. The real change is that classification is no longer gated by data availability, annotation cycles, or pipeline engineering. The hurdle moved from collecting examples to writing instructions. This is not a minor improvement; this is a different way of building a classifier.
Supervised models still make sense when you have stable targets and millions of labeled samples. But in domains where the taxonomy keeps shifting, data is sparse, or requirements change faster than you can annotate, LLMs turn an impossible backlog into a fast iteration loop.
We didn’t just change a model. We changed a process.
[1] PR curves exploring different vectorization methods.

[2] Price vs Performance Table
| Model | Cost per 1M tokens | PR AUC |
|---|---|---|
| Claude Sonnet 4.5 | $3.00 | 0.722 |
| Claude Haiku 4.5 | $1.00 | 0.717 |
| Nova Lite | $0.06 | 0.716 |
| Llama 3.3 70B | $0.72 | 0.712 |
| Llama 4 Maverick 17B | $0.24 | 0.709 |
| Nova Micro | $0.04 | 0.600 |
| Llama 4 Scout 17B | $0.17 | 0.575 |

All LLM prices shown are on-demand; batch pricing is ~50% lower.
[3] Round-by-round prompt refinement progress.

