
Enterprise AI teams face a dilemma: The best models today may not be the best models a year from now. MassMutual’s answer is to stop making long-term bets – and build an infrastructure that can swap out models as the market changes.
“The world of AI today is extremely dynamic,” MassMutual CIO Sears Merritt explains in a new VB Beyond the Pilot podcast. “We wanted to make sure we were ready to ride that wave of mobility.”
This strategy appears to be largely successful. MassMutual has measured an increase of nearly 30% in developer productivity, while AI-powered contact center workflows have reduced resolution times from 10 minutes to one and reduced costs from dollars to cents.
But the broader lesson for IT leaders may be less about the results and more about how thoughtfully the company is building its AI infrastructure and putting users at the center.
Maintaining optionality for tomorrow’s possibilities
MassMutual works with leading vendors, but deliberately keeps those relationships limited. “Those relationships have been limited so that we maintain optionality for best-of-breed devices as things mature in this space, and at some point, become settled and stable,” Merritt said.
That philosophy extends to open source. Merritt says his team is focusing “100%” on open-source tools, and sees the technology playing a big role in how MassMutual (and similar companies) use AI.
“We will certainly need leading models and leading capabilities to make what is impossible today possible tomorrow,” he said.
Measuring results from the beginning
MassMutual’s AI efforts fall into two broad categories.
The first focuses on enablement: putting productivity-boosting tools like Copilot and virtual assistants into the hands of all employees. The second involves what Merritt describes as “depth and focus” initiatives, where teams target a specific workflow or business process that will have a strong impact on advisors, policyholders or employees.
Instead of focusing on adoption metrics, these projects start with predefined success criteria. “Everything we do is measured,” Merritt said. “There’s always a success metric that we define beforehand to determine whether we’re going to scale up some of these things or not.”
The company is intentionally encouraging experimentation, giving employees access to best-in-class models, “token-consumable workflows” and other potential capabilities so they can assess the relative benefits of “simpler, lower cost” large language models (LLMs).
At the same time, MassMutual is collecting increasingly detailed analytics around usage patterns, developer workflow, model performance, and costs. The goal is to reduce spend while building the operational intelligence to ultimately move workloads to the right model based on cost, response quality and user experience.
Those insights will ultimately drive optimization decisions around model routing, prompt selection, response time, and infrastructure design.
“We’re gaining access to analytics that let us look at usage patterns, developer workflows in a very granular way and begin to understand who is using what, when and for what types of tasks,” Merritt said.
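To make the routing idea concrete, here is a minimal sketch of picking the cheapest model that still clears a quality and latency bar. It is an illustration only, not MassMutual's implementation: the `ModelProfile` fields, the `pick_model` function, the model names and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model stats gathered from usage analytics."""
    name: str
    cost_per_1k_tokens: float  # assumed unit: USD
    quality: float             # assumed scale: 0.0 (worst) to 1.0 (best)
    latency_s: float           # typical seconds to respond


def pick_model(candidates, min_quality, max_latency_s):
    """Return the cheapest candidate that meets both thresholds.

    If no candidate qualifies, fall back to the highest-quality one.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    eligible = [
        m for m in candidates
        if m.quality >= min_quality and m.latency_s <= max_latency_s
    ]
    if not eligible:
        return max(candidates, key=lambda m: m.quality)
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


# Made-up profiles for illustration only.
MODELS = [
    ModelProfile("fast-small", cost_per_1k_tokens=0.10, quality=0.70, latency_s=0.5),
    ModelProfile("slow-large", cost_per_1k_tokens=1.00, quality=0.95, latency_s=3.0),
]

routine = pick_model(MODELS, min_quality=0.6, max_latency_s=1.0)
high_stakes = pick_model(MODELS, min_quality=0.9, max_latency_s=5.0)
```

In this sketch a routine task goes to the cheaper model, while a task with a higher quality bar goes to the more expensive one. Because the thresholds are set per workload, a model can be swapped in or out by editing the profile list alone.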
Why MassMutual sometimes chooses a more expensive model
Another interesting aspect of MassMutual’s approach is how it evaluates AI quality. Rather than focusing exclusively on benchmarks or token costs, the company uses what Merritt describes as a “trust score” framework.
This process combines user feedback with operational metrics to understand how employees perceive AI-generated responses and whether those responses actually improve results.
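The article does not say how the trust score is calculated. As a rough sketch of the general idea, one could blend a user-feedback signal with an operational outcome signal. The `trust_score` function, the 1-to-5 rating scale, the "resolved" flag and the equal default weighting are all assumptions made for illustration.

```python
from statistics import mean


def trust_score(ratings, resolved, feedback_weight=0.5):
    """Blend user feedback with an operational outcome into a 0..1 score.

    ratings: user ratings on an assumed 1-5 scale.
    resolved: booleans, True when the AI response led to a resolved case.
    feedback_weight: share of the score taken from user feedback.
    """
    if not ratings or not resolved:
        raise ValueError("ratings and resolved must not be empty")
    if not 0.0 <= feedback_weight <= 1.0:
        raise ValueError("feedback_weight must be between 0 and 1")
    feedback = (mean(ratings) - 1) / 4        # rescale 1..5 to 0..1
    outcome = sum(resolved) / len(resolved)   # fraction of resolved cases
    return feedback_weight * feedback + (1 - feedback_weight) * outcome


# Made-up data for two hypothetical models.
fast_model = trust_score(ratings=[3, 2, 4], resolved=[True, False, False])
slow_model = trust_score(ratings=[5, 4, 5], resolved=[True, True, True])
```

A score like this lets a slower, costlier model win the comparison when users rate its answers higher and those answers resolve more cases.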
Rebuilding the contact center tested that framework. During development, employees were given access to two different LLMs. One produced responses in almost real time, but the quality was noisier. The other, more expensive option took several extra seconds to respond but consistently provided high-quality answers.
Conventional wisdom and business momentum might suggest that users would prefer the former, but they chose quality over speed. Merritt’s team asked users about the quality of the responses, their preferred model and their overall thoughts on the experience.
Most of the time, users said they wanted the more expensive model: “We’re willing to wait, but the difference in quality is so great that the two extra seconds are really worth it to us.”
That feedback ultimately determined which model MassMutual deployed.
“We incorporated that experience into the decision making, and it led us to say that on a relative basis, the costs were insignificant, so we’re going to use more complex models,” Merritt said.
Listen to the full podcast to learn more about:
- Why Mythos has “completely changed” the cybersecurity landscape – not the types of threats, but the rate at which those threats appear;
- How a team of AI engineers modernized MassMutual’s mainframe in 7 days (a process that previously took 3 months);
- Why MassMutual specifically avoided tokenmaxing to rein in AI usage and spending and is going “unlimited” to avoid cost overruns;
- How a “multi-harness type environment” will support agentic AI.
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.