
Enterprise AI programs rarely fail because of bad ideas. More often than not, they get stuck in uncontrolled pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and MassGeneral Brigham explained how they avoided that trap — and what the results look like when discipline replaces dispersion.
At MassMutual, the results are tangible: 30% developer productivity gains, IT help desk resolution time dropped from 11 minutes to one minute, and customer service calls dropped from 15 minutes to just a minute or two.
“We always start with why do we care about this problem?” Sears Merritt, head of enterprise technology and experience at MassMutual, said at the event. “If we solve the problem, how do we know we’ve solved it? And, how much value is attached to doing so?”
Defining metrics, establishing strong feedback loops
MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the entire business – customer support, IT, customer acquisition, underwriting, servicing, claims and other areas.
Merritt said his team follows the scientific method, starting with a hypothesis and testing whether the results will advance the business. Some ideas are great, but they may be difficult to execute in the business due to factors such as lack of data or access, or regulatory hurdles.
“We won’t move forward with an idea until we’re absolutely clear about how we’ll measure, and how we’ll define success.”
Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define a minimum level of quality before putting a tool into the hands of teams and partners.
That starting point creates an instant feedback loop. “One of the things we find slows us down is where there isn’t shared clarity on what outcome we’re trying to achieve,” Merritt said, which can lead to confusion and constant readjustment. “We don’t go into production unless there’s a commercial partner who says, ‘Yes, this works.'”
His team is strategic about evaluating emerging tools, and is “extremely rigorous” in defining what “good” means when conducting testing and measurement. For example, they perform trust scoring to reduce hallucination rates, establish thresholds and evaluation criteria, and monitor feature and output drift.
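The threshold idea can be sketched in a few lines. This is a hypothetical illustration, not MassMutual's actual evaluation code: the names (`TrustGate`, `trust_score`) are invented, and the scoring function is a toy proxy for groundedness, where real systems would use dedicated evaluators such as LLM judges or NLI models. The point is the pattern: define a minimum score before launch, and block anything that falls below it.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    score: float

class TrustGate:
    """Blocks a model response unless its trust score clears a preset threshold."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold

    def trust_score(self, answer: str, sources: list[str]) -> float:
        # Toy groundedness proxy: fraction of answer tokens that appear
        # in the retrieved source documents.
        answer_tokens = set(answer.lower().split())
        source_tokens = set(" ".join(sources).lower().split())
        if not answer_tokens:
            return 0.0
        return len(answer_tokens & source_tokens) / len(answer_tokens)

    def evaluate(self, answer: str, sources: list[str]) -> EvalResult:
        score = self.trust_score(answer, sources)
        return EvalResult(passed=score >= self.threshold, score=score)

gate = TrustGate(threshold=0.8)
grounded = gate.evaluate("the policy covers flood damage",
                         ["This policy covers flood damage and fire."])
ungrounded = gate.evaluate("premiums are waived forever",
                           ["This policy covers flood damage."])
```

Here the grounded answer clears the bar while the unsupported one is rejected, which is the "minimum level of quality" gate in miniature.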
Merritt also maintains a no-commitment policy – meaning the company doesn’t bind itself to using any particular model. It has what he calls an “incredibly heterogeneous” technology environment, combining best-of-breed models alongside mainframes running COBOL. That flexibility is not accidental: his team created common service layers, microservices, and APIs that sit between the AI layer and everything below it, so when a better model comes along, replacing it doesn’t mean starting over.
Because, Merritt explained, “the best of breed today may be the worst of breed tomorrow, and we don’t want to box ourselves in.”
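The "common service layer" idea can be illustrated with a thin interface between applications and vendors. This is an assumed sketch, not MassMutual's architecture: the provider classes and `CompletionService` are placeholders. The design point is that call sites depend only on a narrow interface, so swapping in tomorrow's best-of-breed model touches one constructor, not the application code.

```python
from typing import Protocol

class TextModel(Protocol):
    """The narrow interface every provider must satisfy."""
    def complete(self, prompt: str) -> str: ...

class VendorAModel:
    # Stand-in for one vendor's model behind the service layer.
    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"

class VendorBModel:
    # Stand-in for a newer, better model from another vendor.
    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"

class CompletionService:
    """The service layer apps call. It never names a vendor."""
    def __init__(self, model: TextModel):
        self._model = model

    def summarize_claim(self, claim_text: str) -> str:
        return self._model.complete(f"Summarize: {claim_text}")

service = CompletionService(VendorAModel())
# When a better model ships, replace the provider, not the app code:
service = CompletionService(VendorBModel())
result = service.summarize_claim("water damage in basement")
```

Structural typing (`Protocol`) keeps the vendors decoupled: neither class imports or inherits from anything shared, which mirrors how a heterogeneous stack avoids lock-in.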
Weeding instead of letting thousands of flowers bloom
Mass General Brigham (MGB), for its part, initially took more of a spray-and-pray approach.
About 15,000 researchers in the nonprofit health system have been using AI, ML and deep learning for the past 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event.
But last year, he made a bold choice: His team shut down several unregulated AI pilots. In the beginning, “we followed the let-a-thousand-flowers-bloom [methodology]. But we didn’t have a thousand flowers; we had maybe a few tens of flowers that were trying to bloom,” he said.
Like Merritt’s team at MassMutual, MGB moved toward a more holistic approach, examining why they were developing certain tools for specific parts of the workflow. They questioned what capabilities they wanted and needed, and what investments those required.
Sriraman’s team also spoke to its primary platform providers – Epic, Workday, ServiceNow, Microsoft – about their roadmaps. This was a “pivotal moment,” he said, as they realized they were building in-house tools that a vendor already provided (or was planning to).
As Sriraman said: “Why are we building it ourselves? We’re already on the platform. It’s going to be in the workflow. Take advantage of it.”
That said, the market is still nascent, which makes these decisions difficult. “The analogy I would give is when you ask six blind people to touch an elephant and say, what does this elephant look like?” Sriraman said. “You’ll get six different answers.”
Not that there is anything wrong with that; it’s just that, as the landscape changes, everyone is exploring and experimenting.
Instead of a Wild West environment, Sriraman’s team distributes Microsoft Copilot to users throughout the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token usage.
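One way to control token usage in such a landing zone is a per-experiment budget. This is a hypothetical sketch (MGB has not described its implementation; the `TokenBudget` class is invented): each sanctioned experiment draws from a fixed allowance, so a pilot can be tested safely without running away on cost.

```python
class TokenBudget:
    """Caps total token spend for one landing-zone experiment."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> bool:
        """Record usage and return True only if the request fits the budget."""
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True

budget = TokenBudget(limit=1000)
first = budget.charge(600)    # first experiment run fits
second = budget.charge(600)   # second run would exceed the cap, so it is refused
```

In practice the budget check would sit in the gateway that fronts the model API, with limits set per team or per pilot.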
He also began “intentionally including AI champions” in business groups. “It is the opposite of letting a thousand flowers bloom, carefully planting and nurturing them,” Sriraman said.
Observability is another big consideration. He describes real-time dashboards that monitor model drift and security and allow IT teams to control AI “a little more hands-on.” Health monitoring is important with AI systems, he said, and his team has established principles and policies around the use of AI, including access privileges.
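A drift dashboard needs a drift metric behind it. One common choice, sketched below as an assumption (the article does not say what MGB uses), is the population stability index (PSI), which compares the distribution of model outputs today against a baseline; the label names and data here are illustrative.

```python
import math
from collections import Counter

def psi(baseline: list[str], current: list[str], eps: float = 1e-6) -> float:
    """Population stability index over categorical model outputs."""
    labels = set(baseline) | set(current)
    b_counts, c_counts = Counter(baseline), Counter(current)
    total_b, total_c = len(baseline), len(current)
    score = 0.0
    for label in labels:
        p = b_counts[label] / total_b + eps   # baseline share of this label
        q = c_counts[label] / total_c + eps   # current share of this label
        score += (q - p) * math.log(q / p)
    return score

baseline = ["normal"] * 90 + ["flagged"] * 10   # output mix at launch
stable   = ["normal"] * 88 + ["flagged"] * 12   # this week: roughly unchanged
shifted  = ["normal"] * 50 + ["flagged"] * 50   # this week: clearly drifting
```

A common rule of thumb is that PSI below 0.1 means the distribution is stable, while above 0.25 signals significant drift worth an alert on the dashboard.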
In clinical settings, the guardrails are absolute: AI systems never issue final decisions. "There is always a doctor or a physician assistant to make decisions," Sriraman said. He cited radiology report generation as an area where AI is used heavily, but where the radiologist always signs off.
Sriraman was clear on data boundaries: “Thou shalt not expose PHI [protected health information]. Simple as that, right?”
And, importantly, there must be safety mechanisms in place. “We need a big red kill button,” Sriraman insisted. “We don’t put anything into an operational setting without it.”
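The kill-switch pattern is simple to sketch. This is an assumed illustration, not MGB's system: the flag here is in-memory, where a production version would live in shared configuration so that one press halts every running instance. The essential property is that every agent action checks the flag before doing anything.

```python
class KillSwitch:
    """The big red button: once pressed, all guarded actions halt."""

    def __init__(self):
        self.engaged = False

    def press(self):
        self.engaged = True

class GuardedAgent:
    """An agent that refuses to act while the kill switch is engaged."""

    def __init__(self, switch: KillSwitch):
        self.switch = switch

    def act(self, task: str) -> str:
        if self.switch.engaged:
            return "HALTED: kill switch engaged"
        return f"running: {task}"

switch = KillSwitch()
agent = GuardedAgent(switch)
before = agent.act("draft radiology summary")
switch.press()                       # operator hits the big red button
after = agent.act("draft radiology summary")
```

Because the check happens on every call rather than once at startup, the switch takes effect immediately, which is the whole point of the button.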
Ultimately, while agentic AI is a transformative technology, the enterprise’s approach to it should not be dramatically different. “There is nothing new in it,” said Sriraman. “You can swap the word AI for BPM [business process management] from the ’90s and 2000s. The same concepts apply.”