Cloudflare has become the latest web infrastructure giant to go down in the span of a month, replacing entire sites including X, ChatGPT, Spotify, Canva, and even outage-tracking Downdetector with error messages for hours this morning. Mehdi Daoudi, CEO and co-founder of Internet performance monitoring platform Catchpoint, says this is the latest in a series of outages that should be a “wake-up call” for companies.
“Everyone is putting all their eggs in one basket, and they’re surprised when a problem arises,” says Daoudi. “It’s in the company’s favor to make sure they have redundancies and flexibility.”
The outage comes after problems affecting Microsoft Azure and Amazon Web Services occurred within about a week of each other, shutting down large parts of the Internet that rely on the major providers to keep their websites running. Cloudflare similarly powers a large portion of the Internet. It keeps websites online with its content delivery network, while providing a variety of other services, including DDoS attack protection and DNS. Last year, the company said that about 20 percent of the Web runs through Cloudflare’s network. It also serves 35 percent of the companies on the Fortune 500 list, in addition to “millions” of other customers.
Cloudflare’s fast performance and security record make it a popular choice for websites around the world, but this latest outage draws attention to how centralized the web infrastructure industry has become. After an AWS outage shut down secure messaging app Signal, the service’s president, Meredith Whittaker, said the company had no other choice but to use a major cloud service provider to run it. “Practically the entire stack is owned by 3-4 players,” she wrote.
But even if companies rely on just a few web infrastructure providers, the latest series of outages makes it clear that they need a backup plan. “There will be disruptions and they will continue to happen. The blast radius will keep expanding,” Daoudi tells The Verge. “The question is, what are you doing about it?”
Although Microsoft and AWS linked their outages to issues with DNS, the system that translates website domain names into IP addresses, Cloudflare traced its outage to a file. “The root cause of the outage was a configuration file that is automatically generated to manage threat traffic,” said Jackie Dutton, a Cloudflare spokesperson. “The file exceeded the expected size of the entries and crashed the software system that handles traffic for Cloudflare’s many services.”
It may seem absurd that a problem with a single file could bring down massive amounts of the Internet, but for large companies like Cloudflare, it can. “When you operate infrastructure at the scale of Cloudflare, even small deviations can have big consequences,” Rob Lee, head of AI and research at the SANS Institute, tells The Verge. “These platforms are built for speed, so anything that delays or prevents decision making can cascade quickly. In high-performance environments, a delay of a millisecond can become a complete traffic blockage.”
According to Lee, a configuration file like Cloudflare’s informs “routing security policies, load balancing decisions, and how traffic is distributed globally.” If the file size suddenly increases, “it can trigger slow parsing, memory problems, CPU contention, or logic failures inside the systems that depend on it,” Lee says.
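To make the failure mode concrete, here is a minimal sketch of one common defense: checking a freshly generated file against a size limit before using it, and falling back to the last version known to work instead of crashing. This is not Cloudflare’s code. The limit `MAX_ENTRIES`, the `LoadResult` type, and the `load_generated_config` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical limit: the number of entries the consuming system was sized for.
MAX_ENTRIES = 200


@dataclass
class LoadResult:
    entries: list[str]      # the entries the system should actually use
    used_fallback: bool     # True if the new file was rejected
    reason: str = ""        # human-readable explanation when rejected


def load_generated_config(candidate: list[str],
                          last_known_good: list[str],
                          max_entries: int = MAX_ENTRIES) -> LoadResult:
    """Accept a newly generated entry list only if it fits the expected size.

    An oversized file is rejected and the previous good version is kept,
    so a bad generation step degrades freshness rather than availability.
    """
    if len(candidate) > max_entries:
        reason = f"{len(candidate)} entries exceeds limit of {max_entries}"
        return LoadResult(list(last_known_good), True, reason)
    return LoadResult(list(candidate), False)


if __name__ == "__main__":
    good = ["rule-a", "rule-b"]
    oversized = [f"rule-{i}" for i in range(500)]
    result = load_generated_config(oversized, good)
    print(result.used_fallback, result.reason)
```

The design choice in this sketch is to treat an automatically generated file as untrusted input: rejecting it costs some freshness, while loading it blindly risks the kind of crash Dutton describes.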
AWS similarly blamed “faulty automation” for triggering a series of issues that recently led to widespread outages, the kind of error that’s bound to happen again. “Are you going to complain about this every time Cloudflare sneezes?” Daoudi says. “Or are you going to build around it?”
