A Cloudflare outage took out a large portion of the internet on Tuesday, leaving users unable to access many sites and services like X, ChatGPT, Spotify, YouTube and Uber. The cybersecurity company has now published a blog post explaining what exactly happened.
Cloudflare co-founder and CEO Matthew Prince apologized in a post late Tuesday, saying the outage was the company’s worst since 2019.
“We haven’t had another outage in 6+ years that stopped most of the core traffic from flowing through our network,” Prince said. “On behalf of the entire Cloudflare team, I want to apologize for the Internet disruption today.”
Prince explained that the Cloudflare outage was caused by a problem in the system it uses to protect websites from DDoS attacks.
Cloudflare’s outage, explained
Cloudflare’s bot management system is a service that protects websites from malicious bot attacks. These include DDoS attacks that flood websites with traffic, content scraping attacks that collect data from websites without authorization, and automated credential stuffing attacks that attempt to gain access to websites using login details leaked from other sites.
This bot management system includes an AI model that scores traffic requests. Whenever an attempt is made to access a website protected by Cloudflare’s bot management, the AI generates a score to determine whether it is likely to be from a bot. To do this, the AI considers various characteristics of the request, which are contained in a “feature file”.
The feature file is where the problem originated. This file is refreshed every five minutes to stay up to date with emerging bot behaviors, and is used across Cloudflare’s entire network. However, the company changed the underlying query that generates the file, which led to a large amount of information being duplicated. The feature file grew larger than normal as a result, triggering an error in the bot management system.
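To make the failure mode concrete, here is a minimal Python sketch of the kind of safeguard that keeps an oversized file from replacing a working one. It is an illustration only, not Cloudflare’s code: the `FeatureFile` type, the `refresh` function and the `MAX_FEATURES` cap are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical cap on feature entries; the value is an assumption for this sketch.
MAX_FEATURES = 4


@dataclass(frozen=True)
class FeatureFile:
    """Illustrative stand-in for a periodically regenerated feature file."""
    version: int
    features: tuple[str, ...]


class FeatureFileTooLarge(Exception):
    """Raised when a candidate file exceeds the allowed number of entries."""


def validate(candidate: FeatureFile, max_features: int = MAX_FEATURES) -> FeatureFile:
    # Reject the file outright instead of letting the consumer fail later.
    if len(candidate.features) > max_features:
        raise FeatureFileTooLarge(
            f"version {candidate.version} has {len(candidate.features)} entries, "
            f"limit is {max_features}"
        )
    return candidate


def refresh(current: FeatureFile, candidate: FeatureFile,
            max_features: int = MAX_FEATURES) -> FeatureFile:
    """Adopt the candidate if it validates; otherwise keep the last good file."""
    try:
        return validate(candidate, max_features)
    except FeatureFileTooLarge:
        return current


# A query change that duplicates rows inflates the file past the cap.
last_good = FeatureFile(version=1, features=("ip_reputation", "user_agent"))
duplicated = FeatureFile(version=2, features=last_good.features * 3)

active = refresh(last_good, duplicated)
print(active.version)  # prints 1: the oversized file is rejected
```

In this sketch the oversized version 2 never becomes active, which mirrors the fix Prince describes: stopping the larger-than-expected file and falling back to an older version.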
As a result, attempts to access websites that used Cloudflare’s bot management system returned an error. Cloudflare says critical failures began occurring on its network about 15 minutes after the change to feature file generation was implemented.
Cloudflare initially suspected that the outage was a malicious attack, especially since its status page went down despite being independent of the company’s infrastructure. However, Prince said it turned out to be a coincidence.
“This problem was not caused directly or indirectly by a cyberattack or any type of malicious activity,” Prince stressed. “While we initially incorrectly suspected that the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the root problem and were able to stop the spread of the larger-than-expected feature file and replace it with an older version of the file.”
When reached by Mashable ahead of the blog post, a Cloudflare spokesperson also emphasized that “there [was] no proof that this [the outage] was the result of an attack or caused by malicious activity.”
Cloudflare’s services were largely restored within roughly three hours, and fully restored after about five hours. Prince said the company is already planning measures to prevent similar outages in the future, including preventing error reports from overwhelming its systems.
