Cloudflare’s CEO has acknowledged that the company initially and incorrectly suspected that the widespread outage that took many websites offline on November 18 was caused by a DDoS attack. In a blog post detailing what happened, Matthew Prince explained that once the team realized its mistake, it was able to fix the issue. “This issue was not caused directly or indirectly by a cyberattack or any type of malicious activity,” he wrote. Instead, it was caused by a change to the permissions of one of the company’s database systems, which led to a problem with a file used by its bot management system.
The company’s bot management system uses a machine learning model to generate a bot score for every request traversing Cloudflare’s network. Customers rely on those scores to decide whether to allow or block specific bots from accessing their websites. One use of bot scores is blocking AI companies’ bots so that they cannot use a website’s content to train their LLMs. In July, Cloudflare launched an experiment called “pay per crawl,” which lets website owners get paid when AI bots crawl their pages.
Prince said the model relies on a “feature” configuration file to predict whether a request was automated. The feature file is refreshed every few minutes, and a change in the underlying mechanism that generates it altered the file’s size, which triggered the error. “As a result, HTTP 5xx error codes were returned by the core proxy system that handles traffic processing for any traffic that relies on the bot module for our customers,” Prince wrote.
The incident was Cloudflare’s worst outage in years. The company said it has not had an outage since 2019 that stopped the majority of core traffic from flowing through its network. Prince apologized for the issue on behalf of his team.
