
Cloudflare’s proxy service enforces limits to prevent excessive memory consumption, and the bot management system has “a limit on the number of machine learning features that can be used at runtime.” That limit is 200, well above the number of features actually in use.
“When a corrupted file containing over 200 features was transmitted to our servers, this limit was hit – resulting in a system panic” and errors being returned to users, Prince wrote.
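Cloudflare has not published the exact code path in this account, but the failure mode Prince describes (a fixed runtime cap whose violation is treated as unrecoverable) can be sketched roughly as follows. This is a minimal, hypothetical Rust illustration: MAX_FEATURES, load_features, and the error type are invented names, and the real proxy is structured differently.

```rust
// Hypothetical sketch of a hard feature cap; not Cloudflare's actual code.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
enum FeatureConfigError {
    TooManyFeatures { found: usize, limit: usize },
}

/// Parse a feature file into a preallocated, fixed-capacity table.
fn load_features(lines: &[String]) -> Result<Vec<String>, FeatureConfigError> {
    if lines.len() > MAX_FEATURES {
        return Err(FeatureConfigError::TooManyFeatures {
            found: lines.len(),
            limit: MAX_FEATURES,
        });
    }
    let mut table = Vec::with_capacity(MAX_FEATURES); // memory reserved up front
    table.extend(lines.iter().cloned());
    Ok(table)
}

fn main() {
    // A "bad" file with duplicated rows blows past the 200-feature cap.
    let bad_file: Vec<String> = (0..250).map(|i| format!("feature_{i}")).collect();

    // Unwrapping the error turns a recoverable parse failure into a panic
    // that takes down the whole worker, which is the failure mode described.
    let _features = load_features(&bad_file).unwrap();
}
```

The cap itself is a reasonable memory guard; in this sketch the damage comes from unwrapping the error, which turns an over-limit file into a crash rather than a rejected input.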
Worst Cloudflare outage since 2019
The volume of 5xx HTTP error status codes served by the Cloudflare network is normally very low, but it surged as the bad file spread across the network. “The surge and subsequent fluctuations indicate our system failing due to an incorrect feature file being loaded,” Prince wrote. “What was notable is that our system would recover for a period. This was very unusual behavior for an internal error.”
This unusual behavior was explained by the fact that “the file was being generated every five minutes by a query running on the ClickHouse database cluster, which was being gradually updated to improve permissions management,” Prince wrote. “Bad data was only generated when the query ran on the part of the cluster that had already been updated. As a result, either a good or a bad set of configuration files might be generated every five minutes and rapidly disseminated across the network.”
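Taking Prince’s description at face value, the alternation follows from the rollout mechanics: the file-building query runs every five minutes, and only nodes that had already received the permissions change returned duplicated rows. The toy Rust model below illustrates that dynamic; the node layout, feature counts, and duplication factor are made up for the example and are not Cloudflare’s data.

```rust
// Illustrative model of the rollout behavior described above; all numbers
// and names are hypothetical.

#[derive(Clone, Copy)]
struct Node {
    id: u32,
    permissions_updated: bool, // true once the gradual permissions change reached it
}

/// Simulate the metadata query that builds the feature file on one node.
/// On updated nodes the query sees the same rows through additional, newly
/// visible schemas, so the output grows past the 200-feature runtime limit.
fn generate_feature_file(node: Node) -> Vec<String> {
    let base: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    if !node.permissions_updated {
        return base;
    }
    // Hypothetical: duplicates appear once per extra schema the query now sees.
    let extra_schemas = 3;
    let mut rows = base.clone();
    for _ in 0..extra_schemas {
        rows.extend(base.iter().cloned());
    }
    rows // 240 rows here, i.e. over the limit
}

fn main() {
    // Only part of the cluster has received the permissions update.
    let cluster = [
        Node { id: 1, permissions_updated: false },
        Node { id: 2, permissions_updated: true },
        Node { id: 3, permissions_updated: false },
    ];

    // Every "five minutes" the job can land on a different node, so the
    // network alternately receives a good or an over-limit feature file.
    for (run, node) in cluster.iter().cycle().take(6).enumerate() {
        let rows = generate_feature_file(*node).len();
        let verdict = if rows > 200 { "BAD (panics the proxy)" } else { "good" };
        println!("run {run}: node {} -> {rows} rows, {verdict}", node.id);
    }
}
```

Run repeatedly, the job produces a mix of good and over-limit files depending on which node answers, which matches the on-again, off-again errors Prince describes, until the rollout reaches every node and only bad files remain.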
This fluctuation initially “led us to believe it might be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state,” Prince wrote.
Prince said Cloudflare addressed the issue by “stopping the creation and propagation of the bad feature file and manually inserting a known good file into the feature file delivery queue,” and then “forcing our core proxy to restart.” The team then worked on “resuming the remaining services that had reached a degraded state” until 5xx error code volumes returned to normal later in the day.
Prince said the outage was Cloudflare’s worst since 2019 and that the company is taking steps to protect against similar failures in the future. Cloudflare will work on “hardening the ingestion of Cloudflare-generated configuration files in the same way we do for user-generated input; enabling more global kill switches for features; eliminating the ability for core dumps or other error reports to impact system resources;” and “reviewing failure modes for error conditions in all main proxy modules,” according to Prince.
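The first two of those measures, hardening ingestion and adding kill switches, amount to treating a Cloudflare-generated file with the same suspicion as user input and failing closed when it does not validate. The Rust sketch below shows one hypothetical shape this could take; the names, the kill-switch flag, and the fallback behavior are illustrative assumptions, not Cloudflare’s design.

```rust
// Hypothetical sketch of the hardening described above: treat the generated
// feature file like untrusted input and fall back to a known-good copy
// instead of panicking. All names are illustrative.
const MAX_FEATURES: usize = 200;

struct FeatureTable {
    features: Vec<String>,
}

/// Validate a newly delivered file the same way user-supplied input would be.
fn validate(candidate: &[String]) -> Result<FeatureTable, String> {
    if candidate.is_empty() {
        return Err("empty feature file".into());
    }
    if candidate.len() > MAX_FEATURES {
        return Err(format!("{} features exceeds limit of {MAX_FEATURES}", candidate.len()));
    }
    Ok(FeatureTable { features: candidate.to_vec() })
}

/// Load the candidate file, but keep serving the last known-good table (or
/// disable the module via a kill switch) when validation fails.
fn reload(
    candidate: &[String],
    last_known_good: FeatureTable,
    kill_switch_enabled: bool,
) -> Option<FeatureTable> {
    match validate(candidate) {
        Ok(table) => Some(table),
        Err(reason) => {
            eprintln!("rejected feature file: {reason}");
            if kill_switch_enabled {
                None // bot management disabled rather than crashing the proxy
            } else {
                Some(last_known_good)
            }
        }
    }
}

fn main() {
    let good: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    let bad: Vec<String> = (0..250).map(|i| format!("feature_{i}")).collect();

    let baseline = validate(&good).expect("baseline file is valid");
    // The oversized file is rejected and traffic keeps flowing on the old table.
    let active = reload(&bad, baseline, false);
    println!("serving {} features", active.map(|t| t.features.len()).unwrap_or(0));
}
```

In this shape an oversized file is logged and rejected while traffic keeps flowing on the previous known-good configuration (or the module is switched off entirely), instead of the proxy panicking.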
While Prince can’t promise that Cloudflare will never suffer another outage of the same scale, he said past outages have “always pushed us to build new, more resilient systems.”