The outage began around 11:20 UTC on November 18, cutting off access to customer sites and even locking Cloudflare’s own team out of their internal dashboard. According to a post-mortem published by CEO Matthew Prince, the root cause was a subtle regression introduced during a routine permissions change on their ClickHouse database cluster.
Engineers were rolling out a change designed to improve security by making users’ access to table metadata explicit. The update, however, had an unexpected side effect on the bot management system: a metadata query that had historically returned a clean list of columns from the default database suddenly started pulling duplicate rows from the underlying r0 database shard as well.
Prince explained the technical details in a blog post:
The change… resulted in all users having access to accurate metadata about the tables they have access to. Unfortunately, assumptions made in the past held that the list of columns returned by such a query would only include the ‘default’ database.
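In miniature, the failure is easy to reproduce. The sketch below is illustrative only: the struct, table name, and column names are invented for this example rather than taken from Cloudflare’s code. It simulates how a metadata lookup that no longer filters by database returns every column twice once the same tables become visible through both the default database and the underlying r0 shard:

```rust
// Illustrative sketch: simulates the effect of a metadata query that
// stops filtering on the database name. Names are hypothetical.
#[derive(Debug, Clone)]
struct ColumnMeta {
    database: String,
    table: String,
    name: String,
}

/// Old assumption: every returned row belongs to `default`, so take them all.
fn columns_unfiltered(rows: &[ColumnMeta], table: &str) -> Vec<String> {
    rows.iter()
        .filter(|c| c.table == table)
        .map(|c| c.name.clone())
        .collect()
}

/// Safer behaviour: restrict the lookup to the `default` database explicitly.
fn columns_default_only(rows: &[ColumnMeta], table: &str) -> Vec<String> {
    rows.iter()
        .filter(|c| c.table == table && c.database == "default")
        .map(|c| c.name.clone())
        .collect()
}

fn main() {
    let databases = ["default", "r0"];
    let feature_columns = ["bot_score", "ja4_fingerprint"];

    // After the permissions change, the same table is visible through both
    // `default` and the underlying `r0` shard, so the metadata query sees
    // every column twice.
    let mut rows = Vec::new();
    for db in databases {
        for col in feature_columns {
            rows.push(ColumnMeta {
                database: db.to_string(),
                table: "http_requests_features".to_string(),
                name: col.to_string(),
            });
        }
    }

    // Four entries (doubled) versus the expected two.
    println!("unfiltered: {:?}", columns_unfiltered(&rows, "http_requests_features"));
    println!("filtered:   {:?}", columns_default_only(&rows, "http_requests_features"));
}
```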
This additional data caused the “feature file”, a configuration set used to track bot threats, to roughly double in size. Cloudflare’s main proxy software pre-allocates memory for this file as a performance optimization, with a hard limit of 200 features. When the bloated file propagated across the network, it exceeded that limit, causing the bot management module to crash.
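Cloudflare has not published the exact code path, but the general failure pattern, a fixed pre-allocation overwhelmed by unexpectedly large input, can be sketched as follows. The 200-feature limit comes from the post-mortem; the function names, feature counts, and error-handling strategy here are assumptions for illustration. The point is that an oversized file should be rejected with an error the caller can handle rather than bringing down the module:

```rust
// Illustrative sketch of a hard cap on pre-allocated feature slots.
// Names and counts are hypothetical; only the 200-feature limit is
// taken from Cloudflare's published account.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
enum FeatureFileError {
    TooManyFeatures { got: usize, limit: usize },
}

/// Load feature names into a buffer pre-allocated for at most MAX_FEATURES
/// entries, rejecting oversized input instead of panicking.
fn load_features(lines: &[&str]) -> Result<Vec<String>, FeatureFileError> {
    if lines.len() > MAX_FEATURES {
        return Err(FeatureFileError::TooManyFeatures {
            got: lines.len(),
            limit: MAX_FEATURES,
        });
    }
    let mut features = Vec::with_capacity(MAX_FEATURES);
    features.extend(lines.iter().map(|l| l.to_string()));
    Ok(features)
}

fn main() {
    // A normal file fits comfortably under the cap...
    let normal: Vec<String> = (0..120).map(|i| format!("feature_{i}")).collect();
    let normal_refs: Vec<&str> = normal.iter().map(String::as_str).collect();
    assert!(load_features(&normal_refs).is_ok());

    // ...but the duplicated metadata roughly doubled the file, pushing it past
    // the limit. Returning an error keeps the caller in control instead of
    // crashing the module.
    let doubled: Vec<String> = (0..240).map(|i| format!("feature_{i}")).collect();
    let doubled_refs: Vec<&str> = doubled.iter().map(String::as_str).collect();
    match load_features(&doubled_refs) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("refusing feature file: {e:?}"),
    }
}
```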
The failure was challenging to diagnose because of how it presented. Because the database change was being rolled out gradually, the configuration file was regenerated every few minutes, sometimes by updated nodes and sometimes by nodes still running the old code, so the network flipped between a “good” state and a “bad” state. This erratic behavior initially convinced the engineering team that they were fighting a hyper-scale DDoS attack rather than an internal bug. The confusion peaked when Cloudflare’s independently hosted status page also went down, a complete coincidence that led some to believe support infrastructure was being targeted as well.
One commenter on a Reddit thread about the outage quipped:
You don’t realize how many websites use Cloudflare until it stops working. Then you try to see how many websites use Cloudflare, but you can’t because all the Google results that answer your question also use Cloudflare.
“There was a time when our network was not able to route traffic, much to the chagrin of every member of our team,” Prince wrote, describing the incident as the company’s most significant outage since 2019.
While users grappled with the outage, Cyber Couture CEO Dickie Wong pointed to the incident as validation of multi-vendor strategies. He commented that while Cloudflare offers a great suite of tools, “love is not the same as marriage without a prenup.” Wong argued that sound risk management requires a deliberate shift toward proactive multi-vendor and hybrid strategies to avoid the “single-point-of-failure physics” that defined this outage.
This sentiment was echoed by users on the r/webdev subreddit, where user CrazyRebel123 noted the fragility of the current Internet landscape:
The problem these days is that you have a few big companies that run or own most of the things on the Internet. So when one of them goes down, the entire Internet goes in that direction. Most sites now run on AWS or some other type of cloud service.
Senior technology leader Jonathan Bee reinforced this view on LinkedIn, criticizing the tendency of organizations to bet the farm on a single vendor for the sake of “simplicity.”
It’s simple, yes – right until that vendor outage that everyone is tweeting about… People call hybrids ‘old school’, but honestly? This is just responsible engineering. It’s acknowledging that disruptions happen, no matter how big the logo on the edge of the cloud.
Service was eventually restored by manually pushing a known-good version of the configuration file into the delivery queue. Traffic flow returned to normal by 14:30 UTC, and the incident was fully resolved by late afternoon. Cloudflare says it is now reviewing failure modes across all of its proxy modules to ensure that memory pre-allocation limits handle malformed or oversized input more gracefully in the future.
