CDN provider Cloudflare suffered a massive outage today. Some of the world’s most popular apps and web services became inaccessible for several hours while the Cloudflare team struggled to recover entire parts of the Internet.
And this can be a good thing.
The proximate cause of the outage was pretty common: a bad configuration file triggered a latent bug in one of Cloudflare’s services. The file was very large (details are still hazy) and caused a widespread failure in Cloudflare’s operations. Perhaps there will be a useful post-mortem on canary releases and phased rollouts.
But the bigger problem, the ultimate cause behind today’s chaos, is the rapidly increasing centralization of the Internet and a society that treats the Internet as if it is always on and always working.
It’s not just “trivial” things like Twitter and League of Legends that were affected. A friend of mine made a scathing comment about his experience this morning:
“I couldn’t get air for my tires at two garages because of the cloud outage. You have to love the lack of flexibility in a design where the machine says ‘cash only’ and has no cash slot. Flat tires for everyone! Fabulous.”
We live in a society where every part of our lives is increasingly connected via the Internet: work, banking, retail, education, entertainment, dating, family, government IDs, and credit checks. And the Internet is increasingly tied into fewer and fewer points of failure.
This is ironic because the Internet was actually designed for decentralization, a system that governments could use to coordinate their response in the event of nuclear war. But because of the challenges of the Internet economy, bots, and things like scrapers, more web services are hidden in bastions like AWS or behind content delivery networks like Cloudflare.
Outages like today’s are a good thing because they are a wake-up call. They can force redundancy and flexibility into the system. They can push the pillars of our society – governments, businesses, banks – to provide credible alternatives when things go wrong (ideally ones that are completely offline).
You can compare this to how COVID-19 shook up global supply chains: The logic until 2020 was that you wanted your systems to be as lean and efficient as possible, even if that meant relying entirely on international supplies or holding as little excess stock as possible. After 2020, businesses realized that they needed to diversify and bring slack into the system to absorb shocks.
Just as growing one type of banana almost made bananas extinct, we are moving towards a society that cannot survive without digital infrastructure, and a digital infrastructure that cannot function without two or three key players. One day there is going to be an outage, bug or cyber attack from a hostile state that shows just how fragile that system is.
Accept outages, and build in redundancies.