A few years ago, I gave a conference talk called “Build vs. Buy: Software Systems in Jurassic Park,” where I argued that the real villain wasn’t the Velociraptor or the T-Rex. It was Dennis Nedry’s custom software. The park’s catastrophic failure wasn’t just about a disgruntled programmer; it was about building critical infrastructure that should have been bought. You can watch the whole thing here, but this week’s events make the lesson worth a second look.
In the span of a few days, we’ve watched some of the Internet’s most critical infrastructure go down. Cloudflare had a major outage today that took a large portion of the web offline. GitHub went down. AWS had problems last week. And while each failure has its own specific cause, they all highlight the same fundamental problem: we’ve built our businesses on abstractions we don’t understand, controlled by companies we can’t influence.
The simple rule everyone gets wrong
Here’s the thing: if your core business function depends on a capability, you should own it if at all possible. You need to control your destiny, and you need to take every opportunity to become better than your competitors. If you just buy “that thing you do,” why would anyone buy it from you?
But tech leaders consistently get this backwards. They’ll spend months building their own analytics tools while running their entire product on a cloud provider they don’t understand. They’ll build artisanal monitoring solutions while their actual business logic, the stuff customers pay for, runs on someone else’s computer.
The infrastructure trap
Of course, there are exceptions. Sometimes you can’t build the thing you depend on because you lack the expertise or the capacity. As a software provider, I need servers, networks, and datacenters to deliver my software, but I can’t afford to build a datacenter.
But this is where most companies go wrong: just because I need some infrastructure doesn’t mean I need a full cloud provider. I need some servers. I don’t need a globally redundant PaaS that lets me ignore how computers work. In my experience, that’s an outage waiting to happen.
That’s what I mean by controlling your destiny. Building my product on hardware is transparent. When something goes wrong, it’s understandable. A DIMM failed. We lost a drive. A machine needs to be replaced. It makes sense, and I have a timeline and options I control.
But with cloud providers, there are millions of lines of code between my content and anything real, and no one really understands how it all works. When Cloudflare’s bot management system started choking on a malformed configuration file today, it took down services that had nothing to do with bot management. When something breaks, it can take hours for anyone to acknowledge the problem, and there is little transparency about how long the fix will take. Meanwhile, your customers are yelling.
The right way to think about it
This has shaped our philosophy on how we decide whether to build or buy software:
Build what delivers your value. If I need something to deliver my product, I try as hard as I can to build it myself. I want to own it. I want to control it. I don’t want to depend on anyone else or suffer their mistakes. If I can’t build it for cost or expertise reasons, I want to buy the simplest thing available: something with the thinnest possible abstraction layer.
Buy everything else. If I don’t need it to deliver my service, I want to buy it. I want to buy analytics. I want to buy CRM. I want to buy business operations tools.
There are some things you should probably buy, even if you don’t buy them from me.
These aren’t your core business. Someone else has already solved these problems. Building them yourself is like Jurassic Park deciding to build its own door locks. How did that work out?
The abstraction problem
The real danger isn’t in buying software; it’s in buying abstractions so complex that you can’t understand what’s happening when they fail. This week’s Cloudflare outage is a perfect example. A database permissions change caused a configuration file to double in size, which exceeded a hard-coded limit in their proxy software and caused 5xx errors across their network.
How many layers of abstraction is that? If this were your system, how many of those layers could you debug?
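To make that chain concrete, here is a minimal Python sketch of the failure mode under loudly stated assumptions: this is not Cloudflare’s code, and `MAX_FEATURES`, `generate_rows`, `parse_config`, `serve_fragile`, and `serve_defensive` are hypothetical names, with the limit of 200 and the 150-row file picked purely for illustration. It shows how an upstream change that doubles a generated file can trip a hard-coded limit and turn every response into a 5xx, plus one common way to contain it: keep serving with the last known-good config.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical hard-coded limit, picked for illustration only.
MAX_FEATURES = 200


class ConfigTooLarge(Exception):
    """The generated file has more entries than the proxy was built to hold."""


@dataclass(frozen=True)
class FeatureConfig:
    features: tuple[str, ...]


def generate_rows(permissions_changed: bool) -> list[str]:
    """Hypothetical upstream query: after the permissions change it returns
    every row twice, so the generated file doubles in size."""
    rows = [f"feature_{i}" for i in range(150)]
    return rows + rows if permissions_changed else rows


def parse_config(rows: list[str]) -> FeatureConfig:
    # The proxy assumes the file always fits under the hard-coded limit.
    if len(rows) > MAX_FEATURES:
        raise ConfigTooLarge(f"{len(rows)} entries exceeds limit of {MAX_FEATURES}")
    return FeatureConfig(tuple(rows))


def serve_fragile(rows: list[str]) -> int:
    # Fragile: a bad config file turns every request into a 5xx.
    try:
        parse_config(rows)
        return 200
    except ConfigTooLarge:
        return 500


def serve_defensive(rows: list[str], last_good: FeatureConfig) -> tuple[int, FeatureConfig]:
    # Defensive: reject the bad file and keep serving with the last known-good config.
    try:
        config = parse_config(rows)
    except ConfigTooLarge:
        config = last_good
    return 200, config


if __name__ == "__main__":
    good = parse_config(generate_rows(permissions_changed=False))
    bad_rows = generate_rows(permissions_changed=True)
    print(serve_fragile(bad_rows))             # 500: 300 entries > 200
    print(serve_defensive(bad_rows, good)[0])  # 200, still on the old config
```

Falling back to a stale config trades freshness for availability. It is one option among several, and whether it fits depends on whether old config is safer than no config for your system. Either way, you can only make that call if you understand the layers well enough to know the limit exists.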
When you build on top of these huge platforms, you’re not just outsourcing your infrastructure – you’re outsourcing your ability to understand and fix problems. You’re trading control for convenience, and when that convenience fails, you’re left helpless.
Learning from the dinosaurs
In Jurassic Park, they built everything themselves because they thought they were special. They felt their needs were unique. They thought they could do it better. They were wrong.
But it would have been equally wrong to outsource everything to InGen Cloud Services™ and hope for the best. The answer doesn’t lie in extremes – it lies in being thoughtful about what you make and what you buy.
Build what makes you unique. Buy what keeps you running. And whatever you do, make sure you understand how it works well enough to fix it when it breaks.
Because it will break. And when it does, “We’re experiencing higher than normal call volumes” won’t cut it with your customers.
Todd Gardner is the CEO and co-founder of TrackJS, RequestMetrics, and CertKit. He has been building software for over 20 years and has strong opinions about JavaScript, infrastructure, and dinosaurs.