Tangled Mess

Once upon a time, our software was simple. All it needed to go online was a database, a user interface, and some glue code between them running behind a web server farm. Most of the business-critical data lived in the database. People rented space in co-location centers and purchased servers to run their databases and code. Those co-location centers were not reliable, and your business could go down due to a power or cooling failure. The servers were not reliable either, and you could lose data to disk failures. A cable failure could shut down the network for hours or days. To avoid such booboos, IT teams rented space in secondary co-location centers, purchased servers and expensive replication systems to copy data to those secondary sites, and rehearsed the steps to bring up the secondary sites in the event of a disaster.

Businesses were concerned about doing business with other companies whose infrastructure failures could harm them, so legal people got involved and added contractual language to their agreements. Those clauses required vendors to implement disaster recovery policies. In some cases, the contracts also required documented procedures for recovering critical systems at the secondary site, testing them once or twice a year, and sharing the results. Since violating such contracts could terminate a business agreement, companies hired compliance professionals to draft rules for their technical teams to follow. Tech teams reluctantly followed those rules, making sure their backups and replication were working and that they could bring up the most critical systems at the secondary co-location centers. Paying such a disaster recovery tax was justified because the infrastructure was unreliable.

It was fine until it got cloudy. Cloud providers began offering a variety of infrastructure services for building software, and open source gave us even more flexibility to introduce new software architecture patterns. As engineers, we liked those options and started pushing for them, adopting whatever the cloud companies offered. Of course, this gave us tremendous speed without requiring years of engineering investment to build complex distributed systems. Those services also began offering greater fault tolerance and availability than most enterprises could build on their own. To take advantage of that flexibility, systems had to move from on-premises to the cloud. Many people have built their careers around cloud transformation, and that continues to this day.

For a while, cloud services were unreliable, and it was up to customers to keep their business running even when the cloud provider’s systems were down. It seemed sensible to work around those problems by building redundancy in another cloud region. Many tried that pattern. Some, like Netflix, succeeded, or at least were known to be successful; I don’t know whether that is still the case today. Many others had partial success running some stateless parts of the business from multiple cloud regions.

Around the same time, the SaaS industry took off. The proliferation of online systems increased both complexity and the enterprise appetite for automation, creating opportunities for SaaS companies to fill that gap with a variety of services, from infrastructure to customer service to finance to sales and marketing. Relying on third-party SaaS became a necessity for every enterprise. You can no longer take code to production without relying on another company’s subscription or pay-as-you-go service. The net result of this flexibility and abundance is that almost everything is now interconnected.


We are now in a tangled mess. There are no independent systems anymore; the fate of our systems is shared with much of the Internet. Almost every business now depends on other companies, mostly by consuming their services. Thus, all bets are off on the old playbook of secondary sites. Not only do you need to get your own part right, you also need every one of your dependencies to be completely redundant. Most companies cannot even make their own software redundant across multiple locations, given the variety of services they build, their interconnectedness, and their infrastructure needs. After all, building highly available and fault-tolerant systems takes more discipline, talent, and time than most enterprises can muster. Let’s not deceive ourselves.

Where does this leave us?

First, let’s break free from the old paradigm of redundancy in secondary sites. That is overly simplistic thinking. It no longer makes sense for most companies to waste their precious resources on building redundancy across multiple cloud regions. Yes, cloud providers will fail from time to time, as was the case with the AWS us-east-1 outage last week. Yet they are still incentivized to invest billions of dollars and years of effort in the resiliency of their infrastructure, services, and processes. As for you, instead of focusing on redundancy, invest in learning to use those cloud services properly. These days, most cloud services offer knobs to help their customers avoid disasters, such as automatic backups, database failover, and deployment across multiple availability zones. Know what those knobs are, and use them carefully.
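As a small illustration (a minimal sketch, assuming an AWS RDS database managed with Python and boto3; the instance identifier is hypothetical), two of those knobs can be turned on in a few lines:

```python
# Sketch: enable two resiliency "knobs" on an existing RDS instance via boto3.
# The instance identifier is hypothetical.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="orders-db",   # hypothetical instance name
    MultiAZ=True,                       # standby replica in another availability zone
    BackupRetentionPeriod=7,            # keep automated backups for 7 days
    ApplyImmediately=False,             # apply during the next maintenance window
)
```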

Second, if you really need, and care about, five or more nines of full (i.e., not brown-out) availability for your business, make sure your business can afford the cost. To achieve such availability, you have to do several things right. You need talent that understands how to build highly available, fault-tolerant systems; in most cases you have to develop it in-house, because such talent is rare. You then need to standardize patterns like cells, global caches, replication systems, and eventual consistency for every critical piece of code you create, and invest in paved paths that make those patterns easy to follow. Implementing those patterns takes time, and you need to get them right. Most importantly, you need a disciplined engineering culture that prioritizes high availability and fault tolerance in every decision, one that embraces limited options and sacrifices flexibility in favor of availability and fault tolerance.
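To make the cell pattern concrete (a sketch only, with hypothetical names; a production cell router would also handle capacity, cell migration, and health checks), the core idea is a thin routing layer that deterministically pins each customer to one self-contained cell, so a failure stays contained within that cell:

```python
# Sketch of cell-based routing: each customer is pinned to one self-contained
# cell (its own stack of services and data), so a failing cell only affects
# the customers assigned to it. Names and cell list are hypothetical.
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3"]  # each cell is an independent deployment

def cell_for(customer_id: str) -> str:
    """Deterministically map a customer to a cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# A request router would look up the cell and forward the request there.
print(cell_for("customer-42"))  # always routes to the same cell
```

A simple hash-modulo mapping like this remaps customers whenever the cell list changes; real implementations usually keep an explicit assignment table so customers can be moved between cells deliberately.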

Third, somehow get your compliance and legal people to refine or drop dated contractual language like “secondary sites.” Unless your company’s architecture is stuck where it was 20 years ago, such language means nothing. Refining it may be easier said than done, because some contracts are old and getting the wording right may be the lowest priority for your legal teams. Don’t get me wrong: you still need to invest in game days and similar failure-embracing practices to build resilience into your culture. But how you practice resilience needs to evolve.

We are truly in a tangled mess. Flexibility is expensive. Know what you need, figure out the business cost, and do what you can.


