You only live once, self host a NAT Gateway

Society would have you believe that hosting a NAT gateway yourself is “crazy”, “irresponsible” and potentially even “dangerous”. But in this post I hope to shed light on why someone would go this route, what the benefits are, and what my actual experience was implementing it in a real engineering organization.

What is a NAT gateway?

It's important to start with why: why would anyone think about replacing a core part of their AWS infrastructure, and what does a NAT gateway even do? For those unfamiliar, a NAT gateway acts as a one-way door to the Internet: resources in your private subnets can make outbound connections through it, but nothing on the Internet can initiate a connection back in. This is an important part of good network design. If inbound traffic were allowed through, it would create a major security issue, since anyone on the Internet could reach your internal services. Think of a NAT gateway as a bouncer at a club, except this club only lets people out and no one can get in.

NAT gateway diagram

The problem this creates is a bottleneck: your internal services still have to talk to the Internet (think of any external API call), so your entire infrastructure depends on NAT gateways to handle outbound Internet traffic.

AWS has entered the chat

AWS knows this: people need highly available, high-uptime NAT gateways to get things done. And because of this requirement, they can charge (in my opinion) exorbitant amounts of money to provide the service. What are you going to do? After all, they can guarantee that this critical part of your infrastructure will be scalable and highly available when your ChatGPT wrapper blows up!

DevOps and infrastructure engineers know the pain of seeing the NAT Gateway Hours and NAT Gateway Bytes line items on an AWS bill. Society is breathing down your neck, saying “There’s nothing you can do about it” and “Think of it as the cost of doing business”. I tell them: you are wrong, you can do whatever you set your mind to.

Why would you even think like this?

Before we dive into my implementation, I think it’s important to point out that this is not a one-size-fits-all solution. I recently worked with Vitalize to speed up some of their GitHub workflows. We decided to self-host the GitHub runners in their own private subnet, along with a very robust and deep set of integration tests that run on every PR. Because of this, the major cost was NAT gateway bytes, due to the huge amount of traffic going through their private subnets.

This was the main motivation for looking into alternatives. We still run NAT gateways in production (for now), but in an environment where the stakes are lower and the costs are high, the ability to remove potentially 10-15% of your daily AWS bill is quite attractive (depending on how much NAT gateway costs contribute to your overall spend).

Options

So you’ve made it this far – now it’s time to start shopping. The good news is that there are heroes in the open source community who have done most of the heavy lifting for you. In my research, I found 2 major options.

Option 1: Fck-NAT

This is the main option people come across when they first look into this. It’s built around a purpose-built AWS AMI that Andrew Guenther maintains. There are some limitations, as explained in its public-facing docs, but in general it’s fairly straightforward. They have a Terraform module that makes it fairly intuitive to set things up, which I go into in more depth in the Implementation section below.

Option 2: alterNAT

This is another option that I thought deserved a special mention. Maintained by Chime, this is a much more thorough and production-ready alternative to Fck-NAT.

As mentioned above, a self-hosted NAT gateway runs on an EC2 instance, so if anything happens to that instance it becomes a single point of failure for outbound traffic. The way alterNAT/Chime has solved this problem is quite clever (and complex). From my initial understanding, they run instances across availability zones (similar to Fck-NAT) to survive downtime in any one AZ. But they take it a step further and use a Lambda function to continuously poll each EC2 instance and check that it is behaving as expected. Combined with a standby NAT gateway, this lets you fail over to an AWS-managed NAT gateway as soon as the EC2 instance is detected as unhealthy. Although this will not result in zero downtime, it minimizes disruption by automatically updating the route tables.
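The health-check-and-failover idea described above can be sketched as a small state machine. This is a simplified illustration under my own assumptions, not the Chime project's actual code: the `RouteState` type, the `next_state` function, the failure threshold of 3, and the resource IDs are all hypothetical, and the step that would really update the route table is only marked with a comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteState:
    target: str                    # where 0.0.0.0/0 in the private route table points
    consecutive_failures: int = 0  # failed health checks in a row


def next_state(state: RouteState, check_passed: bool,
               standby_nat_gateway_id: str,
               failure_threshold: int = 3) -> RouteState:
    """Decide where the default route should point after one health check."""
    if state.target == standby_nat_gateway_id:
        # Already failed over; failing back is out of scope for this sketch.
        return state
    if check_passed:
        return RouteState(state.target, 0)
    failures = state.consecutive_failures + 1
    if failures >= failure_threshold:
        # A real implementation would update the route table here
        # (for example through the EC2 ReplaceRoute API).
        return RouteState(standby_nat_gateway_id, failures)
    return RouteState(state.target, failures)


# Hypothetical IDs, for illustration only.
state = RouteState(target="i-nat-instance")
for passed in (True, False, False, False):
    state = next_state(state, passed, standby_nat_gateway_id="nat-standby")
print(state.target)
```

Requiring several consecutive failures before switching is a design choice to avoid flapping on a single dropped probe; the right threshold and polling interval depend on how much downtime you can tolerate.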

alterNAT network diagram

I encourage people to check out this repo, as it is quite feature-packed; their network diagram is included above. We didn’t end up using it because it was overkill for our purposes. Additionally, it relies on a standby NAT gateway, which I was trying to eliminate entirely. If I ever roll this out to production, this is the approach I’ll take.

Implementation

In this implementation, since it was primarily a cost-cutting exercise, I decided to go with Fck-NAT. If this were a production environment, alterNAT’s fallback mechanisms and robustness would be more attractive. But in this case I really wanted to remove the NAT gateway cost from our development environment completely.

I ended up adopting the official Terraform module recommended by Fck-NAT. You can see an excerpt from our network module below.

module "fck_nat" {
  source  = "RaJiska/fck-nat/aws"
  version = "1.3.0"
  # Two instances when the feature flag is on, none otherwise
  count   = var.use_fck_nat ? 2 : 0

  name          = "${var.company_name}-fck-nat-${count.index + 1}"
  vpc_id        = aws_vpc.main.id
  # Each instance goes into a different public subnet
  subnet_id     = module.subnets.public_subnet_ids[count.index]
  instance_type = var.fck_nat_instance_type

  tags = {
    Name        = "${var.company_name}-fck-nat-${count.index + 1}"
    Environment = var.env
  }
}

We implemented this using two t4g.nano instances. The cutover caused about 15-30 seconds of downtime in our development environment, so we did it in the middle of the night to avoid any angry developers.

Results

In our case, the results were quite dramatic. To begin with, we were able to cut NATGateway-Hours by 50%: we maintain a development and a production environment, and we have completely eliminated NAT gateways in development.

hourly cost results

But the more surprising and dramatic cost savings were in NATGateway-Bytes. As mentioned, in this case we had self-hosted GitHub runners and preview environments that drove a lot of traffic when developers were active. During the week, we would regularly see $30-$40 of traffic charges per day. After implementing this change, the highest we have seen is closer to $6.

I think most of that traffic was driven by two main factors:

  1. Each PR creates a preview environment, which then runs a full suite of Playwright tests. This happens for every PR, on every commit. Although the compute overhead was fairly low because the tests were not too demanding, I believe the amount of traffic they generated contributed to the cost.
  2. I believe the main cost of the self-hosted runners was actually streaming the logs back to GitHub. I spot-checked some of our tests (unit, integration, etc.), and almost every single log file downloaded from GitHub was ~40-50MB in size. Doing some math, about 5-6 tests per commit means about 250MB per commit, and assuming the average PR has about 5 commits, that’s about 1.25GB of data being streamed back to GitHub (and through our AWS NAT gateway) per PR. This can easily start to add up, and I believe it also contributed to our high costs.
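The back-of-the-envelope math in the second point can be reproduced as follows. This is just a sketch using the midpoints of the ranges quoted above; the constant and function names are mine, and real log sizes will vary from repo to repo.

```python
# Midpoints of the ranges quoted in the text (rough estimates, not measurements).
LOG_MB_PER_TEST = 45     # ~40-50MB per log file
TESTS_PER_COMMIT = 5.5   # about 5-6 tests per commit
COMMITS_PER_PR = 5       # assumed average number of commits per PR


def log_traffic_per_pr_gb(log_mb: float = LOG_MB_PER_TEST,
                          tests: float = TESTS_PER_COMMIT,
                          commits: float = COMMITS_PER_PR) -> float:
    """Estimated log data streamed back to GitHub per PR, in decimal GB."""
    per_commit_mb = log_mb * tests        # roughly 250MB per commit
    return per_commit_mb * commits / 1000


print(f"{log_traffic_per_pr_gb():.2f} GB per PR")
```

Multiply that by the number of PRs opened per day and the log traffic alone becomes a noticeable share of what flows through the NAT gateway.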

bytes cost result

Another interesting data point that might be relevant to anyone thinking about implementing this: as mentioned, we went with two t4g.nano instances. During the week, we see peaks of 800GB-900GB of traffic per day, and these two instances handle that load easily, without any degradation that developers can feel.
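To put those traffic peaks in context, here is a rough estimate of what the same volume would cost in data processing charges on a managed NAT gateway. The $0.045 per GB rate is an assumption on my part (check the current AWS price for your region), and the function name is just for illustration.

```python
# ASSUMPTION: managed NAT gateway data processing price in USD per GB.
# Verify against current AWS pricing for your region before relying on it.
ASSUMED_PRICE_PER_GB = 0.045


def nat_processing_cost_per_day(gb_per_day: float,
                                price_per_gb: float = ASSUMED_PRICE_PER_GB) -> float:
    """Daily NAT gateway data processing cost for a given traffic volume."""
    return gb_per_day * price_per_gb


for gb in (800, 900):
    print(f"{gb} GB/day -> ${nat_processing_cost_per_day(gb):.2f}/day")
```

Under that assumed rate, peak days land in the same ballpark as the $30-$40 daily charges we were seeing before the change.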

data results

Overall, across these two line items, we have typically seen about a 70% reduction in NAT gateway cost, which has had quite an impact on this organization’s total daily spend.

Conclusion

This may not be for every organization, but if you find yourself spending a lot of money on NAT gateways, and you have environments where the stakes are lower (e.g. a development or staging environment), then self-hosting a NAT gateway is worth considering. It is much simpler than you might expect: the open source community has made it straightforward with Terraform modules that work out of the box.

Sometimes, society can be wrong. Change requires risk takers – courageous humans who choose not to listen to the status quo. You only live once – host your own NAT gateway.


