A ‘small’ vanilla Kubernetes install on NixOS

Kubernetes is a complex piece of technology that abstracts away many system administration tasks, but it also solves and automates some processes that are useful at a smaller scale, like blue-green deployments. Having administered managed Kubernetes for a while now, I wanted to find out what a self-managed, small-but-multi-node Kubernetes install looks like.

Most of the non-Kubernetes machines I manage are individual machines, or a single database server plus multiple workers. For this step I’m not interested in going much further than that (full redundancy, self-healing, and so on); I just want to introduce Kubernetes into something that matches my existing setups.

Getting things fully functional was a long process of trial-and-error, during which I learned about even more things I didn’t want to touch:

  • Public-Key Infrastructure (PKI). Kubernetes definitely leans into this and prefers you manage keys and certificates for all of its components, but I feel like this is a whole separate article in itself.

  • The NixOS Kubernetes modules. These have their own opinions, and there’s nothing wrong with their implementation, but using them goes against some of the learning and experimenting I wanted to do here.

  • K3s, K0s or any other Kubernetes ‘distribution’. These are an extra layer to learn, and an extra layer to trust. They sometimes offer valuable extra functionality, for example I wish the SQLite backend was in upstream Kubernetes. But again, I avoided these in the interest of learning.

NixOS in general is great, and I’m a big fan, but something Kubernetes can potentially do well (in terms of configuration) is provide a clear boundary between the system and application. In NixOS, configuring an app is often interwoven with system config, and there’s a lack of options to prevent that.

Still, I’ll be using the Kubernetes package (not module!) from Nixpkgs, as well as building everything on top of NixOS and its excellent systemd parts.

A fully functioning QEMU setup for the end result can be found at:
https://codeberg.org/kosinus/nixos-kubernetes-experiment

Basic NixOS configuration

At the time of writing, NixOS 25.11 is mere weeks away, so that is my target.

There’s a bunch of stuff I enable on all of my NixOS machines that is relevant to the rest of this article.

I prefer nftables over iptables, because it’s the future. In practice, the
iptables command is already a compatibility layer in many Linux distributions, but these options additionally enable the nftables-based firewall in NixOS:

{
  # Use the nftables-based firewall implementation.
  networking.nftables.enable = true;

  # Also filter forwarded traffic, which we'll need for container routing.
  networking.firewall.filterForward = true;
}

I enable systemd-networkd, because it’s the future. I wouldn’t even know how to set up all the networking parts in other setups; systemd-networkd is just really nice when you have a bunch of moving parts in your networking.

{
  networking.useNetworkd = true;
}

Kubernetes version

The current version of Kubernetes at the time of writing is 1.34. It’s useful to check the package version, because Kubernetes requires step-by-step minor version upgrades:

{ lib, pkgs, ... }:
{
  
  
  assertions = [
    {
      assertion = lib.hasPrefix "1.34." pkgs.kubernetes.version;
      message = "Unexpected Kubernetes package version: ${pkgs.kubernetes.version}";
    }
  ];
}

Networking

If you’ve ever used Docker or Podman, your typical networking setup looks like this:

[Diagram: a machine with a host network namespace (eth0) and a bridge; each container attaches to the bridge through a veth pair.]

The machine is logically split up into host and container network namespaces. Each container is assigned half of a veth pair; the other half is part of a bridge interface on the host. The host assigns a subnet to the bridge with an address for itself, like 172.16.0.1/24, and an address for each container. The host is then the gateway for containers, performing layer 3 routing and NAT on outgoing traffic to the internet.
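
To make that concrete, here is roughly the equivalent spelled out as manual commands. This is only an illustration; the interface names, the namespace variable and the addresses are made up:

# Create the bridge and give the host its gateway address.
ip link add br0 type bridge
ip addr add 172.16.0.1/24 dev br0
ip link set br0 up

# For each container: a veth pair, with one end moved into the container's
# network namespace (here "$ns").
ip link add veth-ctr type veth peer name eth0 netns "$ns"
ip link set veth-ctr master br0 up

# Masquerade traffic from the container subnet as it leaves via the host
# uplink (eth0 on the host), i.e. NAT towards the internet.
nft add table ip container-nat
nft add chain ip container-nat post '{ type nat hook postrouting priority srcnat; }'
nft add rule ip container-nat post ip saddr 172.16.0.0/24 oifname eth0 masquerade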

Kubernetes wants you to connect these container subnets across multiple machines. In this article I assume there is a private network connecting all nodes together:

[Diagram: two machines, each with eth0 (internet), eth1 (private network), a bridge, and containers on veth pairs; the eth1 interfaces connect to a shared switch.]

In addition to the ‘outward’ link from the host to the internet, the host now has an additional outward link to a network switch that brings hosts together in a private network. We intend to route traffic between container subnets across this private network somehow. Notably, NAT is still only performed on traffic to the internet, and not traffic between containers.

Even if you have a private network like this, you may not be able to simply route traffic from container subnets across it. Cloud providers often restrict the addresses a machine can use on its network interface to what is preconfigured in the cloud resources.

There are a lot of ways to actually connect the subnets together, but I chose
Wireguard because I know it, and because I wanted to test drive the overhead of encrypted links with real applications. It’s potentially an additional layer of security if you’re running this on the network of a cloud provider that otherwise doesn’t encrypt customer traffic on internal networks. (But some may call you paranoid.)

Some alternatives here:

  • Use some other tunneling protocol like GENEVE or VXLAN. Maybe GRE works too?
  • Instead use TLS at the application layer for securing connections, e.g. HTTPS between proxy and backend, TLS to your database, etc.
  • If you control the physical network (or even just layer 2), you can actually connect containers directly to the network using macvlan and even have your existing DHCP server assign addresses.
  • Something like flannel can help you make the whole setup dynamic, if your machines tend to come and go.

Container subnets

First, let’s determine our addressing scheme for all of our containers across machines.

{ config, lib, ... }:
{
  
  
  
  options.kube = {
    
    
    nodeIndex = lib.mkOption { type = lib.types.ints.positive; };
    
    nodeIndex0 = lib.mkOption {
      type = lib.types.ints.unsigned;
      default = config.kube.nodeIndex - 1;
    };
    
    mkNodeCidr6 = lib.mkOption {
      type = with lib.types; functionTo str;
      default = index: "fd88:${toString index}::/32";
    };
    mkNodeCidr4 = lib.mkOption {
      type = with lib.types; functionTo str;
      default = index: "10.88.${toString index}.0/24";
    };
    
    
    mkHostIp6 = lib.mkOption {
      type = with lib.types; functionTo str;
      default = index: "fd88:${toString index}::1";
    };
    mkHostIp4 = lib.mkOption {
      type = with lib.types; functionTo str;
      default = index: "10.88.${toString index}.1";
    };
    
    nodeCidr6 = lib.mkOption {
      type = lib.types.str;
      default = config.kube.mkNodeCidr6 config.kube.nodeIndex;
    };
    nodeCidr4 = lib.mkOption {
      type = lib.types.str;
      default = config.kube.mkNodeCidr4 config.kube.nodeIndex;
    };
    hostIp6 = lib.mkOption {
      type = lib.types.str;
      default = config.kube.mkHostIp6 config.kube.nodeIndex;
    };
    hostIp4 = lib.mkOption {
      type = lib.types.str;
      default = config.kube.mkHostIp4 config.kube.nodeIndex;
    };
    
    
    servicesCidr = lib.mkOption {
      type = lib.types.str;
      default = "10.88.0.0/24";
    };
  };
}

Now each machine needs to assign the node index in per-machine configuration:

{
  kube.nodeIndex = 1;
}
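
To make the scheme concrete: with kube.nodeIndex = 1, the defaults above evaluate to the values below (node 2 gets fd88:2:: and 10.88.2.0/24 in the same way).

nodeCidr6    = "fd88:1::/32"
nodeCidr4    = "10.88.1.0/24"
hostIp6      = "fd88:1::1"
hostIp4      = "10.88.1.1"
servicesCidr = "10.88.0.0/24"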

Now we have everything to configure the bridge interface we’ll connect containers to. Unlike Docker / Podman, we’ll be managing this manually:

{ config, pkgs, ... }:
{
  
  systemd.network.netdevs."10-brkube" = {
    netdevConfig = {
      Kind = "bridge";
      Name = "brkube";
    };
  };
  
  systemd.network.networks."10-brkube" = {
    matchConfig = {
      Name = "brkube";
    };
    networkConfig = {
      
      
      ConfigureWithoutCarrier = true;
      
      LinkLocalAddressing = false;
      
      IPv6AcceptRA = false;
    };
    
    
    
    
    
    
    addresses = [
      {
        Address = "${config.kube.hostIp6}/32";
        DuplicateAddressDetection = "none";
      }
      {
        Address = "${config.kube.hostIp4}/24";
        DuplicateAddressDetection = "none";
      }
    ];
  };
  
  environment.systemPackages = [ pkgs.bridge-utils ];
}
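
Once deployed, the bridge should exist even with nothing attached to it yet; a quick way to confirm is:

networkctl status brkube
ip addr show brkube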

Next we can set up the Wireguard links. For this we need to generate keypairs, and it is at this point that we introduce secrets into the NixOS config. I like to use agenix for this, but there are other choices here, like sops-nix. With agenix, machines decrypt files using their OpenSSH host key.

For simplicity, I’m going to put all keys in a keys/ directory, and add a master key so we can always edit all files locally:

mkdir keys
cd keys/



age-keygen -o master_key

Now create a keys/secrets.nix configuration file for agenix:

let
  
  
  master = "age...";
  
  node1 = "ssh-ed25519 AAA...";
  node2 = "ssh-ed25519 AAA...";
in
{
  
  "wgkube1.key.age".publicKeys = [ master node1 ];
  "wgkube2.key.age".publicKeys = [ master node2 ];
}

Then generate the Wireguard keys and immediately encrypt them:

wg genkey | agenix -i master_key -e wgkube1.key.age
wg genkey | agenix -i master_key -e wgkube2.key.age

Now we can decrypt these files in NixOS configuration:

{ config, ... }:
{
  
  age.secrets."wgkube.key" = {
    file = ./keys + "/wgkube${toString config.kube.nodeIndex}.key.age";
    
    group = "systemd-network";
    mode = "0440";
  };
}

Next I like to use a peers.json as input to generate the Wireguard configuration. That JSON looks like this:

[
  {
    "PublicKey": "pHEYIfgWiJEgnR8zKYGnWlbZbQZ0xb5eEyzVSpzz3BM=",
    "PeerIP": "192.168.0.1"
  },
  {
    "PublicKey": "TPB2lwnWPjjAZ1Pnn5A6sdhGAePztE5VlbQ/RmU89w4=",
    "PeerIP": "192.168.0.2"
  }
]

This array is ordered by node index. You can get the public keys as follows:

agenix -i master_key -d wgkube1.key.age | wg pubkey
agenix -i master_key -d wgkube2.key.age | wg pubkey

The PeerIP fields are local network IPs in this example. These could be IPs on the private network provided by your cloud provider, but because this is Wireguard, you can also safely cross the internet. (Though the internet is not necessarily always fast, reliable, and within your control.)

I use a JSON file like this because I actually generate it using tofu, but to keep things focused, the tofu configuration will not be in scope of this article. There is a neat little Wireguard provider for it, though.

Now we can configure the links in NixOS:

{
  config,
  lib,
  pkgs,
  ...
}:
let
  
  
  inherit (config.kube)
    mkNodeCidr6
    mkNodeCidr4
    nodeIndex0
    wgPort
    peers
    ;
in
{
  options.kube = {
    
    
    wgPort = lib.mkOption {
      type = lib.types.port;
      default = 51820;
    };
    
    peers = lib.mkOption {
      type = with lib.types; listOf attrs;
      default = builtins.fromJSON (builtins.readFile ./keys/peers.json);
    };
  };

  config = {
    
    systemd.network.netdevs."11-wgkube" = {
      netdevConfig = {
        Kind = "wireguard";
        Name = "wgkube";
      };
      wireguardConfig = {
        PrivateKeyFile = config.age.secrets."wgkube.key".path;
        ListenPort = wgPort;
      };
      
      wireguardPeers = lib.pipe peers [
        (lib.imap1 (
          index: entry: {
            PublicKey = entry.PublicKey;
            Endpoint = "${entry.PeerIP}:${toString wgPort}";
            
            
            
            
            AllowedIPs = [
              (mkNodeCidr6 index)
              (mkNodeCidr4 index)
            ];
          }
        ))
        
        
        (lib.ifilter0 (index0: value: index0 != nodeIndex0))
      ];
    };
    
    systemd.network.networks."11-wgkube" = {
      matchConfig = {
        Name = "wgkube";
      };
      networkConfig = {
        
        ConfigureWithoutCarrier = true;
        LinkLocalAddressing = false;
        IPv6AcceptRA = false;
      };
      
      
      
      
      
      
      
      
      routes = lib.pipe peers [
        
        (lib.imap1 (
          index: entry: [
            {
              Destination = mkNodeCidr6 index;
              PreferredSource = config.kube.hostIp6;
            }
            {
              Destination = mkNodeCidr4 index;
              PreferredSource = config.kube.hostIp4;
            }
          ]
        ))
        
        (lib.ifilter0 (index0: value: index0 != nodeIndex0))
        
        lib.flatten
      ];
    };
    
    environment.systemPackages = [ pkgs.wireguard-tools ];
  };
}

Finally, we configure our firewall and NAT rules:

{ config, ... }:
{
  boot.kernel.sysctl = {
    
    "net.ipv4.conf.all.forwarding" = 1;
    "net.ipv6.conf.all.forwarding" = 1;
  };
  networking.firewall.extraInputRules = ''
    # Open the Wireguard port.
    # You probably have to adjust this for your network situation.
    ip saddr 192.168.0.0/24 udp dport ${toString config.kube.wgPort} accept
    # Accept connections to Kubernetes Cluster IPs.
    # These are virtual IPs that every node makes available locally.
    ip daddr ${config.kube.servicesCidr} accept
  '';
  networking.firewall.extraForwardRules = ''
    # Route all container traffic anywhere (internet and internode).
    iifname brkube accept
    # Route Wireguard traffic destined for local containers.
    iifname wgkube ip6 daddr ${config.kube.nodeCidr6} accept
    iifname wgkube ip daddr ${config.kube.nodeCidr4} accept
  '';
  
  
  
  networking.nftables.tables = {
    "kube-nat6" = {
      family = "ip6";
      name = "kube-nat";
      content = ''
        chain post {
          type nat hook postrouting priority srcnat;
          iifname brkube ip6 daddr fd88::/16 accept
          iifname brkube masquerade
        }
      '';
    };
    "kube-nat4" = {
      family = "ip";
      name = "kube-nat";
      content = ''
        chain post {
          type nat hook postrouting priority srcnat;
          iifname brkube ip daddr 10.88.0.0/16 accept
          iifname brkube masquerade
        }
      '';
    };
  };
}

At this point nodes should be able to ping each other across the tunnel on their private IPs (fd88:*::1), but we won’t be able to test the full networking setup until we have some containers running.
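
For example, on node 1 a rough check of the tunnel could look like this (wg show needs root; the addresses follow the scheme above):

# Show the Wireguard interface and its peers.
wg show wgkube

# Ping node 2 on its tunnel-side addresses.
ping -c 3 fd88:2::1
ping -c 3 10.88.2.1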

Hostnames

Kubernetes needs to be configured with a domain name where it will advertise Services in DNS. Many examples use cluster.local, but I think this is a bad idea, because .local is reserved for mDNS. Instead, I’ll be using k8s.internal.

Nodes in Kubernetes register themselves with a name, typically whatever hostname is configured in the OS. However, I’m going to decouple this from the OS hostname and instruct Kubernetes to use k8s.internal everywhere, leaving the OS hostname untouched.

{
  config,
  lib,
  pkgs,
  ...
}:
let
  inherit (config.kube)
    peers
    nodeIndex
    mkHostIp6
    mkHostIp4
    domain
    mkNodeHost
    ;
in
{
  options.kube = {
    
    domain = lib.mkOption {
      type = lib.types.str;
      default = "k8s.internal";
    };
    
    mkNodeHost = lib.mkOption {
      type = with lib.types; functionTo str;
      default = index: "node${toString index}.${domain}";
    };
    
    nodeHost = lib.mkOption {
      type = lib.types.str;
      default = mkNodeHost nodeIndex;
    };
    
    
    allHosts = lib.mkOption {
      type = with lib.types; attrsOf (listOf str);
    };
    
    allHostsFile = lib.mkOption {
      type = lib.types.path;
      default = lib.pipe config.kube.allHosts [
        (lib.mapAttrsToList (ip: hosts: "${ip} ${lib.concatStringsSep " " hosts}\n"))
        lib.concatStrings
        (pkgs.writeText "kubernetes-static-hosts.txt")
      ];
    };
  };

  config = {
    
    
    
    kube.allHosts = lib.pipe peers [
      (lib.imap1 (
        index: entry: {
          ${mkHostIp6 index} = lib.mkBefore [ (mkNodeHost index) ];
          ${mkHostIp4 index} = lib.mkBefore [ (mkNodeHost index) ];
        }
      ))
      (lib.mergeAttrsList)
    ];

    
    networking.hostFiles = [ config.kube.allHostsFile ];
  };
}

kube-apiserver

We’re going to build a multi-node setup, but keep it close to a traditional setup of 1 database server + multiple workers. In this setup, the database server is the ideal place for any kind of centralized processing, so we’ll be running those parts of Kubernetes there as well. Instead of calling it a database server, I’ll call it the ‘primary’ server going forward.

{ config, lib, ... }:
{
  options.kube = {
    
    role = lib.mkOption {
      type = lib.types.str;
      default = if config.kube.nodeIndex == 1 then "primary" else "worker";
    };
    
    primaryIp = lib.mkOption {
      type = lib.types.str;
      default = config.kube.mkHostIp6 1;
    };
  };
}

We’ll add some further variables in kube.api to describe the API endpoint:

{ config, lib, ... }:
{
  options.kube.api = {
    
    
    serviceIp = lib.mkOption {
      type = lib.types.str;
      default = "10.88.0.1";
    };
    
    
    
    port = lib.mkOption {
      type = lib.types.port;
      default = 6443;
    };
    
    
    
    internalHost = lib.mkOption {
      type = lib.types.str;
      default = "api.${config.kube.domain}";
    };
    
    internalUrl = lib.mkOption {
      type = lib.types.str;
      default = "https://${config.kube.api.internalHost}:${toString config.kube.api.port}";
    };
    
    
    
    
    externalHost = lib.mkOption {
      type = lib.types.str;
      default = "test-kube.example.com";
    };
    
    
    externalUrl = lib.mkOption {
      type = lib.types.str;
      default = "https://${config.kube.api.externalHost}:${toString config.kube.api.port}";
    };
  };

  config = {
    
    kube.allHosts.${config.kube.primaryIp} = [ config.kube.api.internalHost ];
  };
}

The API server uses etcd for storage by default. We’ll be creating a very simple installation here, protected using Unix sockets with limited permissions.

In a production setup, you want to make periodic backups of the data in etcd. You can do this using etcdctl snapshot save, or simply back up the files in /var/lib/etcd/member/snap/db. (The former method can’t be piped into some other command, while the latter method excludes the database WAL file. See etcd disaster recovery.)

{
  config,
  lib,
  pkgs,
  ...
}:


lib.mkIf (config.kube.role == "primary") {

  
  users.groups.etcd = { };
  users.users.etcd = {
    isSystemUser = true;
    group = "etcd";
  };

  
  systemd.services.etcd = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "notify";
      User = "etcd";
      ExecStart =
        "${pkgs.etcd}/bin/etcd"
        + " --data-dir /var/lib/etcd"
        
        
        + " --auto-compaction-retention=8h"
        
        
        + " --listen-peer-urls unix:/run/etcd/peer"
        + " --listen-client-urls unix:/run/etcd/grpc"
        + " --listen-client-http-urls unix:/run/etcd/http"
        
        + " --advertise-client-urls http://localhost:2379";
      Restart = "on-failure";
      RestartSec = 10;
      
      StateDirectory = "etcd";
      StateDirectoryMode = "0700";
      
      RuntimeDirectory = "etcd";
      RuntimeDirectoryMode = "0750";
    };
    postStart = ''
      # Need to make sockets group-writable to allow connections.
      chmod 0660 /run/etcd/{grpc,http}
    '';
  };

  
  environment.systemPackages = [ pkgs.etcd ];

}
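
As a sketch of the periodic backup mentioned above, something like the following could snapshot etcd daily. Note that the etcd-backup unit name, the snapshot path, and the unix:// endpoint form for etcdctl are my own assumptions; verify them (and add retention/rotation) before relying on this.

{ pkgs, ... }:
{
  # Hypothetical daily snapshot on the primary node.
  systemd.services.etcd-backup = {
    path = [ pkgs.etcd ];
    serviceConfig = {
      Type = "oneshot";
      User = "etcd";
      # Snapshots end up in /var/lib/etcd-backup.
      StateDirectory = "etcd-backup";
    };
    script = ''
      etcdctl --endpoints unix:///run/etcd/grpc \
        snapshot save "/var/lib/etcd-backup/snapshot-$(date +%F).db"
    '';
  };
  systemd.timers.etcd-backup = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnCalendar = "daily";
  };
}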

Now we are almost ready to start the API server! First we need to put some secrets in place for it.

You’ll want an EncryptionConfiguration to tell Kubernetes how to encrypt Secret resources on disk. I recommend using a configuration with just
secretbox to start:


# In keys/secrets.nix:
"EncryptionConfiguration.yaml.age".publicKeys = [ master node1 ];

# Create the encrypted file (this opens an editor):
agenix -i master_key -e EncryptionConfiguration.yaml.age

# Contents of EncryptionConfiguration.yaml:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  
  - resources:
      - secrets
    providers:
      - secretbox:
          keys:
            - name: key1
              
              # 32-byte key, base64-encoded, e.g. from: head -c 32 /dev/urandom | base64
              secret: ""

Next we need credentials for API server authentication. There are a bunch of methods available for this, but we’ll be using the ‘static token file’ method, handing a CSV file to the API server. A major downside is that the API server can’t reload this file at runtime, so changing any of these tokens (such as when adding nodes) requires an API server restart.
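
For reference, each line of a static token file is token,user,uid, optionally followed by a quoted, comma-separated list of groups. The entries created below therefore end up looking roughly like this (token values shortened):

Kq3x...,root,root,system:masters
Hb7w...,system:kube-scheduler,system:kube-scheduler
Zp9t...,system:node:node1.k8s.internal,system:node:node1.k8s.internal,system:nodes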

We’re going to create a root user in the API with full admin access.


# In keys/secrets.nix:
"kube_token_root.age".publicKeys = [ master node1 ];

# Generate a random token and encrypt it:
pwgen -s 64 | agenix -i master_key -e kube_token_root.age

Nodes also need tokens to register themselves in the API, and I’m going to use a dirty trick here: reuse the Wireguard private keys as tokens. This means the API server has access to all Wireguard private keys, but I figure compromise of the API server means you can execute arbitrary code on any node anyway. If you’re more concerned, you could just generate separate tokens instead. In any case, to reuse the Wireguard keys, the primary node needs access:


# In keys/secrets.nix, give node1 (the primary) access to all Wireguard keys:
"wgkube1.key.age".publicKeys = [ master node1 ];
"wgkube2.key.age".publicKeys = [ master node1 node2 ];

We also need some tokens for Kubernetes components that run alongside the API server on the primary node. I’m going to use the kube_token_system_ prefix for these, followed by the service name. That naming convention allows us to iterate files later.


# In keys/secrets.nix:
"kube_token_system_kube-controller-manager.age".publicKeys = [ master node1 ];
"kube_token_system_kube-scheduler.age".publicKeys = [ master node1 ];

# Generate and encrypt the tokens:
for uid in kube-controller-manager kube-scheduler; do
  pwgen -s 64 | agenix -i master_key -e "kube_token_system_${uid}.age"
done

To connect these components to the API server, we provide a tool to help generate a kubeconfig file:

{
  config,
  lib,
  pkgs,
  ...
}:
{
  options.kube = {
    
    
    mkkubeconfig = lib.mkOption {
      type = lib.types.package;
      default = pkgs.writeShellApplication {
        name = "mkkubeconfig";
        runtimeInputs = [ pkgs.kubectl ];
        text = ''
          if [[ $# -ne 1 ]]; then
            echo >&2 'Usage: mkkubeconfig <token-file>'
            exit 64
          fi

          # NOTE: The API server uses self-signed certificates. In this
          # testing setup we instead rely on the Wireguard tunnel for security.
          kubectl config set-cluster local --server '${config.kube.api.internalUrl}' --insecure-skip-tls-verify=true
          kubectl config set users.default.token "$(<"$1")"
          kubectl config set-context local --cluster=local --user=default
          kubectl config use-context local
        '';
      };
    };
  };
}

We can finally slap together a NixOS module to start the API server. This is probably the most complex piece of Nix machinery in the setup.

{
  config,
  lib,
  pkgs,
  ...
}:

let

  package = lib.getBin pkgs.kubernetes;
  apiPortStr = toString config.kube.api.port;

  
  
  
  keysDirListing = builtins.readDir ./keys;
  ageSecrets = lib.mergeAttrsList [
    
    { "EncryptionConfiguration.yaml".file = ./keys/EncryptionConfiguration.yaml.age; }
    
    (lib.pipe keysDirListing [
      (lib.filterAttrs (name: type: lib.hasPrefix "kube_token_" name))
      (lib.mapAttrs' (
        name: type: {
          name = lib.removeSuffix ".age" name;
          value.file = ./keys + "/${name}";
        }
      ))
    ])
    
    (lib.pipe keysDirListing [
      (lib.filterAttrs (name: type: lib.hasPrefix "wgkube" name))
      (lib.mapAttrs' (
        name: type: {
          name = "kube_token_node" + (lib.removePrefix "wgkube" (lib.removeSuffix ".key.age" name));
          value.file = ./keys + "/${name}";
        }
      ))
    ])
  ];

in


lib.mkIf (config.kube.role == "primary") {

  age.secrets = ageSecrets;

  
  users.groups.kube-apiserver = { };
  users.users.kube-apiserver = {
    isSystemUser = true;
    group = "kube-apiserver";
    extraGroups = [ "etcd" ];
  };

  
  networking.firewall.extraInputRules = ''
    tcp dport ${apiPortStr} accept
  '';

  systemd.services.kube-apiserver = {
    wantedBy = [ "multi-user.target" ];
    after = [ "etcd.service" ];
    serviceConfig = {
      Type = "notify";
      ExecStart =
        "${package}/bin/kube-apiserver"
        # Connect to etcd over its Unix socket.
        + " --etcd-servers='unix:/run/etcd/grpc'"
        # Serve HTTPS on the API port using the self-signed certificate
        # generated in preStart below.
        + " --secure-port=${apiPortStr}"
        + " --tls-private-key-file='/var/lib/kube-apiserver/apiserver.key'"
        + " --tls-cert-file='/var/lib/kube-apiserver/apiserver.crt'"
        # Authentication via the static token file, authorization via RBAC
        # plus the Node authorizer.
        + " --anonymous-auth=false"
        + " --token-auth-file='/var/lib/kube-apiserver/tokens.csv'"
        + " --authorization-mode='RBAC,Node'"
        # Range from which Service cluster IPs are allocated.
        + " --service-cluster-ip-range='${config.kube.servicesCidr}'"
        # Address advertised to other cluster members.
        + " --advertise-address='${config.kube.hostIp4}'"
        # Hostname used in externally reachable URLs.
        + " --external-hostname='${config.kube.api.externalHost}'"
        # Issue and validate service account tokens with a local key.
        + " --service-account-issuer='${config.kube.api.externalUrl}'"
        + " --api-audiences='api,${config.kube.api.externalUrl}'"
        + " --service-account-key-file='/var/lib/kube-apiserver/issuer.key'"
        + " --service-account-signing-key-file='/var/lib/kube-apiserver/issuer.key'"
        # Encrypt Secrets at rest; %d expands to the credentials directory.
        + " --encryption-provider-config='%d/EncryptionConfiguration.yaml'";
      User = "kube-apiserver";
      Restart = "on-failure";
      RestartSec = 10;
      
      StateDirectory = "kube-apiserver";
      
      LoadCredential = map (name: "${name}:/run/agenix/${name}") (lib.attrNames ageSecrets);
      
      PrivateTmp = true;
    };
    preStart = ''
      openssl=${lib.getExe pkgs.openssl}
      cd /var/lib/kube-apiserver

      # Ensure a tokens file is present, or create an empty one.
      [[ -e tokens.csv ]] || touch tokens.csv
      chmod 0600 tokens.csv

      # Ensure the token for the root user is present.
      file="$CREDENTIALS_DIRECTORY/kube_token_root"
      if ! grep -q ",root," tokens.csv; then
        echo "$(<"$file"),root,root,system:masters" >> tokens.csv
      fi

      # Ensure tokens for system users are present.
      for file in $CREDENTIALS_DIRECTORY/kube_token_system_*; do
        filename="$(basename "$file")"
        uid="''${filename#kube_token_system_}"
        if ! grep -q ",system:$uid," tokens.csv; then
          echo "$(<"$file"),system:$uid,system:$uid" >> tokens.csv
        fi
      done

      # Ensure tokens for nodes are present.
      for file in $CREDENTIALS_DIRECTORY/kube_token_node*; do
        filename="$(basename "$file")"
        uid="''${filename#kube_token_}.${config.kube.domain}"
        if ! grep -q ",system:node:$uid," tokens.csv; then
          echo "$(<"$file"),system:node:$uid,system:node:$uid,system:nodes" >> tokens.csv
        fi
      done

      # Ensure a private key for HTTPS exists.
      [[ -e apiserver.key ]] || $openssl ecparam -out apiserver.key -name secp256r1 -genkey
      chmod 0600 apiserver.key

      # Generate a new self-signed certificate on every startup.
      # Assume services are restarted somewhere in this timeframe so that we
      # never have an expired certificate.
      $openssl req -new -x509 -nodes -days 3650 \
        -subj '/CN=${config.kube.api.externalHost}' \
        -addext 'subjectAltName=${
          lib.concatStringsSep "," [
            "DNS:${config.kube.api.externalHost}"
            "DNS:${config.kube.api.internalHost}"
            "IP:${config.kube.api.serviceIp}"
          ]
        }' \
        -key apiserver.key \
        -out apiserver.crt

      # Ensure a private key exists for issuing service account tokens.
      [[ -e issuer.key ]] || $openssl ecparam -out issuer.key -name secp256r1 -genkey
      chmod 0600 issuer.key
    '';
    postStart = ''
      # Wait for the API server port to become available.
      # The API server doesn't support sd_notify, so we do this instead to
      # properly signal any dependent services that the API server is ready.
      export KUBECONFIG=/tmp/kubeconfig
      ${lib.getExe config.kube.mkkubeconfig} "$CREDENTIALS_DIRECTORY/kube_token_root"
      tries=60
      while ! ${package}/bin/kubectl get namespaces default >& /dev/null; do
        if [[ $((--tries)) -eq 0 ]]; then
          echo ">> Timeout waiting for the API server to start"
          exit 1
        fi
        sleep 1
      done
      rm $KUBECONFIG
    '';
  };

}

We set up a kubeconfig for root on the primary node using the root API user. This allows using kubectl from the shell for easy administration:

{
  config,
  lib,
  pkgs,
  ...
}:


lib.mkIf (config.kube.role == "primary") {

  
  system.activationScripts.kubeconfig-root = ''
    HOME=/root ${lib.getExe config.kube.mkkubeconfig} "/run/agenix/kube_token_root"
  '';

  environment.systemPackages = [ pkgs.kubectl ];

}
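
With that in place, a quick smoke test from a root shell on the primary node could look like this (any read-only request will do):

kubectl get --raw /readyz
kubectl get namespaces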

And we also make node credentials available on each node, which will be used by services later:

{ lib, config, ... }:
{

  
  
  systemd.services.generate-kubeconfig-node = {
    wantedBy = [ "multi-user.target" ];
    environment.KUBECONFIG = "/run/kubeconfig-node";
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${lib.getExe config.kube.mkkubeconfig} /run/agenix/wgkube.key";
    };
  };

}

Add-ons

It’s useful to have a way to load some YAML into the API server on startup. I use the term add-ons because I’ve seen it used for some now-deprecated Kubernetes functionality that served a similar purpose, though the term has been overloaded in various ways.

{
  config,
  lib,
  pkgs,
  ...
}:
let
  cfg = config.kube;
in
{
  options.kube = {
    
    activationScript = lib.mkOption {
      type = lib.types.lines;
      default = "";
    };
    
    addons = lib.mkOption {
      type = lib.types.listOf lib.types.path;
      default = [ ];
    };
  };

  config = {

    assertions = [
      {
        assertion = cfg.activationScript != "" -> cfg.role == "primary";
        message = "kube.activationScript and kube.addons can only be used on the primary node";
      }
    ];

    
    
    systemd.services.kube-activation = lib.mkIf (cfg.activationScript != "") {
      wantedBy = [ "multi-user.target" ];
      bindsTo = [ "kube-apiserver.service" ];
      after = [ "kube-apiserver.service" ];
      path = [ pkgs.kubectl ];
      
      environment.KUBECONFIG = "/root/.kube/config";
      serviceConfig = {
        Type = "oneshot";
        RemainAfterExit = true;
      };
      script = cfg.activationScript;
    };

    
    kube.activationScript = lib.mkIf (cfg.addons != [ ]) ''
      for file in ${lib.escapeShellArgs (pkgs.copyPathsToStore cfg.addons)}; do
        echo >&2 "# $file"
        kubectl apply --server-side --force-conflicts -f "$file"
      done
    '';

  };
}
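
Other modules can then hook into these options. As a hypothetical example (my-app and its manifest are made up), a module could ship a manifest and an extra imperative step like this:

{ config, lib, ... }:
{
  # Apply a manifest at startup.
  kube.addons = lib.mkIf (config.kube.role == "primary") [
    ./addons/my-app.yaml
  ];

  # Extra imperative steps, appended to the activation script.
  kube.activationScript = lib.mkIf (config.kube.role == "primary") ''
    kubectl create namespace my-app --dry-run=client -o yaml | kubectl apply -f -
  '';
}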

kube-scheduler

Next we need to run kube-scheduler to actually schedule pods:

{
  config,
  lib,
  pkgs,
  ...
}:


lib.mkIf (config.kube.role == "primary") {

  systemd.services.kube-scheduler = {
    wantedBy = [ "multi-user.target" ];
    requires = [ "kube-apiserver.service" ];
    after = [ "kube-apiserver.service" ];
    serviceConfig = {
      ExecStart =
        "${pkgs.kubernetes}/bin/kube-scheduler"
        
        + " --kubeconfig='/tmp/kubeconfig'"
        
        + " --secure-port=0";
      Restart = "on-failure";
      RestartSec = 10;
      
      DynamicUser = true;
      
      PrivateTmp = true;
      LoadCredential = "kube-token:/run/agenix/kube_token_system_kube-scheduler";
    };
    preStart = ''
      # Generate a kubeconfig for the scheduler. Relies on PrivateTmp.
      KUBECONFIG=/tmp/kubeconfig ${lib.getExe config.kube.mkkubeconfig} "$CREDENTIALS_DIRECTORY/kube-token"
    '';
  };

}

kube-controller-manager

Similarly, we need to run kube-controller-manager, which contains all the standard Kubernetes controllers:

{
  config,
  lib,
  pkgs,
  ...
}:


lib.mkIf (config.kube.role == "primary") {

  systemd.services.kube-controller-manager = {
    wantedBy = [ "multi-user.target" ];
    
    
    
    bindsTo = [ "kube-apiserver.service" ];
    after = [ "kube-apiserver.service" ];
    serviceConfig = {
      ExecStart =
        "${pkgs.kubernetes}/bin/kube-controller-manager"
        # Kubeconfig generated in preStart below.
        + " --kubeconfig='/tmp/kubeconfig'"
        # Disable the HTTPS metrics/healthz endpoint.
        + " --secure-port=0"
        # Let each controller use its own service account instead of the
        # controller manager's credentials.
        + " --use-service-account-credentials=true"
        # CA bundle included in service account token Secrets; we only have
        # the self-signed API server certificate.
        + " --root-ca-file='/var/lib/kube-apiserver/apiserver.crt'";
      Restart = "on-failure";
      RestartSec = 10;
      
      DynamicUser = true;
      
      PrivateTmp = true;
      LoadCredential = "kube-token:/run/agenix/kube_token_system_kube-controller-manager";
    };
    preStart = ''
      # Generate a kubeconfig for the controller manager. Relies on PrivateTmp.
      KUBECONFIG=/tmp/kubeconfig ${lib.getExe config.kube.mkkubeconfig} "$CREDENTIALS_DIRECTORY/kube-token"
    '';
  };

}

CoreDNS

We need to provide DNS resolution based on Services in the Kubernetes API.

Many deployments run CoreDNS inside Kubernetes, but there’s really no standard for how you implement DNS resolution, and different deployments have different needs. All you need is something that serves DNS based on Services fetched from the Kubernetes API.

Here we set up CoreDNS, but not inside Kubernetes; instead it is managed by NixOS. We run an instance on every node for simplicity.

{
  config,
  lib,
  pkgs,
  ...
}:
{

  services.coredns = {
    enable = true;
    config = ''
      . {
        bind ${config.kube.hostIp6}
        errors

        # Resolve Kubernetes hosts.
        hosts ${config.kube.allHostsFile} ${config.kube.domain} {
          reload 0
          fallthrough
        }

        # Resolve Kubernetes services.
        kubernetes ${config.kube.domain} {
          kubeconfig {$CREDENTIALS_DIRECTORY}/kubeconfig-node
          ttl 30
          # NOTE: No fallthrough, to prevent a loop with systemd-resolved.
        }

        # Forward everything else to systemd-resolved.
        forward . 127.0.0.53 {
          max_concurrent 1000
        }

        cache 30
        loadbalance
      }
    '';
  };

  
  systemd.services.coredns = {
    requires = [ "generate-kubeconfig-node.service" ];
    after = [
      "generate-kubeconfig-node.service"
      "kube-activation.service"
    ];
    serviceConfig.LoadCredential = "kubeconfig-node:/run/kubeconfig-node";
  };

  
  # Tell systemd-resolved to send queries for the cluster domain to CoreDNS.
  environment.etc."systemd/dns-delegate.d/kubernetes.dns-delegate".text = ''
    [Delegate]
    Domains=${config.kube.domain}
    DNS=${config.kube.hostIp6}
  '';

  
  networking.firewall.extraInputRules = ''
    ip6 saddr ${config.kube.nodeCidr6} udp dport 53 accept
    ip6 saddr ${config.kube.nodeCidr6} tcp dport 53 accept
  '';

  
  kube.addons = lib.mkIf (config.kube.role == "primary") [
    ./addons/coredns.yaml
  ];

  
  environment.systemPackages = [ pkgs.dig ];

}

The referenced add-on file addons/coredns.yaml creates the permissions needed for CoreDNS to access the Kubernetes API:






---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:coredns
  namespace: kube-system
rules:
  - apiGroups:
      - ""
    resources:
      - endpoints
      - services
      - pods
      - namespaces
    verbs:
      - list
      - watch
  - apiGroups:
      - discovery.k8s.io
    resources:
      - endpointslices
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:coredns
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:coredns
subjects:
  - kind: Group
    name: system:nodes
    apiGroup: rbac.authorization.k8s.io
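
Once CoreDNS is running on a node, you can check both the static host entries and Service resolution with dig, for example on node 1 (addresses follow the earlier scheme):

# Static host entries from the hosts file.
dig +short @fd88:1::1 AAAA api.k8s.internal

# The built-in 'kubernetes' Service, which gets the first IP of the
# services CIDR (10.88.0.1).
dig +short @fd88:1::1 A kubernetes.default.svc.k8s.internal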

kube-proxy

Kube-proxy is what implements cluster IPs assigned to Service resources in the API. It generates firewall rules to NAT cluster IPs to destination pods. It needs to run on every node.

(NOTE: If you decide not to run kubelet on your control plane / primary node, you still need to run kube-proxy! The API server may sometimes contact Services via their Cluster IP too.)

{
  lib,
  config,
  pkgs,
  ...
}:
{

  systemd.services.kube-proxy = {
    wantedBy = [ "multi-user.target" ];
    requires = [ "generate-kubeconfig-node.service" ];
    after = [
      "generate-kubeconfig-node.service"
      "kube-activation.service"
    ];
    path = [ pkgs.nftables ];
    serviceConfig = {
      ExecStart =
        "${lib.getBin pkgs.kubernetes}/bin/kube-proxy"
        # Authenticate with the node credentials; the override must match our node name.
        + " --kubeconfig='/run/kubeconfig-node'"
        + " --hostname-override='${config.kube.nodeHost}'"
        # Use the nftables backend rather than iptables.
        + " --proxy-mode=nftables"
        # Traffic from the bridge counts as local pod traffic.
        + " --detect-local-mode=BridgeInterface"
        + " --pod-bridge-interface=brkube"
        # Only expose NodePorts on the tunnel-side addresses.
        + " --nodeport-addresses='${config.kube.hostIp6}/128,${config.kube.hostIp4}/32'"
        # Keep the health and metrics endpoints on loopback.
        + " --healthz-bind-address=[::1]:10256"
        + " --metrics-bind-address=[::1]:10249";
      Restart = "on-failure";
      RestartSec = 10;
    };
  };

  
  kube.addons = lib.mkIf (config.kube.role == "primary") [
    ./addons/kube-proxy.yaml
  ];

}

The referenced add-on file addons/kube-proxy.yaml is again necessary to setup permissions in the Kubernetes API:



---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:kube-proxy
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-proxier
subjects:
  - kind: Group
    name: system:nodes
    apiGroup: rbac.authorization.k8s.io
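
Once kube-proxy is running, a minimal sanity check on a node could look like this (curl is not installed by default, so that part assumes you have it available):

# kube-proxy in nftables mode manages its own tables; they should appear here.
nft list tables

# Its health endpoint is bound to loopback.
curl http://[::1]:10256/healthz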

kubelet

Kubelet is the meat of the system: it starts containers on a node when a pod is assigned to that node. Here we also do the work to set up the cri-o container runtime and the CNI configuration that tells it how containers get their network.

You technically only need kubelet on machines that run workloads. We simply start it everywhere, including our primary node, but mark the primary as non-schedulable to demonstrate registerWithTaints.

{
  lib,
  config,
  pkgs,
  ...
}:

let

  yaml = pkgs.formats.yaml { };

  kubeletConfig = yaml.generate "kubelet.conf" {
    apiVersion = "kubelet.config.k8s.io/v1beta1";
    kind = "KubeletConfiguration";
    # Listen only on the tunnel-side address. Authentication is left open
    # here; access to the kubelet port is restricted by the firewall instead.
    address = config.kube.hostIp6;
    authentication.anonymous.enabled = true;
    authorization.mode = "AlwaysAllow";
    # Disable the separate healthz endpoint.
    healthzPort = 0;
    # Talk to cri-o.
    containerRuntimeEndpoint = "unix:///var/run/crio/crio.sock";
    # Allow running on a machine with swap enabled.
    failSwapOn = false;
    memorySwap.swapBehavior = "LimitedSwap";
    # DNS settings handed to pods, pointing at the node-local CoreDNS.
    clusterDomain = config.kube.domain;
    clusterDNS = [ config.kube.hostIp6 ];
    # Keep regular workloads off the primary node.
    registerWithTaints = lib.optional (config.kube.role == "primary") {
      key = "role";
      value = config.kube.role;
      effect = "NoSchedule";
    };
  };

in
{

  virtualisation.cri-o = {
    enable = true;
    extraPackages = [ pkgs.nftables ];
    settings.crio.runtime.log_to_journald = true;
  };

  systemd.services.kubelet = {
    wantedBy = [ "multi-user.target" ];
    requires = [
      "generate-kubeconfig-node.service"
      "crio.service"
    ];
    after = [
      "generate-kubeconfig-node.service"
      "crio.service"
      "kube-activation.service"
    ];
    path = [ pkgs.util-linux ];
    serviceConfig = {
      Type = "notify";
      ExecStart =
        "${lib.getBin pkgs.kubernetes}/bin/kubelet"
        # Authenticate with the node credentials.
        + " --kubeconfig='/run/kubeconfig-node'"
        # Register under our Kubernetes node name instead of the OS hostname.
        + " --hostname-override='${config.kube.nodeHost}'"
        # Advertise the tunnel-side address for this node.
        + " --node-ip='${config.kube.hostIp6}'"
        # Label the node with its role so workloads can select on it.
        + " --node-labels='role=${config.kube.role}'"
        # The KubeletConfiguration generated above.
        + " --config='${kubeletConfig}'";
      Restart = "on-failure";
      RestartSec = 10;
      StateDirectory = "kubelet";
    };
  };

  
  
  # Override cri-o's default CNI config so containers attach to our brkube
  # bridge and get addresses from this node's subnets.
  environment.etc."cni/net.d/10-crio-bridge.conflist".text = lib.mkForce (
    builtins.toJSON {
      cniVersion = "1.0.0";
      name = "brkube";
      plugins = [
        {
          type = "bridge";
          bridge = "brkube";
          isGateway = true;
          ipam = {
            type = "host-local";
            ranges = [
              [ { subnet = config.kube.nodeCidr6; } ]
              [ { subnet = config.kube.nodeCidr4; } ]
            ];
            routes = [
              { dst = "::/0"; }
              { dst = "0.0.0.0/0"; }
            ];
          };
        }
      ];
    }
  );

  
  
  networking.firewall.extraInputRules = ''
    ip6 saddr ${config.kube.primaryIp} tcp dport 10250 accept
    tcp dport 10250 reject
  '';

}

Testing

The setup should now be fully functional! If you log in as root on the primary node, you can use kubectl, for example to list nodes:


NAME                 STATUS   ROLES    AGE   VERSION
node1.k8s.internal   Ready    <none>   19s   v1.34.1
node2.k8s.internal   Ready    <none>   12s   v1.34.1

With node2 in the listing, we know connectivity works from the kubelet to the API server. Starting a container with an interactive session also tests the opposite direction, and from inside the container we can test connectivity to the internet:
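
The exact commands aren’t shown here, so treat the following as just one way to run such a test (the image and flags are my own choice):

# Run a throwaway busybox pod with an interactive shell.
kubectl run -it --rm nettest --image=busybox --restart=Never -- sh

# Then, inside the container:
wget -O - https://example.com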


Connecting to example.com (23.220.75.245:443)
writing to stdout

What next?

While this setup has all the essentials for workloads, a bunch of stuff is missing to make it more broadly useful.

An Ingress / Gateway controller helps route traffic to containers. The go-to used to be nginx-ingress, but nginx-ingress is going the way of the dodo. I had some fun hacking on caddy-ingress, but that’s still experimental. There’s a list of Gateway controllers and a list of Ingress controllers if you want to explore.

A storage provisioner can help with data persistence. The modern solution for this is CSI drivers. There are ready-made drivers for NFS and SMB shares, which are really useful if you’re coming from a setup where applications share some NFS directories hosted on the primary node. But storage for databases is ideally block storage, which is a bit more work.

Speaking of databases, the nice thing about this setup is that you can simply run services outside Kubernetes, so you can just start a database using regular NixOS config on the primary node for example. I had some fun writing my own controller that allows managing MySQL databases with custom Kubernetes resources: external-mysql-operator. Again, very experimental.

Takeaways

Would I take this into production? Not anytime soon, because I feel like there are a whole bunch of failure modes I’ve not yet seen. My testing has been limited to QEMU VMs and some AWS EC2 instances.

Especially on VMs, which are typically quite small compared to dedicated servers, Kubernetes itself uses up a chunk of memory and CPU just sitting there.

With the traction Kubernetes has, it does feel like there must be many small installations out there. And if that’s the case, it seems to me that Kubernetes could easily reduce some complexity for that type of installation.

For example, do you really need etcd and API server redundancy? It seems upstream SQLite support in combination with Litestream backups would be far more beneficial for smaller installations, when you’re happy to deal with some Kubernetes API downtime during upgrades or incidents.

Another easy win (in my opinion) would be runtime reloading of the token auth file. It would instantly make it a more viable option beyond testing. Though with a bit of extra work it can also be accomplished using the webhook or reverse proxy mechanisms supported by Kubernetes.

Overall, though, it feels like Kubernetes itself is maybe only half the complexity, with the other half going to network configuration.


