Owning a $5M data center

These days it seems like you need a trillion fake dollars or lunch with politicians to get your own data center. They can help, but they are not needed. At Comma we have been running our own data center for years. All our model training, metrics and data reside in our own data center in our own office. Having your own data center is great, and in this blog post I’ll explain how ours works, so you can be inspired to have your own.

side view
our data center

Why not the cloud?

If your business depends on compute, and you run that compute in the cloud, you’re relying heavily on your cloud provider. Cloud companies generally make onboarding too easy and offboarding too difficult. If you are not careful you will sleepwalk into a high-cost situation with no way out. If you want to control your own destiny, you have to run your own compute.

Self-reliance is great, but running your own compute has other benefits, too. It inspires good engineering. Data center maintenance is about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems; a data center requires knowledge of watts, bits and FLOPs. I know which I’d rather think about.

Avoiding the cloud for ML also creates better incentives for engineers. Engineers generally want to improve things. In ML, many problems can be solved simply by throwing more compute at them. In the cloud, that fix is never more than a budget increase away, which locks you into inefficient and expensive solutions. When you only have the compute you own, the quickest fix is usually to speed up your code or address the fundamental problem.

Finally, there is cost: owning a data center can be far cheaper than renting in the cloud, especially if your compute and storage requirements are fairly consistent, which is true if you’re in the business of training or running models. In comma’s case, I estimate we spent ~$5 million on our data center; had we done the same thing in the cloud, we would have spent over $25 million.

What do you actually need?

Our data center is very simple. It was built and is maintained by only a few engineers and technicians. Your needs may differ slightly, but our setup should provide a useful reference.

Power

You need power to run servers. We currently use a maximum of 450kW. Operating a data center presents you with a lot of interesting engineering challenges, but buying electricity is not one of them. Electricity in San Diego costs over 40c/kWh, about 3 times the global average; that is a rip-off, priced that high only because of political dysfunction. We spent $540,112 on electricity in 2025, which is a large portion of our data center costs. Hopefully in a future blog post I can explain how we generate our own electricity, and why you should do the same.
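
For a rough sense of scale, here is the back-of-envelope implied by those numbers, assuming the whole bill is energy billed at the quoted ~40c/kWh (ignoring demand charges, rate tiers and fees):

# Back-of-envelope from the figures above; assumes the entire bill is energy
# billed at ~$0.40/kWh (ignores demand charges, rate tiers, and fees).
annual_bill = 540_112            # USD spent on electricity in 2025
rate = 0.40                      # USD per kWh in San Diego
kwh = annual_bill / rate         # ~1.35 million kWh for the year
avg_kw = kwh / 8760              # ~154 kW average draw
print(f"{kwh:,.0f} kWh/year, ~{avg_kw:.0f} kW average vs the 450 kW peak")

The average draw comes out to roughly a third of the 450kW peak.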

power
data center power usage

Cooling

Data centers require cool, dry air. Usually this is achieved with CRAC systems, but those are power hungry. San Diego has a mild climate, so we opted to cool with outside air directly. This gives us less control over temperature and humidity, but it only uses a couple dozen kilowatts. We have dual 48” intake fans and dual 48” exhaust fans to keep the air cool, and recirculating fans that mix hot exhaust air back into the intake air to keep humidity low (<45%). A server connected to multiple sensors runs a PID loop that controls the fans to keep temperature and humidity in range.
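
For illustration only, a toy version of such a loop is sketched below; the sensor and fan interfaces and the gains are made up, and the real controller also manages humidity and the recirculating fans.

import random
import time

def read_intake_temp() -> float:
    # Placeholder for reading a real temperature sensor.
    return 25.0 + random.uniform(-2.0, 2.0)

def set_fan_duty(pct: float) -> None:
    # Placeholder for commanding the intake/exhaust fans.
    print(f"fan duty -> {pct:.0f}%")

KP, KI, KD = 4.0, 0.1, 1.0   # made-up gains
TARGET_C = 27.0              # made-up intake temperature target

integral = prev_err = 0.0
while True:
    err = read_intake_temp() - TARGET_C
    integral += err
    duty = KP * err + KI * integral + KD * (err - prev_err)
    set_fan_duty(min(100.0, max(0.0, duty)))  # clamp to 0-100%
    prev_err = err
    time.sleep(5)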

air cooling
Filtered intake fan on the right, 2 recirculating fans on top

Servers

The majority of our current compute is in 75 tinybox pro machines, with 600 GPUs in total. They were built in-house, which saved us money and ensured they met our needs. Our self-built machines fail at about the same rate as the pre-built machines we buy, but we can fix them quickly ourselves. Each has 2 CPUs and 8 GPUs, and they serve as both training machines and general compute workers.

breakers
Breaker panel for all the computers, that’s a lot of breakers!

We have a few racks of Dell machines (R630 and R730) for data storage. They are packed with SSDs, for a total of ~4PB of storage. We use SSDs for reliability and speed. There is no redundancy in our main storage arrays, and each node must be able to saturate its network bandwidth with random-access reads. For the storage machines this means up to 20Gbps of reads per 80TB volume.

In addition to the storage and compute machines, we have a number of one-off machines that run services. This includes a router, a climate controller, a data ingestion machine, a storage master server, a metrics server, a Redis server, and a few more.

A network needs switches, but at this scale we don’t need to bother with complex switch topologies. We have 3 interconnected 100Gbps Z9264F switches, which form the main Ethernet network. We have two more InfiniBand switches that interconnect the 2 tinybox pro groups for all-reduce training.

Software

To use all these compute and storage machines effectively you need some infrastructure. At this scale, services do not need redundancy to achieve 99% uptime. We use a single master for all services, which makes things quite simple.

Install

All servers have Ubuntu installed via PXE boot and are managed by Salt.

Distributed storage: minikeyvalue

All our storage arrays use mkv. The main array is 3PB of non-redundant storage that hosts the driving data we train on. We can read from this array at ~1TB/s, meaning we can train directly on the raw data without caching. Redundancy isn’t needed because no individual piece of data is important.

mkv machines
storage nodes

We have an additional ~300TB non-redundant array to cache intermediate processed results. And finally, we have a redundant mkv array that stores all our trained models and training metrics. Each of these 3 arrays has its own single master server.
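
For illustration, reading and writing a key through an mkv master over its plain HTTP API looks roughly like the sketch below; the master address and key are made up.

# Toy example of the minikeyvalue HTTP API (PUT/GET on /<key>, with the master
# redirecting requests to a volume server). Address and key are hypothetical.
import requests

MKV = "http://mkv-master:3000"

# write a blob; mkv answers with a redirect that requests follows
resp = requests.put(f"{MKV}/example-chunk", data=b"some driving data", allow_redirects=True)
resp.raise_for_status()

# read it back (again via a redirect to the volume server holding the blob)
blob = requests.get(f"{MKV}/example-chunk").content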

Workload Management: Slurm

We use Slurm to manage the compute nodes and compute jobs. We schedule two types of distributed work: PyTorch training jobs and miniray workers.
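
Comma’s actual submission scripts aren’t shown here, but handing a job to Slurm could look something like the sketch below, using standard sbatch flags with a made-up partition and script name.

# Hypothetical illustration of submitting a multi-node GPU job to Slurm.
import subprocess

def submit_training_job(partition: str = "tbox1", nodes: int = 2, script: str = "train.sh"):
    subprocess.run([
        "sbatch",
        f"--partition={partition}",  # which group of machines to run on
        f"--nodes={nodes}",          # how many machines the job spans
        "--gres=gpu:8",              # use all 8 GPUs on each node
        script,
    ], check=True)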

Distributed Training: PyTorch

To train models across multiple GPU nodes we use torch.distributed with FSDP. We have 2 separate training partitions, each interconnected with InfiniBand for training across machines. We’ve written our own training framework that handles the training-loop boilerplate, but it’s mostly just PyTorch.
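
As a sketch of what that looks like, here is a minimal multi-node FSDP training step; the model, data and hyperparameters are placeholders rather than comma’s framework.

# Minimal FSDP sketch. Launch with torchrun, e.g.:
#   torchrun --nnodes=4 --nproc-per-node=8 train_sketch.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")  # NCCL over InfiniBand between nodes
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # placeholder model; parameters, gradients and optimizer state get sharded
    model = FSDP(torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).cuda())
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(1000):
        x = torch.randn(4, 1024, device="cuda")  # would come from the data loader
        loss = model(x).pow(2).mean()             # placeholder loss
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()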

reporter
Reporter, comma's experiment tracking service

We have a custom experiment tracking service (similar to wandb or TensorBoard). It provides a dashboard for tracking experiments and shows custom metrics and reports. It is also the interface to the mkv array that hosts the model weights. A training run stores its model weights under a UUID, and anyone who needs to run a model can download them from there. The metrics and reports for our latest models are also openly available.

Distributed Compute: Miniray

Apart from training, we have many other compute tasks. These range from running tests, running models, and pre-processing data, to running agent rollouts for on-policy training. We’ve written a lightweight open-source task scheduler called miniray that lets you run arbitrary Python code on idle machines. Think of it as a simplified version of Dask, with a focus on extreme simplicity. Slurm schedules any idle machine to become an active miniray worker and accept pending tasks. All task information is hosted in a central Redis server.
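
To illustrate the pattern (this is not miniray’s actual API), a central Redis task queue can be as small as the sketch below, using the redis-py client.

# Toy version of the idea: callers push pickled tasks onto a shared Redis list,
# idle workers pop and execute them. A real scheduler handles results, retries
# and code distribution; this does not.
import pickle
import redis

r = redis.Redis(host="redis-master", port=6379)  # hypothetical central Redis

def submit(fn, *args):
    # serialize a top-level function and its arguments onto the queue
    r.rpush("tasks", pickle.dumps((fn, args)))

def worker_loop():
    # what an idle worker runs: block until a task appears, then execute it
    while True:
        _, payload = r.blpop("tasks")
        fn, args = pickle.loads(payload)
        fn(*args)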

Our main training/compute machines. Note the 400Gbps switch in the center.

Miniray workers with GPUs spin up a Triton inference server to run model inference with dynamic batching. This means a miniray worker can easily and efficiently run any model hosted in the mkv storage array.
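
Querying such a worker could look roughly like the sketch below, using the standard tritonclient HTTP API; the server address, model name and tensor names are placeholders.

# Hypothetical client-side call to a Triton server on a miniray worker.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="worker-07:8000")

x = np.random.rand(1, 3, 256, 512).astype(np.float32)   # dummy input frame
inp = httpclient.InferInput("input", list(x.shape), "FP32")
inp.set_data_from_numpy(x)

# Triton's dynamic batcher groups concurrent requests like this into one GPU batch
result = client.infer(model_name="driving_model", inputs=[inp])
out = result.as_numpy("output")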

Miniray makes it extremely easy to scale parallel tasks to hundreds of machines. For example, the controls challenge record was set with only ~1 hour of access to our data center using miniray.

Code: NFS monorepo

All our code lives in a monorepo that we clone to our workstations. The monorepo is kept small (<3GB) so that it can be copied easily. When a training job or miniray distributed job is started from a workstation, the local monorepo, including all local changes, is cached on a shared NFS drive. Training tasks and miniray tasks are pointed at this cache, so all distributed tasks use the exact codebase you have locally. To ensure all Python packages match as well, uv on the worker/trainer syncs the packages specified in the monorepo before starting any work. Copying your entire local codebase and syncing all packages takes only ~2 seconds, and it prevents version-mismatch issues.
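
A rough sketch of that flow (made-up paths, not comma’s actual scripts):

# Snapshot the local monorepo (including uncommitted changes) to shared NFS,
# then have each worker install the pinned packages with uv before running.
import os
import subprocess
import uuid

def snapshot_repo(repo_dir: str = "~/monorepo", nfs_root: str = "/mnt/nfs/code-cache") -> str:
    dst = os.path.join(nfs_root, uuid.uuid4().hex)
    subprocess.run(
        ["rsync", "-a", "--exclude", ".git", os.path.expanduser(repo_dir) + "/", dst],
        check=True,
    )
    return dst  # workers point their working directory here

def worker_setup(code_dir: str) -> None:
    # install exactly the packages pinned in the repo's lockfile
    subprocess.run(["uv", "sync"], cwd=code_dir, check=True)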

All together now

The most complex work we do at Comma is training driving models on-policy. These training runs require generating training data by running simulated driving rollouts with the most recent model weights during training. Below is a real command we used to train such a model; it uses all the infrastructure described above. While this one small command is all that’s needed to get everything started, it sets a lot of moving parts in motion.

./training/train.sh N=4 partition=tbox2 trainer=mlsimdriving dataset=/home/batman/xx/datasets/lists/train_500k_20250717.txt vision_model=8d4e28c7-7078-4caf-ac7d-d0e41255c3d4/500 data.shuffle_size=125k optim.scheduler=COSINE bs=4
onpol
Diagram of all the infrastructure involved in training the on-policy driving model.

Like this stuff?

Does all this sound exciting? Then build a data center for yourself or your company! You could also come work here.

Harald Schaefer
CTO@comma.ai


