This type of scaling is not just about increasing the number of nodes; it also requires scaling other important dimensions, such as pod creation and scheduling throughput. For example, during this test we sustained a throughput of 1,000 pods per second while storing 1 million objects in our optimized distributed storage. In this blog, we look at the trends driving demand for these mega-clusters, and take a deeper look at the architectural innovations we implemented to make this extreme scalability a reality.
Rise of mega clusters
Our largest customers are actively pushing the limits of GKE’s scalability and performance with their AI workloads. In fact, we already have many customers operating clusters in the 20-65K node range, and we expect demand for larger clusters to stabilize around the 100K node mark.
This sets up an interesting dynamic. In short, we are transitioning from a world constrained by chip supply to a world constrained by electrical power. Consider that a single NVIDIA GB200 GPU requires 2,700W of power. With many thousands of these chips, the power footprint of a single cluster can easily grow to hundreds of megawatts, ideally distributed across multiple data centers. Thus, for AI platforms with more than 100K nodes, we will need robust multi-cluster solutions that can orchestrate distributed training or reinforcement learning across clusters and data centers. This is a significant challenge, and we are actively investing in tools like MultiKueue to address it, with more innovations on the horizon. We are also leveraging our recently announced high-performance RDMA networking and managed DRANET to improve topology awareness and maximize performance for large-scale AI workloads. Stay tuned.
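The power claim above follows from simple arithmetic. A minimal sketch, assuming a hypothetical 100,000-chip cluster (the 2,700W figure comes from the text; the chip count is an illustrative assumption):

```python
# Back-of-the-envelope power estimate for a mega-cluster.
# Assumption (illustrative, not a measured deployment): 100,000 accelerator
# chips, each drawing 2,700 W as cited for an NVIDIA GB200.
chips = 100_000
watts_per_chip = 2_700

total_watts = chips * watts_per_chip
total_megawatts = total_watts / 1_000_000

print(f"{total_megawatts:.0f} MW")  # 270 MW — already "hundreds of megawatts"
```

Even before accounting for CPUs, networking, and cooling overhead, the accelerator draw alone lands in the hundreds-of-megawatts range, which is why the footprint must be spread across data centers.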
At the same time, these investments also benefit users who operate at a more modest scale – the vast majority of GKE customers. By hardening GKE’s core systems for heavy use, we create ample headroom for average clusters, make them more resilient to errors, improve tolerance of Kubernetes API misuse, and generally optimize all controllers for faster performance. And of course, all GKE customers, large and small, benefit from our investment in a seamless, self-service experience.
Major architectural innovations
That said, achieving this level of scale requires significant innovations across the entire Kubernetes ecosystem, including the control plane, custom scheduling, and storage. Let’s take a look at some of the key areas that were important to this project.
Optimized read scalability
When operating at this scale, a strongly consistent and snapshottable API server watch cache is required. At 130,000 nodes, the sheer volume of read requests can overwhelm the central object datastore. To solve this, Kubernetes includes several complementary features that offload these reads from the datastore.
First, the Consistent Reads from Cache feature (KEP-2340), detailed here, enables the API server to serve strongly consistent data directly from its in-memory cache. This significantly reduces the load on the object storage database for common read patterns such as filtered list requests (for example, “all pods on a specific node”) by ensuring that the cache’s data is verifiably up to date before the request is served.
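The core idea can be sketched in a few lines. This is a minimal illustration with hypothetical `FakeDatastore` and `WatchCache` stand-ins, not the real API server code; the actual mechanism uses etcd progress notifications rather than polling, but the consistency check is the same: confirm the cache has caught up to the datastore’s latest revision, then serve from memory.

```python
import time

class FakeDatastore:
    """Stands in for the object datastore: we only ever ask it for its revision."""
    def __init__(self, revision):
        self.revision = revision
    def latest_revision(self):
        return self.revision

class WatchCache:
    """In-memory cache fed by a watch stream; tracks the revision it has seen."""
    def __init__(self):
        self.revision = 0
        self._objects = {}
    def apply_event(self, revision, key, obj):
        self._objects[key] = obj
        self.revision = revision
    def objects(self):
        return self._objects.values()

def consistent_list(cache, datastore, filter_fn, timeout=1.0):
    # 1) Cheap metadata-only read: what is the datastore's latest revision?
    target = datastore.latest_revision()
    # 2) Wait until the watch cache has caught up to that revision.
    deadline = time.monotonic() + timeout
    while cache.revision < target:
        if time.monotonic() > deadline:
            raise TimeoutError("cache lagging; would fall back to datastore")
        time.sleep(0.005)
    # 3) Serve the filtered LIST straight from memory — strongly consistent,
    #    because the cache is at least as fresh as the target revision.
    return [o for o in cache.objects() if filter_fn(o)]

cache, store = WatchCache(), FakeDatastore(revision=2)
cache.apply_event(1, "pod-a", {"name": "pod-a", "node": "node-1"})
cache.apply_event(2, "pod-b", {"name": "pod-b", "node": "node-2"})
pods_on_node1 = consistent_list(cache, store, lambda o: o["node"] == "node-1")
print([p["name"] for p in pods_on_node1])
```

Only the revision number crosses to the datastore; the full objects, and the per-node filtering, never leave API server memory.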
Building on this, the Snapshottable API Server Cache feature (KEP-4988) further increases performance by allowing the API server to serve LIST requests for previous states (via pagination or by specifying a resourceVersion) directly from the same consistent watch cache. By generating a B-tree “snapshot” of the cache at a specific resource version, the API server can efficiently handle subsequent LIST requests without repeatedly querying the datastore.
Together, these two enhancements address the problem of read amplification, ensuring that the API remains fast and responsive by serving strongly consistent, filtered read and list requests (including those over past states) directly from server memory. This is essential for maintaining cluster-wide component health under peak load.
An optimized distributed storage backend
To support the huge scale of the cluster, we relied on a proprietary key-value store based on Google’s Spanner distributed database. At 130K nodes, we need 13,000 QPS of lease object updates to keep critical cluster operations like node health checks running without disruption, providing the stability the entire system needs to operate reliably. We saw no bottlenecks in the new storage system and no indication that it could not support even higher scale.
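The 13,000 QPS figure follows directly from the node count and the kubelet’s default node-lease renewal period of roughly 10 seconds (the interval is an assumption based on Kubernetes defaults, not stated in the text):

```python
# Lease-update load implied by cluster size alone.
# Assumption: each kubelet renews its node Lease every ~10 s (Kubernetes default).
nodes = 130_000
lease_renew_interval_s = 10

lease_update_qps = nodes / lease_renew_interval_s
print(f"{lease_update_qps:.0f} QPS")  # 13000 QPS of lease writes alone
```

Note that this is write traffic that exists before a single pod is scheduled; everything the workloads themselves do comes on top of it.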
Kueue for advanced job queueing
The default Kubernetes scheduler is designed to schedule individual pods, but complex AI/ML environments require more sophisticated, job-level management. Kueue is a job queueing controller that brings batch system capabilities to Kubernetes. It decides *when* a job should be admitted based on fair-sharing policies, priorities, and resource quotas, and enables “all-or-nothing” scheduling for entire jobs. Built on top of the default scheduler, Kueue provided the orchestration needed to manage the complex mix of competing training, batch, and inference workloads in our benchmarks.
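To make the “all-or-nothing” admission idea concrete, here is a minimal sketch of quota-gated job admission. The job names, GPU quota, and the simple greedy loop are illustrative assumptions; real Kueue works with ClusterQueues, cohorts, fair sharing, and preemption, not this toy logic.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                          # higher = considered first
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)

def admit_jobs(queue, quota_gpus):
    """Toy all-or-nothing admission: a job is admitted only if its *entire*
    resource request fits in the remaining quota; otherwise it waits.
    No job ever runs with a partial allocation."""
    admitted, waiting = [], []
    free = quota_gpus
    for job in sorted(queue, reverse=True):    # highest priority first
        if job.gpus_needed <= free:            # all-or-nothing check
            free -= job.gpus_needed
            admitted.append(job.name)
        else:
            waiting.append(job.name)
    return admitted, waiting

queue = [
    Job(priority=3, name="inference", gpus_needed=8),
    Job(priority=2, name="training", gpus_needed=64),
    Job(priority=1, name="batch-prep", gpus_needed=32),
]
admitted, waiting = admit_jobs(queue, quota_gpus=80)
print(admitted, waiting)
```

With an 80-GPU quota, the high-priority inference job (8) and the training job (64) are admitted whole, while the 32-GPU batch job waits rather than starting with a partial allocation that could deadlock against the others.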
The future of scheduling: increased workload awareness
Beyond job-level queuing, the Kubernetes ecosystem is evolving toward workload-aware scheduling at its core. The goal is to move from a pod-centric to a workload-centric approach to scheduling. This means that the scheduler will make placement decisions taking into account the needs of the entire workload as a unit, incorporating both available and potential capacity. This holistic approach is critical to optimizing cost-performance, especially for the new wave of AI/ML training and inference workloads.
A key aspect of the emerging Kubernetes scheduler is the native implementation of gang scheduling semantics within Kubernetes, a capability currently provided by add-ons like Kueue. The community is actively working on this in KEP-4671: Gang Scheduling.
Over time, support for workload-aware scheduling in core Kubernetes will make it easier to orchestrate large-scale, tightly coupled applications on GKE, making the platform even more powerful for demanding AI/ML and HPC use cases. We are also working on integrating Kueue as a second-level scheduler within GKE.
Cloud Storage FUSE for data access
AI workloads must be able to access data efficiently. Cloud Storage FUSE with parallel download and caching enabled, paired with zonal Anywhere Cache, allows access to model data in a Cloud Storage bucket as if it were a local file system, reducing latency by up to 70%. It provides a scalable, high-throughput mechanism to feed data into distributed training jobs or scale-out inference workflows. Alternatively, Google Cloud Managed Lustre is a fully managed persistent zonal storage solution that supports workloads requiring multi-petabyte capacity, TB/s throughput, and sub-millisecond latency. You can learn more about your storage options for AI/ML workloads here.
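The parallel-download idea behind this can be sketched simply: split a large object into byte ranges, fetch the ranges concurrently, and reassemble. The `fetch_range` helper below is a hypothetical stand-in that slices an in-memory buffer so the sketch stays runnable; in the real system, each range would be a ranged read against a Cloud Storage object.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_range(blob, start, end):
    """Hypothetical stand-in for a ranged object read; here we just slice
    an in-memory bytes object to keep the sketch self-contained."""
    return blob[start:end]

def parallel_download(blob, chunk_size, workers=8):
    """Split an object into ranges, fetch them concurrently, reassemble.
    map() yields results in submission order, so reassembly is trivial."""
    ranges = [(i, min(i + chunk_size, len(blob)))
              for i in range(0, len(blob), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda r: fetch_range(blob, *r), ranges)
    return b"".join(chunks)

model_weights = bytes(range(256)) * 1024          # a fake 256 KiB "checkpoint"
data = parallel_download(model_weights, chunk_size=64 * 1024)
print(data == model_weights)
```

For multi-gigabyte model checkpoints, fanning the ranges out across connections is what turns a single slow sequential read into something that can saturate the node’s available bandwidth.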
Benchmarking GKE for large-scale, dynamic AI workloads
To validate the performance of GKE with large-scale AI/ML workloads, we designed a four-phase benchmark simulating a dynamic environment with complex resource management, prioritization, and scheduling challenges. This builds on the benchmark used in our previous 65K-node scale test.
We upgraded the benchmark to represent a typical AI platform hosting a mix of workloads with different priority classes:
- Low priority: Preemptible batch processing, such as data preparation tasks.
- Medium priority: Core model training jobs that are important but can tolerate some queueing.
- High priority: Latency-sensitive, user-facing inference services that must have guaranteed resources.
We orchestrated the process using Kueue to manage quotas and resource sharing, and JobSet to manage training jobs.
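The interaction between the three priority tiers is essentially priority-based preemption: a pending high-priority pod may evict lower-priority running pods until it fits. The sketch below is an illustrative toy (the names, sizes, and capacity are assumptions), not the scheduler’s actual algorithm.

```python
def preempt_for(pending, running, capacity):
    """Toy preemption: evict the lowest-priority running pods, and only
    pods strictly lower-priority than the pending one, until it fits."""
    used = sum(p["size"] for p in running)
    victims = []
    for victim in sorted(running, key=lambda p: p["priority"]):
        if used + pending["size"] <= capacity:
            break                      # it already fits, stop evicting
        if victim["priority"] >= pending["priority"]:
            break                      # nothing lower-priority left to evict
        victims.append(victim["name"])
        used -= victim["size"]
    fits = used + pending["size"] <= capacity
    return fits, victims

running = [
    {"name": "batch-prep", "priority": 1, "size": 40},
    {"name": "training", "priority": 2, "size": 50},
]
inference = {"name": "inference", "priority": 3, "size": 30}
fits, evicted = preempt_for(inference, running, capacity=100)
print(fits, evicted)
```

Here the arriving inference pod displaces only the low-priority batch job; the medium-priority training job keeps running, which is exactly the behavior the benchmark phases are designed to exercise.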
Step 1: Establishing a Performance Baseline with a Large Training Task
To start, we measured the baseline performance of the cluster by scheduling a single, large-scale training workload. We deployed a single JobSet configured to run 130,000 medium-priority pods simultaneously. This initial test let us establish a baseline for key metrics such as pod startup latency and overall scheduling throughput, revealing the overhead of launching a substantial workload on a clean cluster. This set the stage for evaluating GKE’s performance under more complex conditions. After the run, we removed the JobSet, leaving an empty cluster for Step 2.
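As a rough sketch of how such baseline metrics are computed, the per-pod timestamps can be reduced to a latency summary and a throughput figure. The timestamps below are made-up illustrative values, not benchmark data:

```python
import statistics

def baseline_metrics(schedule_times, running_times):
    """Reduce raw per-pod timestamps (seconds) into two baseline numbers:
    median pod startup latency (scheduled -> running) and overall
    scheduling throughput (pods scheduled per second of wall time)."""
    latencies = [r - s for s, r in zip(schedule_times, running_times)]
    window = max(schedule_times) - min(schedule_times)
    return {
        "median_startup_latency_s": statistics.median(latencies),
        "scheduling_throughput_pods_per_s": len(schedule_times) / window,
    }

# Five illustrative pods scheduled over 4 s, each starting 2-4 s later.
sched = [0, 1, 2, 3, 4]
run   = [2, 4, 5, 6, 8]
m = baseline_metrics(sched, run)
print(m)
```

At benchmark scale the same reduction runs over 130,000 pod records, and percentile latencies (p50/p99) rather than a single median are what reveal stragglers.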