Month: June 2021

OpenAI scaling kubernetes cluster to 7500 nodes

Interesting use of MPI, anti-affinity, gang scheduling, alias IP addresses. some bits I found interesting below.

“A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node. Any NUMA, CPU, or PCIE resource contention aren’t factors for scheduling. Bin-packing or fragmentation is not a common problem. Our current clusters have full bisection bandwidth, so we also don’t make any rack or network topology considerations. All of this means that, while we have many nodes, there’s relatively low strain on the scheduler.”

“We use iptables tagging on the host to track network resource usage per Namespace and pod. This lets researchers visualize their network usage patterns. In particular, since a lot of our experiments have distinct Internet and intra-pod communication patterns, it’s often useful to be able to investigate where any bottlenecks might be occurring.”

Previous blog on scaling to 2500 nodes –

“the default setting for Fluentd’s and Datadog’s monitoring processes was to query the apiservers from every node in the cluster (for example, this issue which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again”

“ARP cache size particularly relevant in Kubernetes clusters since every pod has its own IP address which consumes space in the ARP cache”

“We use Kubernetes mainly as a batch scheduling system and rely on our autoscaler to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly. The default kube-scheduler policy is to spread out load evenly among nodes, but we want the opposite so that unused nodes can be terminated and also so that large pods can be scheduled quickly.”