EC2 Trainium UltraClusters

Each EC2 Trn1 instance has

  • up to 16 AWS Trainium accelerators purpose built to accelerate DL training and deliver up to 3.4 petaflops of FP16/BF16 compute power. Each accelerator includes two second-generation NeuronCores
  • 512 GB of shared accelerator memory (HBM) with 9.8 TB/s of total memory bandwidth
  • 1600 Gbps of Elastic Fabric Adapter (EFAv2)

An EC2 Trn1 UltraCluster, consists of densely packed, co-located racks of Trn1 compute instances interconnected by non-blocking petabyte scale networking. It is our largest UltraCluster to date, offering 6 exaflops of compute power on demand with up to 30,000 Trainium chips.

https://aws.amazon.com/blogs/machine-learning/scaling-large-language-model-llm-training-with-amazon-ec2-trn1-ultraclusters/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s