Month: July 2024

Sizing an LLM for GPU memory

When choosing an EC2 instance for a large language model (LLM), one of the first constraints is whether the model will fit in the instance's GPU memory.

Given a choice of model, the decisions roughly follow this path –

Model -> Training/Inference -> Technique (choice of optimization) -> Memory requirement -> Instance requirement -> Instance availability -> smaller instance, more optimization, or distributed training.
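The memory-requirement step can be approximated with a common rule of thumb: bytes per parameter times parameter count, with a much larger multiplier for training. A sketch (simplified; actual usage also includes activations, KV cache, and framework overhead):

```python
def model_memory_gib(params_billions: float, bytes_per_param: float,
                     training: bool = False) -> float:
    """Rough GPU memory needed to hold a model.

    Inference: weights only (params * bytes_per_param).
    Training: a common rule of thumb for mixed-precision Adam is
    ~16 bytes per parameter (fp16 weights and gradients, plus fp32
    master weights and two fp32 optimizer states), before activations.
    """
    n_params = params_billions * 1e9
    total_bytes = n_params * (16 if training else bytes_per_param)
    return total_bytes / (1024 ** 3)

# A 7B-parameter model in fp16 (2 bytes/param):
print(round(model_memory_gib(7, 2), 1))                 # weights for inference
print(round(model_memory_gib(7, 2, training=True), 1))  # full Adam training state
```

A 7B model needs roughly 13 GiB just for fp16 weights, but over 100 GiB of state to train with Adam, which is why the training/inference decision comes before instance selection.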

Some extreme optimizations are possible, such as QLoRA, or loading one layer into GPU memory at a time for inference; see the blog post "How to fit a layer in memory at a time": https://huggingface.co/blog/lyogavin/airllm. However, many use cases cannot tolerate any sacrifice in accuracy.

Distributed training, splitting the model across smaller instances, is another possibility. A discussion is here – https://siboehm.com/articles/22/pipeline-parallel-training
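As a toy illustration of the pipeline-parallel idea discussed in that article, consecutive layers can be assigned to stages so that each stage holds a roughly equal share of parameter memory. This is a hypothetical helper, not code from the article; real frameworks also balance compute time and activation memory per stage:

```python
def partition_layers(layer_sizes_gib, num_stages):
    """Assign consecutive layers to pipeline stages so each stage holds
    roughly 1/num_stages of total parameter memory (illustrative only)."""
    total = sum(layer_sizes_gib)
    stages = [[] for _ in range(num_stages)]
    cumulative = 0.0
    for i, size in enumerate(layer_sizes_gib):
        # Place this layer in the stage its cumulative position falls into.
        stage = min(int(cumulative / total * num_stages), num_stages - 1)
        stages[stage].append(i)
        cumulative += size
    return stages

# Eight equally sized layers split across two instances:
print(partition_layers([1.0] * 8, 2))
```

Each stage then only needs GPU memory for its own layers (plus activations passed between stages), which is what lets several smaller instances stand in for one large one.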

Here’s a listing of the different GPU instance types on one page, with a GPU Memory column where available, to facilitate instance comparisons.

EC2 G3 Instance Details
| Name | GPUs | vCPU | Memory (GiB) | GPU Memory (GiB) | Price/hr* (Linux) | Price/hr* (Windows) | 1-yr Reserved Instance Effective Hourly* (Linux) | 3-yr Reserved Instance Effective Hourly* (Linux) |
| g3s.xlarge | 1 | 4 | 30.5 | 8 | $0.75 | $0.93 | $0.525 | $0.405 |
| g3.4xlarge | 1 | 16 | 122 | 8 | $1.14 | $1.876 | $0.741 | $0.538 |
| g3.8xlarge | 2 | 32 | 244 | 16 | $2.28 | $3.752 | $1.482 | $1.076 |
| g3.16xlarge | 4 | 64 | 488 | 32 | $4.56 | $7.504 | $2.964 | $2.152 |
EC2 G4 Instance details
| Instance Size | GPU | vCPUs | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* (Linux) | 3-yr Reserved Instance Effective Hourly* (Linux) |

G4dn

Single GPU VMs:
| g4dn.xlarge | 1 | 4 | 16 | 1 x 125 NVMe SSD | Up to 25 | Up to 3.5 | $0.526 | $0.316 | $0.210 |
| g4dn.2xlarge | 1 | 8 | 32 | 1 x 225 NVMe SSD | Up to 25 | Up to 3.5 | $0.752 | $0.452 | $0.300 |
| g4dn.4xlarge | 1 | 16 | 64 | 1 x 225 NVMe SSD | Up to 25 | 4.75 | $1.204 | $0.722 | $0.482 |
| g4dn.8xlarge | 1 | 32 | 128 | 1 x 900 NVMe SSD | 50 | 9.5 | $2.176 | $1.306 | $0.870 |
| g4dn.16xlarge | 1 | 64 | 256 | 1 x 900 NVMe SSD | 50 | 9.5 | $4.352 | $2.612 | $1.740 |

Multi GPU VMs:
| g4dn.12xlarge | 4 | 48 | 192 | 1 x 900 NVMe SSD | 50 | 9.5 | $3.912 | $2.348 | $1.564 |
| g4dn.metal | 8 | 96 | 384 | 2 x 900 NVMe SSD | 100 | 19 | $7.824 | $4.694 | $3.130 |

G4ad

Single GPU VMs:
| g4ad.xlarge | 1 | 4 | 16 | 1 x 150 NVMe SSD | Up to 10 | Up to 3 | $0.379 | $0.227 | $0.178 |
| g4ad.2xlarge | 1 | 8 | 32 | 1 x 300 NVMe SSD | Up to 10 | Up to 3 | $0.541 | $0.325 | $0.254 |
| g4ad.4xlarge | 1 | 16 | 64 | 1 x 600 NVMe SSD | Up to 10 | Up to 3 | $0.867 | $0.520 | $0.405 |

Multi GPU VMs:
| g4ad.8xlarge | 2 | 32 | 128 | 1 x 1200 NVMe SSD | 15 | 3 | $1.734 | $1.040 | $0.810 |
| g4ad.16xlarge | 4 | 64 | 256 | 1 x 2400 NVMe SSD | 25 | 6 | $3.468 | $2.081 | $1.619 |
EC2 G5 instance details
| Instance Size | GPU | GPU Memory (GiB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr ISP Effective Hourly (Linux) | 3-yr ISP Effective Hourly (Linux) |

Single GPU VMs:
| g5.xlarge | 1 | 24 | 4 | 16 | 1 x 250 | Up to 10 | Up to 3.5 | $1.006 | $0.604 | $0.402 |
| g5.2xlarge | 1 | 24 | 8 | 32 | 1 x 450 | Up to 10 | Up to 3.5 | $1.212 | $0.727 | $0.485 |
| g5.4xlarge | 1 | 24 | 16 | 64 | 1 x 600 | Up to 25 | 8 | $1.624 | $0.974 | $0.650 |
| g5.8xlarge | 1 | 24 | 32 | 128 | 1 x 900 | 25 | 16 | $2.448 | $1.469 | $0.979 |
| g5.16xlarge | 1 | 24 | 64 | 256 | 1 x 1900 | 25 | 16 | $4.096 | $2.458 | $1.638 |

Multi GPU VMs:
| g5.12xlarge | 4 | 96 | 48 | 192 | 1 x 3800 | 40 | 16 | $5.672 | $3.403 | $2.269 |
| g5.24xlarge | 4 | 96 | 96 | 384 | 1 x 3800 | 50 | 19 | $8.144 | $4.886 | $3.258 |
| g5.48xlarge | 8 | 192 | 192 | 768 | 2 x 3800 | 100 | 19 | $16.288 | $9.773 | $6.515 |
EC2 G6 instance details
| Instance Size | GPU | GPU Memory (GB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr ISP Effective Hourly (Linux) | 3-yr ISP Effective Hourly (Linux) |

Single GPU VMs:
| g6.xlarge | 1 | 24 | 4 | 16 | 1 x 250 | Up to 10 | Up to 5 | $0.805 | $0.499 | $0.342 |
| g6.2xlarge | 1 | 24 | 8 | 32 | 1 x 450 | Up to 10 | Up to 5 | $0.978 | $0.606 | $0.416 |
| g6.4xlarge | 1 | 24 | 16 | 64 | 1 x 600 | Up to 25 | 8 | $1.323 | $0.820 | $0.562 |
| g6.8xlarge | 1 | 24 | 32 | 128 | 2 x 450 | 25 | 16 | $2.014 | $1.249 | $0.856 |
| g6.16xlarge | 1 | 24 | 64 | 256 | 2 x 940 | 25 | 20 | $3.397 | $2.106 | $1.443 |

Gr6 instances with 1:8 vCPU:RAM ratio:
| gr6.4xlarge | 1 | 24 | 16 | 128 | 1 x 600 | Up to 25 | 8 | $1.539 | $0.954 | $0.654 |
| gr6.8xlarge | 1 | 24 | 32 | 256 | 2 x 450 | 25 | 16 | $2.446 | $1.517 | $1.040 |

Multi GPU VMs:
| g6.12xlarge | 4 | 96 | 48 | 192 | 4 x 940 | 40 | 20 | $4.602 | $2.853 | $1.955 |
| g6.24xlarge | 4 | 96 | 96 | 384 | 4 x 940 | 50 | 30 | $6.675 | $4.139 | $2.837 |
| g6.48xlarge | 8 | 192 | 192 | 768 | 8 x 940 | 100 | 60 | $13.35 | $8.277 | $5.674 |
EC2 G6e instances
| Instance Size | GPU | GPU Memory (GiB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) |
| g6e.xlarge | 1 | 48 | 4 | 32 | 250 | Up to 20 | Up to 5 |
| g6e.2xlarge | 1 | 48 | 8 | 64 | 450 | Up to 20 | Up to 5 |
| g6e.4xlarge | 1 | 48 | 16 | 128 | 600 | 20 | 8 |
| g6e.8xlarge | 1 | 48 | 32 | 256 | 900 | 25 | 16 |
| g6e.16xlarge | 1 | 48 | 64 | 512 | 1900 | 35 | 20 |
| g6e.12xlarge | 4 | 192 | 48 | 384 | 3800 | 100 | 20 |
| g6e.24xlarge | 4 | 192 | 96 | 768 | 3800 | 200 | 30 |
| g6e.48xlarge | 8 | 384 | 192 | 1536 | 7600 | 400 | 60 |
EC2 P3 instance details
| Instance Size | GPUs (Tesla V100) | GPU Peer to Peer | GPU Memory (GB) | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* | 3-yr Reserved Instance Effective Hourly* |
| p3.2xlarge | 1 | N/A | 16 | 8 | 61 | Up to 10 Gbps | 1.5 Gbps | $3.06 | $1.99 | $1.05 |
| p3.8xlarge | 4 | NVLink | 64 | 32 | 244 | 10 Gbps | 7 Gbps | $12.24 | $7.96 | $4.19 |
| p3.16xlarge | 8 | NVLink | 128 | 64 | 488 | 25 Gbps | 14 Gbps | $24.48 | $15.91 | $8.39 |
| p3dn.24xlarge | 8 | NVLink | 256 | 96 | 768 | 100 Gbps | 19 Gbps | $31.218 | $18.30 | $9.64 |
EC2 P4 instance details
| Instance Size | vCPUs | Instance Memory (GiB) | GPUs (A100) | GPU Memory | Network Bandwidth (Gbps) | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (GB) | EBS Bandwidth (Gbps) | On-Demand Price/hr | 1-yr Reserved Instance Effective Hourly* | 3-yr Reserved Instance Effective Hourly* |
| p4d.24xlarge | 96 | 1152 | 8 | 320 GB HBM2 | 400 ENA and EFA | Yes | 600 GB/s NVSwitch | 8 x 1000 NVMe SSD | 19 | $32.77 | $19.22 | $11.57 |
| p4de.24xlarge (preview) | 96 | 1152 | 8 | 640 GB HBM2e | 400 ENA and EFA | Yes | 600 GB/s NVSwitch | 8 x 1000 NVMe SSD | 19 | $40.96 | $24.01 | $14.46 |
EC2 P5 instance details
| Instance Size | vCPUs | Instance Memory (TiB) | GPUs (H100) | GPU Memory | Network Bandwidth | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (TB) | EBS Bandwidth (Gbps) |
| p5.48xlarge | 192 | 2 | 8 | 640 GB HBM3 | 3200 Gbps EFAv2 | Yes | 900 GB/s NVSwitch | 8 x 3.84 NVMe SSD | 80 |
EC2 P5e instance details
| Instance Size | vCPUs | Instance Memory (TiB) | GPUs | GPU Memory | Network Bandwidth | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (TB) | EBS Bandwidth (Gbps) |
| p5e.48xlarge | 192 | 2 | 8 x NVIDIA H200 | 1128 GB HBM3e | 3200 Gbps EFA | Yes | 900 GB/s NVSwitch | 8 x 3.84 NVMe SSD | 80 |
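Given a memory estimate, the tables above can be searched for the cheapest instance whose total GPU memory fits the model. A sketch using a handful of entries transcribed from the tables (total GPU memory as listed, and the on-demand Linux price per hour):

```python
# (instance, total GPU memory as listed above, on-demand Linux $/hr)
INSTANCES = [
    ("g6.xlarge",     24,  0.805),
    ("g5.xlarge",     24,  1.006),
    ("g6.12xlarge",   96,  4.602),
    ("g5.12xlarge",   96,  5.672),
    ("p3.8xlarge",    64, 12.24),
    ("g5.48xlarge",  192, 16.288),
    ("p4d.24xlarge", 320, 32.77),
]

def cheapest_fit(required_gib):
    """Cheapest listed instance whose total GPU memory fits the model,
    or None if nothing listed is large enough."""
    candidates = [row for row in INSTANCES if row[1] >= required_gib]
    return min(candidates, key=lambda row: row[2]) if candidates else None

# ~40 GB needed (e.g. a 20B-parameter model in fp16):
print(cheapest_fit(40))
```

This only checks the fit constraint; in practice availability, network bandwidth, and per-GPU (not just total) memory also matter for the final choice.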

Relevant links

P5e and P5en announcement (update Sep’24). https://aws.amazon.com/blogs/machine-learning/amazon-ec2-p5e-instances-are-generally-available/

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units

Use of Triton Inference Server and NVIDIA NIM to make use of GPU memory across multiple GPUs on an instance –

https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow

https://aws.amazon.com/blogs/hpc/deploying-generative-ai-applications-with-nvidia-nims-on-amazon-eks

FP4 and four bit integer quantization, and QLoRA

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA at https://huggingface.co/blog/4bit-transformers-bitsandbytes

[2305.14152] Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
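To see why 4-bit quantization roughly quarters memory versus fp16, here is a minimal absmax-style round-trip in plain Python. This illustrates the general idea only; it is not the NF4 data type that bitsandbytes actually uses:

```python
def quantize_4bit(values):
    """Absmax quantization to signed 4-bit integers in [-7, 7]."""
    scale = max(abs(v) for v in values) / 7.0
    return [round(v / scale) for v in values], scale

def dequantize_4bit(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.02, -1.4, 0.66, 0.31, -0.05]
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
# Each weight now takes 4 bits instead of 16, at some loss of precision.
print(q)
print([round(w, 2) for w in restored])
```

Each weight is stored as a 4-bit integer plus one shared scale per block, trading precision for memory; QLoRA layers trainable adapters on top of such a frozen quantized base.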

Note: Performance is not just about GPU memory but also about network bandwidth, which is needed to load large models quickly, especially on a platform serving multiple models.
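As a back-of-the-envelope example of the bandwidth point, the time to move model weights over the network scales linearly with model size (ignoring storage and deserialization limits):

```python
def load_time_seconds(model_gb, network_gbps):
    """Seconds to move model_gb gigabytes over a network_gbps link
    (gigabits per second), ignoring protocol and storage overhead."""
    return model_gb * 8 / network_gbps

# A 140 GB model (roughly 70B parameters in fp16) over a 25 Gbps link:
print(load_time_seconds(140, 25))  # ~44.8 seconds
```

On a multi-model serving platform that swaps models in and out, those tens of seconds per load are why the higher-bandwidth instances can matter as much as raw GPU memory.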

When comparing the importance of high memory bandwidth between training and inference for Large Language Models (LLMs), it is generally more critical for training. Here’s why:

1. Training LLMs

  • Data Movement: Training LLMs involves frequent data movement between the GPU memory and the processing units. Each training iteration requires loading large batches of data, performing extensive matrix multiplications, and updating weights, all of which are memory-intensive operations.
  • Backward Pass: During the training phase, the backward pass (gradient computation and backpropagation) is highly memory bandwidth-intensive. The gradients of each layer are computed and propagated back through the network, requiring significant memory access.
  • Parameter Updates: High memory bandwidth is essential to handle the large volume of data being read and written during the parameter updates across multiple layers, especially in very deep models.
  • Larger Models and Datasets: Training large models like GPT-3 or GPT-4 involves massive datasets and billions of parameters, leading to a substantial demand for memory bandwidth.
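The training costs above can be tallied per parameter using a common mixed-precision Adam accounting (a rule of thumb; activation memory, which depends on batch size and sequence length, comes on top):

```python
def training_bytes_per_param():
    """Per-parameter state for mixed-precision training with Adam."""
    fp16_weights  = 2
    fp16_grads    = 2
    fp32_master   = 4
    adam_momentum = 4
    adam_variance = 4
    return fp16_weights + fp16_grads + fp32_master + adam_momentum + adam_variance

# A 7B-parameter model: state memory in GB, before activations.
print(7e9 * training_bytes_per_param() / 1e9)  # 112.0
```

All of that state is read and written every step, which is why training is so much more memory-bandwidth-hungry than inference.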

2. Inference with LLMs:

  • Data Movement: During inference, the primary task is to process input data and generate outputs, which involves reading the model parameters and performing computations. While this still requires good memory bandwidth, the demands are generally lower compared to training.
  • No Backpropagation: Inference does not involve the backward pass or parameter updates, significantly reducing the need for continuous memory writes. The absence of gradient computations and updates reduces the overall memory bandwidth requirements.
  • Smaller Batch Sizes: Inference typically operates on smaller batch sizes compared to training, further reducing the demand for memory bandwidth.
  • Optimizations: Techniques such as model quantization and optimized inference runtimes (like TensorRT) can reduce the memory bandwidth required during inference by optimizing how data is accessed and processed.
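One inference-side memory cost worth calling out alongside the points above is the KV cache, which grows with batch size and sequence length. A sketch with generic transformer dimensions (the config values below are illustrative, not a specific model's):

```python
def kv_cache_gib(batch, seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """KV cache size: two tensors (K and V) per layer, each of
    batch * seq_len * n_heads * head_dim elements."""
    elements = 2 * n_layers * batch * seq_len * n_heads * head_dim
    return elements * bytes_per_elem / (1024 ** 3)

# Illustrative 7B-class shape: 32 layers, 32 heads of dim 128, fp16 cache:
print(kv_cache_gib(batch=8, seq_len=4096, n_layers=32, n_heads=32, head_dim=128))  # 16.0
```

So even at inference time, long contexts and large batches can add GiBs on top of the weights, which should be included when sizing against the GPU memory columns above.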