
Sizing an LLM for GPU memory

When choosing an EC2 instance for a Large Language Model (LLM), one of the first constraints is whether the model will fit in the instance's GPU memory.

Given a choice of model, the decisions roughly follow this path:

Model -> Training or inference -> Technique (choice of optimization) -> Memory requirement -> Instance requirement -> Instance availability -> smaller instance, more optimization, or distributed training.
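For the "Memory requirement" step, a rough back-of-the-envelope estimate is usually enough to narrow down the instance family. The sketch below uses common rules of thumb only (weights at bytes-per-parameter, about 20% headroom for activations/KV cache at inference, and roughly 16 bytes per parameter for mixed-precision training with Adam); actual usage also depends on batch size, sequence length, and framework overhead.

```python
# Back-of-the-envelope GPU memory estimate for the "Memory requirement" step above.
# Rules of thumb only; real usage also depends on batch size, sequence length,
# KV cache, and framework overhead.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_gpu_memory_gb(num_params_billion: float, precision: str = "fp16",
                           training: bool = False) -> float:
    """Estimate GPU memory (GB) for a model with the given parameter count (in billions)."""
    weights_gb = num_params_billion * BYTES_PER_PARAM[precision]
    if not training:
        return weights_gb * 1.2   # ~20% headroom for activations and KV cache
    # Mixed-precision training with Adam: fp16 weights + fp16 gradients
    # + fp32 master weights + two fp32 Adam moments ~ 16 bytes per parameter
    # (the precision argument is ignored here), plus ~20% activation headroom.
    return num_params_billion * 16 * 1.2

if __name__ == "__main__":
    for p in ("fp16", "int4"):
        print(f"70B inference at {p}: ~{estimate_gpu_memory_gb(70, p):.0f} GB")
    print(f"70B full fine-tuning:  ~{estimate_gpu_memory_gb(70, training=True):.0f} GB")
```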

Some extreme optimizations are possible, such as QLoRA for inference; see the blog on fitting one layer in memory at a time: https://huggingface.co/blog/lyogavin/airllm. However, many use cases cannot tolerate any sacrifice in accuracy.

Distributed training, which splits the model across multiple smaller instances, is another possibility. A discussion is here: https://siboehm.com/articles/22/pipeline-parallel-training
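For intuition, here is a minimal sketch of the basic idea (naive model parallelism in PyTorch: half of the layers on each GPU, with activations moved between them). The linked article adds micro-batch pipelining on top of this to keep both GPUs busy; the layer sizes and device names here are only illustrative.

```python
# Naive model parallelism: split a model's layers across two GPUs.
import torch
import torch.nn as nn

def make_stage(d_model: int, n_layers: int) -> nn.Sequential:
    """A stack of identical layers standing in for one 'stage' of a larger model."""
    return nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(n_layers)])

class TwoStageModel(nn.Module):
    """Half of the layers live on cuda:0, the other half on cuda:1."""
    def __init__(self, d_model: int = 1024, layers_per_stage: int = 8):
        super().__init__()
        self.stage0 = make_stage(d_model, layers_per_stage).to("cuda:0")
        self.stage1 = make_stage(d_model, layers_per_stage).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage0(x.to("cuda:0"))   # first half runs on GPU 0
        x = self.stage1(x.to("cuda:1"))   # activations hop to GPU 1 for the second half
        return x

if __name__ == "__main__":
    model = TwoStageModel()
    out = model(torch.randn(4, 1024))
    print(out.shape, out.device)   # expected: torch.Size([4, 1024]) cuda:1
```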

Here is a listing of the different GPU instance types, with a GPU memory (GiB) column, gathered on one page to make instance comparisons easier.

EC2 G3 instance details

| Name | GPUs | vCPUs | Memory (GiB) | GPU Memory (GiB) | Price/hr* (Linux) | Price/hr* (Windows) | 1-yr Reserved Instance Effective Hourly* (Linux) | 3-yr Reserved Instance Effective Hourly* (Linux) |
|---|---|---|---|---|---|---|---|---|
| g3s.xlarge | 1 | 4 | 30.5 | 8 | $0.75 | $0.93 | $0.525 | $0.405 |
| g3.4xlarge | 1 | 16 | 122 | 8 | $1.14 | $1.876 | $0.741 | $0.538 |
| g3.8xlarge | 2 | 32 | 244 | 16 | $2.28 | $3.752 | $1.482 | $1.076 |
| g3.16xlarge | 4 | 64 | 488 | 32 | $4.56 | $7.504 | $2.964 | $2.152 |
EC2 G4 instance details

G4dn

| Instance Size | GPUs | vCPUs | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* (Linux) | 3-yr Reserved Instance Effective Hourly* (Linux) |
|---|---|---|---|---|---|---|---|---|---|
| Single GPU VMs | | | | | | | | | |
| g4dn.xlarge | 1 | 4 | 16 | 1 x 125 NVMe SSD | Up to 25 | Up to 3.5 | $0.526 | $0.316 | $0.210 |
| g4dn.2xlarge | 1 | 8 | 32 | 1 x 225 NVMe SSD | Up to 25 | Up to 3.5 | $0.752 | $0.452 | $0.300 |
| g4dn.4xlarge | 1 | 16 | 64 | 1 x 225 NVMe SSD | Up to 25 | 4.75 | $1.204 | $0.722 | $0.482 |
| g4dn.8xlarge | 1 | 32 | 128 | 1 x 900 NVMe SSD | 50 | 9.5 | $2.176 | $1.306 | $0.870 |
| g4dn.16xlarge | 1 | 64 | 256 | 1 x 900 NVMe SSD | 50 | 9.5 | $4.352 | $2.612 | $1.740 |
| Multi GPU VMs | | | | | | | | | |
| g4dn.12xlarge | 4 | 48 | 192 | 1 x 900 NVMe SSD | 50 | 9.5 | $3.912 | $2.348 | $1.564 |
| g4dn.metal | 8 | 96 | 384 | 2 x 900 NVMe SSD | 100 | 19 | $7.824 | $4.694 | $3.130 |

G4ad

| Instance Size | GPUs | vCPUs | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* (Linux) | 3-yr Reserved Instance Effective Hourly* (Linux) |
|---|---|---|---|---|---|---|---|---|---|
| Single GPU VMs | | | | | | | | | |
| g4ad.xlarge | 1 | 4 | 16 | 1 x 150 NVMe SSD | Up to 10 | Up to 3 | $0.379 | $0.227 | $0.178 |
| g4ad.2xlarge | 1 | 8 | 32 | 1 x 300 NVMe SSD | Up to 10 | Up to 3 | $0.541 | $0.325 | $0.254 |
| g4ad.4xlarge | 1 | 16 | 64 | 1 x 600 NVMe SSD | Up to 10 | Up to 3 | $0.867 | $0.520 | $0.405 |
| Multi GPU VMs | | | | | | | | | |
| g4ad.8xlarge | 2 | 32 | 128 | 1 x 1200 NVMe SSD | 15 | 3 | $1.734 | $1.040 | $0.810 |
| g4ad.16xlarge | 4 | 64 | 256 | 1 x 2400 NVMe SSD | 25 | 6 | $3.468 | $2.081 | $1.619 |
EC2 G5 instance details

| Instance Size | GPUs | GPU Memory (GiB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr ISP Effective Hourly (Linux) | 3-yr ISP Effective Hourly (Linux) |
|---|---|---|---|---|---|---|---|---|---|---|
| Single GPU VMs | | | | | | | | | | |
| g5.xlarge | 1 | 24 | 4 | 16 | 1 x 250 | Up to 10 | Up to 3.5 | $1.006 | $0.604 | $0.402 |
| g5.2xlarge | 1 | 24 | 8 | 32 | 1 x 450 | Up to 10 | Up to 3.5 | $1.212 | $0.727 | $0.485 |
| g5.4xlarge | 1 | 24 | 16 | 64 | 1 x 600 | Up to 25 | 8 | $1.624 | $0.974 | $0.650 |
| g5.8xlarge | 1 | 24 | 32 | 128 | 1 x 900 | 25 | 16 | $2.448 | $1.469 | $0.979 |
| g5.16xlarge | 1 | 24 | 64 | 256 | 1 x 1900 | 25 | 16 | $4.096 | $2.458 | $1.638 |
| Multi GPU VMs | | | | | | | | | | |
| g5.12xlarge | 4 | 96 | 48 | 192 | 1 x 3800 | 40 | 16 | $5.672 | $3.403 | $2.269 |
| g5.24xlarge | 4 | 96 | 96 | 384 | 1 x 3800 | 50 | 19 | $8.144 | $4.886 | $3.258 |
| g5.48xlarge | 8 | 192 | 192 | 768 | 2 x 3800 | 100 | 19 | $16.288 | $9.773 | $6.515 |
EC2 G6 instance details

| Instance Size | GPUs | GPU Memory (GB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr ISP Effective Hourly (Linux) | 3-yr ISP Effective Hourly (Linux) |
|---|---|---|---|---|---|---|---|---|---|---|
| Single GPU VMs | | | | | | | | | | |
| g6.xlarge | 1 | 24 | 4 | 16 | 1 x 250 | Up to 10 | Up to 5 | $0.805 | $0.499 | $0.342 |
| g6.2xlarge | 1 | 24 | 8 | 32 | 1 x 450 | Up to 10 | Up to 5 | $0.978 | $0.606 | $0.416 |
| g6.4xlarge | 1 | 24 | 16 | 64 | 1 x 600 | Up to 25 | 8 | $1.323 | $0.820 | $0.562 |
| g6.8xlarge | 1 | 24 | 32 | 128 | 2 x 450 | 25 | 16 | $2.014 | $1.249 | $0.856 |
| g6.16xlarge | 1 | 24 | 64 | 256 | 2 x 940 | 25 | 20 | $3.397 | $2.106 | $1.443 |
| Gr6 instances with 1:8 vCPU:RAM ratio | | | | | | | | | | |
| gr6.4xlarge | 1 | 24 | 16 | 128 | 1 x 600 | Up to 25 | 8 | $1.539 | $0.954 | $0.654 |
| gr6.8xlarge | 1 | 24 | 32 | 256 | 2 x 450 | 25 | 16 | $2.446 | $1.517 | $1.040 |
| Multi GPU VMs | | | | | | | | | | |
| g6.12xlarge | 4 | 96 | 48 | 192 | 4 x 940 | 40 | 20 | $4.602 | $2.853 | $1.955 |
| g6.24xlarge | 4 | 96 | 96 | 384 | 4 x 940 | 50 | 30 | $6.675 | $4.139 | $2.837 |
| g6.48xlarge | 8 | 192 | 192 | 768 | 8 x 940 | 100 | 60 | $13.35 | $8.277 | $5.674 |
EC2 G6e instance details

| Instance Size | GPUs | GPU Memory (GiB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|
| g6e.xlarge | 1 | 48 | 4 | 32 | 250 | Up to 20 | Up to 5 |
| g6e.2xlarge | 1 | 48 | 8 | 64 | 450 | Up to 20 | Up to 5 |
| g6e.4xlarge | 1 | 48 | 16 | 128 | 600 | 20 | 8 |
| g6e.8xlarge | 1 | 48 | 32 | 256 | 900 | 25 | 16 |
| g6e.16xlarge | 1 | 48 | 64 | 512 | 1900 | 35 | 20 |
| g6e.12xlarge | 4 | 192 | 48 | 384 | 3800 | 100 | 20 |
| g6e.24xlarge | 4 | 192 | 96 | 768 | 3800 | 200 | 30 |
| g6e.48xlarge | 8 | 384 | 192 | 1536 | 7600 | 400 | 60 |
EC2 P3 instance details

| Instance Size | GPUs (Tesla V100) | GPU Peer to Peer | GPU Memory (GB) | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* | 3-yr Reserved Instance Effective Hourly* |
|---|---|---|---|---|---|---|---|---|---|---|
| p3.2xlarge | 1 | N/A | 16 | 8 | 61 | Up to 10 Gbps | 1.5 Gbps | $3.06 | $1.99 | $1.05 |
| p3.8xlarge | 4 | NVLink | 64 | 32 | 244 | 10 Gbps | 7 Gbps | $12.24 | $7.96 | $4.19 |
| p3.16xlarge | 8 | NVLink | 128 | 64 | 488 | 25 Gbps | 14 Gbps | $24.48 | $15.91 | $8.39 |
| p3dn.24xlarge | 8 | NVLink | 256 | 96 | 768 | 100 Gbps | 19 Gbps | $31.218 | $18.30 | $9.64 |
EC2 P4 instance details

| Instance Size | vCPUs | Instance Memory (GiB) | GPUs (A100) | GPU Memory | Network Bandwidth (Gbps) | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (GB) | EBS Bandwidth (Gbps) | On-Demand Price/hr | 1-yr Reserved Instance Effective Hourly* | 3-yr Reserved Instance Effective Hourly* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p4d.24xlarge | 96 | 1152 | 8 | 320 GB HBM2 | 400 ENA and EFA | Yes | 600 GB/s NVSwitch | 8 x 1000 NVMe SSD | 19 | $32.77 | $19.22 | $11.57 |
| p4de.24xlarge (preview) | 96 | 1152 | 8 | 640 GB HBM2e | 400 ENA and EFA | Yes | 600 GB/s NVSwitch | 8 x 1000 NVMe SSD | 19 | $40.96 | $24.01 | $14.46 |
EC2 P5 instance details

| Instance Size | vCPUs | Instance Memory (TiB) | GPUs (H100) | GPU Memory | Network Bandwidth | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (TB) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|---|---|
| p5.48xlarge | 192 | 2 | 8 | 640 GB HBM3 | 3200 Gbps EFAv2 | Yes | 900 GB/s NVSwitch | 8 x 3.84 NVMe SSD | 80 |
EC2 P5e instance details

| Instance Size | vCPUs | Instance Memory (TiB) | GPUs | GPU Memory | Network Bandwidth (Gbps) | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (TB) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|---|---|
| p5e.48xlarge | 192 | 2 | 8 x NVIDIA H200 | 1128 GB HBM3e | 3200 Gbps EFA | Yes | 900 GB/s NVSwitch | 8 x 3.84 NVMe SSD | 80 |
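With the tables above in hand, instance selection reduces to a simple filter. Below is a small sketch that scans a hand-copied subset of the tables (total GPU memory and approximate on-demand Linux pricing, both of which should be re-verified against current AWS pages and regional availability) and lists the instances that can hold a given memory estimate.

```python
# Pick candidate instances whose total GPU memory covers a requirement.
# The catalog is a small, hand-copied subset of the tables above (GB of total
# GPU memory, approximate on-demand Linux $/hr); verify against current AWS
# pricing and availability in your region before relying on it.

CATALOG = {
    "g5.xlarge":    {"gpu_mem_gb": 24,  "price_hr": 1.006},
    "g5.12xlarge":  {"gpu_mem_gb": 96,  "price_hr": 5.672},
    "g5.48xlarge":  {"gpu_mem_gb": 192, "price_hr": 16.288},
    "g6e.48xlarge": {"gpu_mem_gb": 384, "price_hr": None},   # pricing not listed above
    "p4d.24xlarge": {"gpu_mem_gb": 320, "price_hr": 32.77},
    "p5.48xlarge":  {"gpu_mem_gb": 640, "price_hr": None},   # pricing not listed above
}

def candidates(required_gb: float):
    """Return instances with enough total GPU memory, cheapest known price first."""
    fits = [(name, spec) for name, spec in CATALOG.items()
            if spec["gpu_mem_gb"] >= required_gb]
    return sorted(fits, key=lambda kv: (kv[1]["price_hr"] is None, kv[1]["price_hr"] or 0))

if __name__ == "__main__":
    for name, spec in candidates(required_gb=168):   # e.g. the 70B fp16 inference estimate
        print(name, spec)
```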

Relevant links

P5e and P5en announcement (update Sep ’24): https://aws.amazon.com/blogs/machine-learning/amazon-ec2-p5e-instances-are-generally-available/

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units

Using Triton and NIM to make use of GPU memory across multiple GPUs on an instance:

https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow

https://aws.amazon.com/blogs/hpc/deploying-generative-ai-applications-with-nvidia-nims-on-amazon-eks

FP4, 4-bit integer quantization, and QLoRA

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA at https://huggingface.co/blog/4bit-transformers-bitsandbytes
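For reference, here is a minimal sketch of what 4-bit (NF4) loading looks like with transformers and bitsandbytes, along the lines of the blog above. The model ID is only illustrative, and exact argument names can vary between library versions.

```python
# Minimal sketch of 4-bit (NF4) loading with transformers + bitsandbytes.
# Model ID is illustrative; arguments may vary slightly across library versions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM repo works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NF4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers across available GPUs
)
print(model.get_memory_footprint() / 1e9, "GB")
```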

Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization (arXiv:2305.14152) – https://arxiv.org/abs/2305.14152

Note: Performance is not just about GPU memory; network bandwidth also matters, since it is needed to load large models, especially on a platform serving multiple models.
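As a quick illustration (a back-of-the-envelope sketch that ignores storage throughput, protocol overhead, and decompression), loading the roughly 140 GB of fp16 weights for a 70B-parameter model is bounded by the instance's network bandwidth:

```python
# Rough lower bound on the time to pull model weights over the network.
# Ignores storage throughput, protocol overhead, and decompression.
def load_time_seconds(model_size_gb: float, bandwidth_gbps: float) -> float:
    return (model_size_gb * 8) / bandwidth_gbps   # GB -> gigabits, divided by Gbps

for bw in (10, 25, 100):   # representative G/P instance network bandwidths (Gbps)
    t = load_time_seconds(140, bw)                # ~140 GB of fp16 weights for a 70B model
    print(f"{bw:>3} Gbps: ~{t:.0f} s (~{t / 60:.1f} min)")
```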

When comparing the importance of high memory bandwidth for training versus inference of Large Language Models (LLMs), it is generally more critical for training (a rough back-of-the-envelope comparison follows the two lists below). Here’s why:

1. Training of LLMs:

  • Data Movement: Training LLMs involves frequent data movement between the GPU memory and the processing units. Each training iteration requires loading large batches of data, performing extensive matrix multiplications, and updating weights, all of which are memory-intensive operations.
  • Backward Pass: During the training phase, the backward pass (gradient computation and backpropagation) is highly memory bandwidth-intensive. The gradients of each layer are computed and propagated back through the network, requiring significant memory access.
  • Parameter Updates: High memory bandwidth is essential to handle the large volume of data being read and written during the parameter updates across multiple layers, especially in very deep models.
  • Larger Models and Datasets: Training large models like GPT-3 or GPT-4 involves massive datasets and millions (or even billions) of parameters, leading to a substantial demand for memory bandwidth.

2. Inference of LLMs:

  • Data Movement: During inference, the primary task is to process input data and generate outputs, which involves reading the model parameters and performing computations. While this still requires good memory bandwidth, the demands are generally lower compared to training.
  • No Backpropagation: Inference does not involve the backward pass or parameter updates, significantly reducing the need for continuous memory writes. The absence of gradient computations and updates reduces the overall memory bandwidth requirements.
  • Smaller Batch Sizes: Inference typically operates on smaller batch sizes compared to training, further reducing the demand for memory bandwidth.
  • Optimizations: Techniques such as model quantization and optimized inference runtimes (like TensorRT) can reduce the memory bandwidth required during inference by optimizing how data is accessed and processed.
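As a very rough illustration of the difference, here is a toy calculation under simplified assumptions (only weight and optimizer-state traffic is counted; activations, KV cache, and batch effects are ignored, so treat the numbers as illustrative only):

```python
# Toy comparison of per-step memory traffic for weights and optimizer state only.
# Simplified assumptions: fp16 weights are read once per forward pass for inference;
# training reads weights in the forward and backward passes, writes fp16 gradients,
# and reads + writes fp32 Adam state (master weights plus two moment tensors).
# Activations and the KV cache are ignored, so the numbers are illustrative only.
PARAMS = 70e9          # e.g. a 70B-parameter model
FP16, FP32 = 2, 4      # bytes per value

inference_bytes = PARAMS * FP16                      # one pass over the weights
training_bytes = (PARAMS * FP16 * 2                  # weight reads: forward + backward
                  + PARAMS * FP16                    # gradient write
                  + PARAMS * FP32 * 3 * 2)           # Adam state (3 fp32 tensors) read + write

print(f"inference : ~{inference_bytes / 1e12:.2f} TB of weight traffic per forward pass")
print(f"training  : ~{training_bytes / 1e12:.2f} TB per step (weights + optimizer state only)")
```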

EC2 P5 UltraClusters

Each P5 EC2 instance has

  • eight NVIDIA H100 GPUs capable of 16 petaFLOPs of mixed-precision performance
  • 640 GB of high-bandwidth GPU memory (80 GB per GPU)
  • 3,200 Gbps networking connectivity (8x more than the previous generation)

The increased performance of P5 instances accelerates the time-to-train machine learning (ML) models by up to 6x (reducing training time from days to hours), and the additional GPU memory helps customers train larger, more complex models.

P5 instances are expected to lower the cost to train ML models by up to 40% over the previous generation, providing customers greater efficiency over less flexible cloud offerings or expensive on-premises systems.

https://nvidianews.nvidia.com/news/aws-and-nvidia-collaborate-on-next-generation-infrastructure-for-training-large-machine-learning-models-and-building-generative-ai-applications

Nvidia H100 GPU overview and data sheet – https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

Diagram of P4d UltraClusters

A P4d instance has 8 A100 GPUs with 40 GB of GPU memory each (320 GB total).

A P4de instance has 8 A100 80GB GPUs with 80 GB of GPU memory each (640 GB total).

Nvidia blog on HGX baseboard supporting 8 A100 GPUs – https://developer.nvidia.com/blog/introducing-hgx-a100-most-powerful-accelerated-server-platform-for-ai-hpc/

A100 80GB data sheet – https://www.nvidia.com/en-us/data-center/a100/

MIG support in A100 – https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/ and MIG user guide – https://docs.nvidia.com/datacenter/tesla/mig-user-guide

MIG support in AWS EC2 instance type P4d and in AWS EKS – https://developer.nvidia.com/blog/amazon-elastic-kubernetes-services-now-offers-native-support-for-nvidia-a100-multi-instance-gpus/

GCP A2 adds 16 A100 GPUs to a node – https://cloud.google.com/blog/products/compute/announcing-google-cloud-a2-vm-family-based-on-nvidia-a100-gpu

https://cloud.google.com/blog/products/containers-kubernetes/gke-now-supports-multi-instance-gpus

Running more pods/gpu on EKS with MIG – https://medium.com/itnext/run-more-pods-per-gpu-with-nvidia-multi-instance-gpu-d4f7fb07c9b5

Nvidia Embraces The CPU World With “Grace” Arm Server Chip

EC2 Trainium UltraClusters

Each EC2 Trn1 instance has

  • up to 16 AWS Trainium accelerators, purpose-built to accelerate DL training, delivering up to 3.4 petaflops of FP16/BF16 compute. Each accelerator includes two second-generation NeuronCores.
  • 512 GB of shared accelerator memory (HBM) with 9.8 TB/s of total memory bandwidth
  • 1600 Gbps of Elastic Fabric Adapter (EFAv2) networking

An EC2 Trn1 UltraCluster consists of densely packed, co-located racks of Trn1 compute instances interconnected by non-blocking, petabyte-scale networking. It is AWS's largest UltraCluster to date, offering 6 exaflops of compute power on demand with up to 30,000 Trainium chips.

https://aws.amazon.com/blogs/machine-learning/scaling-large-language-model-llm-training-with-amazon-ec2-trn1-ultraclusters/