When choosing an EC2 instance for a Large Language Model, one of the first constraints is whether the model will fit in the GPU memory of the instance.
Given a choice of model, the decisions roughly follow this path:
Model -> Training/Inferencing -> Technique (choice of optimization) -> Memory requirement -> Instance requirement -> Instance availability -> smaller instance, more optimization, or distributed training.
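As a starting point for the memory-requirement step, a rough rule of thumb is bytes per parameter times parameter count, plus overhead for optimizer state and activations when training, or for the KV cache and activations when serving. The sketch below is a back-of-the-envelope estimator, not an exact sizing tool; the overhead multipliers are assumptions that vary by framework and configuration.
```python
# Rough GPU memory estimate for an LLM, to compare against the
# "GPU Memory (GiB)" column in the instance tables below.
# The overhead multipliers are assumptions, not exact figures.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def estimate_gpu_memory_gib(params_billions: float, precision: str = "fp16",
                            training: bool = False) -> float:
    weights_gib = params_billions * 1e9 * BYTES_PER_PARAM[precision] / 2**30
    if training:
        # Adam-style training commonly needs roughly 3-4x the weight memory
        # again for gradients and optimizer state (activations ignored here).
        return weights_gib * 4
    # Inference: weights plus ~20% headroom for KV cache and activations.
    return weights_gib * 1.2

# Example: a 13B-parameter model in fp16
print(f"13B fp16 inference: ~{estimate_gpu_memory_gib(13):.0f} GiB")                  # ~29 GiB
print(f"13B fp16 training:  ~{estimate_gpu_memory_gib(13, training=True):.0f} GiB")   # ~97 GiB
```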
Some aggressive optimizations are possible, such as QLoRA for inference, or loading the model one layer into memory at a time; see the AirLLM blog post: https://huggingface.co/blog/lyogavin/airllm. However, many use cases do not want any sacrifice in accuracy.
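To show the quantization route concretely, the sketch below loads a model with 4-bit (NF4) weight quantization through the Hugging Face transformers + bitsandbytes integration, the same quantization scheme QLoRA builds on. It assumes transformers, accelerate, and bitsandbytes are installed; the model ID is only an illustrative placeholder.
```python
# Minimal sketch: load a model with 4-bit (NF4) weight quantization so it
# fits on a smaller GPU. Assumes transformers, accelerate and bitsandbytes
# are installed; the model ID below is just an illustrative placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)

model_id = "meta-llama/Llama-2-13b-hf"      # placeholder model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on the available GPU(s)
)
# A 13B model that needs ~26 GB of weights in fp16 fits in roughly
# 8-10 GB at 4 bits, at some possible cost in accuracy.
```
The trade-off is exactly the accuracy concern noted above: 4-bit weights shrink the memory footprint roughly 4x versus fp16, but some quality loss is possible.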
Distributed training, which splits the model across several smaller instances (or GPUs), is another possibility; a good discussion of pipeline-parallel training is here: https://siboehm.com/articles/22/pipeline-parallel-training
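To make the idea concrete, here is a minimal sketch of a naive two-GPU model split in plain PyTorch: the network is divided into two stages, each stage lives on its own GPU, and activations are copied between devices during the forward pass. Real pipeline-parallel training (as discussed in the article above) additionally splits each batch into micro-batches so both GPUs stay busy; this only illustrates the splitting step.
```python
# Naive two-GPU model split in PyTorch: each stage lives on its own device
# and activations are copied between them in the forward pass. Pipeline
# parallelism adds micro-batching on top of this so both GPUs stay busy.
import torch
import torch.nn as nn

def make_stage(hidden: int, num_layers: int) -> nn.Sequential:
    layers = []
    for _ in range(num_layers):
        layers += [nn.Linear(hidden, hidden), nn.ReLU()]
    return nn.Sequential(*layers)

class TwoStageModel(nn.Module):
    def __init__(self, hidden: int = 4096, layers_per_stage: int = 8):
        super().__init__()
        self.stage0 = make_stage(hidden, layers_per_stage).to("cuda:0")  # first half
        self.stage1 = make_stage(hidden, layers_per_stage).to("cuda:1")  # second half

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage0(x.to("cuda:0"))
        x = self.stage1(x.to("cuda:1"))  # activations cross the GPU boundary here
        return x

if torch.cuda.device_count() >= 2:
    model = TwoStageModel()
    out = model(torch.randn(8, 4096))
    print(out.shape, out.device)  # torch.Size([8, 4096]) cuda:1
```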
Here’s a listing of the different GPU instance types, with a column for GPU Memory (GiB), collected on one page to facilitate instance comparisons.
EC2 G3 instance details
| Name | GPUs | vCPU | Memory (GiB) | GPU Memory (GiB) | Price/hr* (Linux) | Price/hr* (Windows) | 1-yr Reserved Instance Effective Hourly* (Linux) | 3-yr Reserved Instance Effective Hourly* (Linux) |
|---|---|---|---|---|---|---|---|---|
| g3s.xlarge | 1 | 4 | 30.5 | 8 | $0.75 | $0.93 | $0.525 | $0.405 |
| g3.4xlarge | 1 | 16 | 122 | 8 | $1.14 | $1.876 | $0.741 | $0.538 |
| g3.8xlarge | 2 | 32 | 244 | 16 | $2.28 | $3.752 | $1.482 | $1.076 |
| g3.16xlarge | 4 | 64 | 488 | 32 | $4.56 | $7.504 | $2.964 | $2.152 |
EC2 G4 instance details
G4dn
| | Instance Size | GPU | vCPUs | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* (Linux) | 3-yr Reserved Instance Effective Hourly* (Linux) |
|---|---|---|---|---|---|---|---|---|---|---|
| Single GPU VMs | g4dn.xlarge | 1 | 4 | 16 | 1 x 125 NVMe SSD | Up to 25 | Up to 3.5 | $0.526 | $0.316 | $0.210 |
| | g4dn.2xlarge | 1 | 8 | 32 | 1 x 225 NVMe SSD | Up to 25 | Up to 3.5 | $0.752 | $0.452 | $0.300 |
| | g4dn.4xlarge | 1 | 16 | 64 | 1 x 225 NVMe SSD | Up to 25 | 4.75 | $1.204 | $0.722 | $0.482 |
| | g4dn.8xlarge | 1 | 32 | 128 | 1 x 900 NVMe SSD | 50 | 9.5 | $2.176 | $1.306 | $0.870 |
| | g4dn.16xlarge | 1 | 64 | 256 | 1 x 900 NVMe SSD | 50 | 9.5 | $4.352 | $2.612 | $1.740 |
| Multi GPU VMs | g4dn.12xlarge | 4 | 48 | 192 | 1 x 900 NVMe SSD | 50 | 9.5 | $3.912 | $2.348 | $1.564 |
| | g4dn.metal | 8 | 96 | 384 | 2 x 900 NVMe SSD | 100 | 19 | $7.824 | $4.694 | $3.130 |
G4ad
| | Instance Size | GPU | vCPUs | Memory (GiB) | Instance Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* (Linux) | 3-yr Reserved Instance Effective Hourly* (Linux) |
|---|---|---|---|---|---|---|---|---|---|---|
| Single GPU VMs | g4ad.xlarge | 1 | 4 | 16 | 1 x 150 NVMe SSD | Up to 10 | Up to 3 | $0.379 | $0.227 | $0.178 |
| | g4ad.2xlarge | 1 | 8 | 32 | 1 x 300 NVMe SSD | Up to 10 | Up to 3 | $0.541 | $0.325 | $0.254 |
| | g4ad.4xlarge | 1 | 16 | 64 | 1 x 600 NVMe SSD | Up to 10 | Up to 3 | $0.867 | $0.520 | $0.405 |
| Multi GPU VMs | g4ad.8xlarge | 2 | 32 | 128 | 1 x 1200 NVMe SSD | 15 | 3 | $1.734 | $1.040 | $0.810 |
| | g4ad.16xlarge | 4 | 64 | 256 | 1 x 2400 NVMe SSD | 25 | 6 | $3.468 | $2.081 | $1.619 |
EC2 G5 instance details
| | Instance Size | GPU | GPU Memory (GiB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr ISP Effective Hourly (Linux) | 3-yr ISP Effective Hourly (Linux) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Single GPU VMs | g5.xlarge | 1 | 24 | 4 | 16 | 1×250 | Up to 10 | Up to 3.5 | $1.006 | $0.604 | $0.402 |
| | g5.2xlarge | 1 | 24 | 8 | 32 | 1×450 | Up to 10 | Up to 3.5 | $1.212 | $0.727 | $0.485 |
| | g5.4xlarge | 1 | 24 | 16 | 64 | 1×600 | Up to 25 | 8 | $1.624 | $0.974 | $0.650 |
| | g5.8xlarge | 1 | 24 | 32 | 128 | 1×900 | 25 | 16 | $2.448 | $1.469 | $0.979 |
| | g5.16xlarge | 1 | 24 | 64 | 256 | 1×1900 | 25 | 16 | $4.096 | $2.458 | $1.638 |
| Multi GPU VMs | g5.12xlarge | 4 | 96 | 48 | 192 | 1×3800 | 40 | 16 | $5.672 | $3.403 | $2.269 |
| | g5.24xlarge | 4 | 96 | 96 | 384 | 1×3800 | 50 | 19 | $8.144 | $4.886 | $3.258 |
| | g5.48xlarge | 8 | 192 | 192 | 768 | 2×3800 | 100 | 19 | $16.288 | $9.773 | $6.515 |
EC2 G6 instance details
| | Instance Size | GPU | GPU Memory (GB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price/hr* | 1-yr ISP Effective Hourly (Linux) | 3-yr ISP Effective Hourly (Linux) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Single GPU VMs | g6.xlarge | 1 | 24 | 4 | 16 | 1×250 | Up to 10 | Up to 5 | $0.805 | $0.499 | $0.342 |
| | g6.2xlarge | 1 | 24 | 8 | 32 | 1×450 | Up to 10 | Up to 5 | $0.978 | $0.606 | $0.416 |
| | g6.4xlarge | 1 | 24 | 16 | 64 | 1×600 | Up to 25 | 8 | $1.323 | $0.820 | $0.562 |
| | g6.8xlarge | 1 | 24 | 32 | 128 | 2×450 | 25 | 16 | $2.014 | $1.249 | $0.856 |
| | g6.16xlarge | 1 | 24 | 64 | 256 | 2×940 | 25 | 20 | $3.397 | $2.106 | $1.443 |
| Gr6 instances with 1:8 vCPU:RAM ratio | gr6.4xlarge | 1 | 24 | 16 | 128 | 1×600 | Up to 25 | 8 | $1.539 | $0.954 | $0.654 |
| | gr6.8xlarge | 1 | 24 | 32 | 256 | 2×450 | 25 | 16 | $2.446 | $1.517 | $1.040 |
| Multi GPU VMs | g6.12xlarge | 4 | 96 | 48 | 192 | 4×940 | 40 | 20 | $4.602 | $2.853 | $1.955 |
| | g6.24xlarge | 4 | 96 | 96 | 384 | 4×940 | 50 | 30 | $6.675 | $4.139 | $2.837 |
| | g6.48xlarge | 8 | 192 | 192 | 768 | 8×940 | 100 | 60 | $13.35 | $8.277 | $5.674 |
EC2 G6e instance details
| Instance Size | GPU | GPU Memory (GiB) | vCPUs | Memory (GiB) | Storage (GB) | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|
| g6e.xlarge | 1 | 48 | 4 | 32 | 250 | Up to 20 | Up to 5 |
| g6e.2xlarge | 1 | 48 | 8 | 64 | 450 | Up to 20 | Up to 5 |
| g6e.4xlarge | 1 | 48 | 16 | 128 | 600 | 20 | 8 |
| g6e.8xlarge | 1 | 48 | 32 | 256 | 900 | 25 | 16 |
| g6e.16xlarge | 1 | 48 | 64 | 512 | 1900 | 35 | 20 |
| g6e.12xlarge | 4 | 192 | 48 | 384 | 3800 | 100 | 20 |
| g6e.24xlarge | 4 | 192 | 96 | 768 | 3800 | 200 | 30 |
| g6e.48xlarge | 8 | 384 | 192 | 1536 | 7600 | 400 | 60 |
EC2 P3 instance details
| Instance Size | GPUs – Tesla V100 | GPU Peer to Peer | GPU Memory (GB) | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* | 3-yr Reserved Instance Effective Hourly* |
|---|---|---|---|---|---|---|---|---|---|---|
| p3.2xlarge | 1 | N/A | 16 | 8 | 61 | Up to 10 Gbps | 1.5 Gbps | $3.06 | $1.99 | $1.05 |
| p3.8xlarge | 4 | NVLink | 64 | 32 | 244 | 10 Gbps | 7 Gbps | $12.24 | $7.96 | $4.19 |
| p3.16xlarge | 8 | NVLink | 128 | 64 | 488 | 25 Gbps | 14 Gbps | $24.48 | $15.91 | $8.39 |
| p3dn.24xlarge | 8 | NVLink | 256 | 96 | 768 | 100 Gbps | 19 Gbps | $31.218 | $18.30 | $9.64 |
EC2 P4 instance details
| Instance Size | vCPUs | Instance Memory (GiB) | GPU – A100 | GPU memory | Network Bandwidth (Gbps) | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (GB) | EBS Bandwidth (Gbps) | On-demand Price/hr | 1-yr Reserved Instance Effective Hourly * | 3-yr Reserved Instance Effective Hourly * |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p4d.24xlarge | 96 | 1152 | 8 | 320 GB HBM2 | 400 ENA and EFA | Yes | 600 GB/s NVSwitch | 8 x 1000 NVMe SSD | 19 | $32.77 | $19.22 | $11.57 |
| p4de.24xlarge (preview) | 96 | 1152 | 8 | 640 GB HBM2e | 400 ENA and EFA | Yes | 600 GB/s NVSwitch | 8 x 1000 NVMe SSD | 19 | $40.96 | $24.01 | $14.46 |
EC2 P5 instance details
| Instance Size | vCPUs | Instance Memory (TiB) | GPU – H100 | GPU Memory | Network Bandwidth | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (TB) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|---|---|
| p5.48xlarge | 192 | 2 | 8 | 640 GB HBM3 | 3200 Gbps EFAv2 | Yes | 900 GB/s NVSwitch | 8 x 3.84 NVMe SSD | 80 |
EC2 P5e instance details
| Instance Size | vCPUs | Instance Memory (TiB) | GPU | GPU Memory | Network Bandwidth (Gbps) | GPUDirect RDMA | GPU Peer to Peer | Instance Storage (TB) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|---|---|
| p5e.48xlarge | 192 | 2 | 8 x NVIDIA H200 | 1128 GB HBM3e | 3200 Gbps EFA | Yes | 900 GB/s NVSwitch | 8 x 3.84 NVMe SSD | 80 |
Relevant links
- P5e and P5en announcement (Sep ’24 update): https://aws.amazon.com/blogs/machine-learning/amazon-ec2-p5e-instances-are-generally-available/
- List of NVIDIA GPUs: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units
- Use of Triton and NIM to make use of GPU memory across multiple GPUs on an instance:
  - https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow
  - https://aws.amazon.com/blogs/hpc/deploying-generative-ai-applications-with-nvidia-nims-on-amazon-eks
- FP4 and 4-bit integer quantization, and QLoRA
Note: Performance is not just about GPU memory; network bandwidth also matters, since it is needed to load the large models, especially on a platform serving multiple models.
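For a sense of scale, the time to pull a model’s weights over the network can be estimated from the model size and the instance’s network bandwidth. The figures below are back-of-the-envelope and assume the network link is the bottleneck (storage and protocol overhead are ignored).
```python
# Back-of-the-envelope model load time over the network, assuming the
# network link is the bottleneck (ignores storage, protocol and CPU overhead).
def load_time_seconds(model_size_gb: float, bandwidth_gbps: float) -> float:
    return model_size_gb * 8 / bandwidth_gbps  # GB -> gigabits, then / Gbps

# A 70B-parameter model in fp16 is roughly 140 GB of weights.
for bw in (10, 25, 100, 400):
    print(f"{bw:>4} Gbps: ~{load_time_seconds(140, bw):.0f} s")
# ~112 s at 10 Gbps, ~45 s at 25 Gbps, ~11 s at 100 Gbps, ~3 s at 400 Gbps
```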
When comparing the importance of high memory bandwidth between training and inference for Large Language Models (LLMs), it is generally more critical for training. Here’s why:
1. Training LLMs
- Data Movement: Training LLMs involves frequent data movement between the GPU memory and the processing units. Each training iteration requires loading large batches of data, performing extensive matrix multiplications, and updating weights, all of which are memory-intensive operations.
- Backward Pass: During the training phase, the backward pass (gradient computation and backpropagation) is highly memory bandwidth-intensive. The gradients of each layer are computed and propagated back through the network, requiring significant memory access.
- Parameter Updates: High memory bandwidth is essential to handle the large volume of data being read and written during the parameter updates across multiple layers, especially in very deep models.
- Larger Models and Datasets: Training large models like GPT-3 or GPT-4 involves massive datasets and billions (or even hundreds of billions) of parameters, leading to a substantial demand for memory bandwidth.
2. Inferencing of LLMs:
- Data Movement: During inference, the primary task is to process input data and generate outputs, which involves reading the model parameters and performing computations. While this still requires good memory bandwidth, the demands are generally lower compared to training (a rough bandwidth estimate follows the list below).
- No Backpropagation: Inference does not involve the backward pass or parameter updates, significantly reducing the need for continuous memory writes. The absence of gradient computations and updates reduces the overall memory bandwidth requirements.
- Smaller Batch Sizes: Inference typically operates on smaller batch sizes compared to training, further reducing the demand for memory bandwidth.
- Optimizations: Techniques such as model quantization and optimized inference runtimes (like TensorRT) can reduce the memory bandwidth required during inference by optimizing how data is accessed and processed.
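For intuition on the inference side, autoregressive decoding has to read (roughly) all of the model’s weights from GPU memory for every generated token, so memory bandwidth caps single-stream token throughput. The sketch below is a rough upper-bound estimate under that assumption, not a benchmark; the bandwidth tiers are illustrative rather than tied to specific GPUs.
```python
# Rough upper bound on single-stream decode speed: each generated token
# requires streaming (roughly) all model weights from GPU memory, so
#   tokens/sec <= memory_bandwidth / model_size_in_bytes
# Batching, KV-cache reuse and kernel fusion change the picture in practice,
# so treat this as intuition, not a benchmark.
def max_tokens_per_sec(params_billions: float, bytes_per_param: float,
                       mem_bandwidth_gb_s: float) -> float:
    model_gb = params_billions * bytes_per_param
    return mem_bandwidth_gb_s / model_gb

# 13B model in fp16 (~26 GB of weights) at a few illustrative bandwidth tiers
for label, bw in [("~300 GB/s", 300), ("~900 GB/s", 900), ("~3000 GB/s", 3000)]:
    print(f"{label}: <= {max_tokens_per_sec(13, 2, bw):.0f} tokens/s per stream")
```
This is also why quantization (fewer bytes per parameter) helps inference throughput as well as memory footprint.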

