Author: Ruchir Tewari

vLLM project – overview, comparisons, PagedAttention mechanism

September 29, 2024 · February 10, 2025

vLLM is an open-source project designed to make serving Large Language Models (LLMs) more efficient and scalable. Developed by researchers at UC Berkeley, vLLM improves LLM inference performance by optimizing memory management and execution, which reduces latency and increases throughput and makes it a valuable tool for deploying these models in various applications. It supports multiple LLM model types, multiple hardware architectures, and multiple optimization techniques. It is described in the paper Efficient Memory Management for Large Language Model Serving with PagedAttention.

vLLM achieves its improvements through

dynamic batching,
efficient memory usage, and
parallel execution strategies.

These features allow it to handle multiple requests simultaneously without sacrificing speed or accuracy.

By making LLMs more accessible and efficient, vLLM helps lower the barriers to using advanced AI models, facilitating broader adoption and innovation in the field of natural language processing. For more detailed information or to contribute to the project, you can explore its repository on GitHub (vllm-project/vllm).

vLLM, NVIDIA Triton Inference Server, and NVIDIA NeMo (not to be confused with NVIDIA NIM, which is NVIDIA’s separate set of inference microservices) are all designed to improve the development, deployment, or performance of machine learning models, but they have different focuses and functionalities. Here’s a comparison of each:

vLLM

Purpose: Optimizes the serving of Large Language Models (LLMs) with a focus on improving inference efficiency, particularly regarding memory management and execution.
Features: Offers dynamic batching, efficient memory usage, and parallel execution strategies specifically for LLMs, reducing latency and increasing throughput.
Use Cases: Best suited for applications requiring fast, efficient LLM inference, such as AI-driven conversational agents.
How it reduces memory waste and improves utilization with PagedAttention – https://blog.runpod.io/introduction-to-vllm-and-how-to-run-vllm-on-runpod-serverless/

NVIDIA Triton Inference Server

Purpose: A scalable and flexible platform for serving different types of machine learning models across a variety of frameworks and hardware architectures.
Features: Supports multiple model frameworks (e.g., TensorFlow, PyTorch, ONNX), dynamic batching, model versioning, and provides both HTTP/REST and gRPC endpoints for inference requests. It is designed to maximize GPU utilization and streamline inference workflows.
Use Cases: Ideal for deploying diverse AI models in production environments, allowing for efficient inference at scale across CPUs and GPUs.

NVIDIA NeMo

Purpose: A toolkit for building, training, and fine-tuning state-of-the-art conversational AI models, including those for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS).
Features: Provides pre-trained models, model architectures, and training scripts that can be customized and extended for specific tasks. NeMo is designed to facilitate the development of AI models with high accuracy and efficiency.
Use Cases: Suitable for developers and researchers focused on building and customizing conversational AI applications, offering extensive support for research and development in speech and language domains.

Comparison summary

Optimization Focus: vLLM is specialized for LLM inference optimization, NVIDIA Triton is a general-purpose inference server supporting various models and frameworks, and NVIDIA NeMo is focused on developing and customizing conversational AI models.
Hardware and Framework Support: Triton supports a wide range of frameworks and hardware, optimizing inference across diverse environments. NeMo, while capable of leveraging NVIDIA’s hardware optimizations, is more focused on the model training and customization aspect, particularly for conversational AI.
Target Audience: vLLM targets developers needing efficient LLM deployment; Triton appeals to teams deploying a variety of models in scalable production settings; NeMo is aimed at researchers and developers building state-of-the-art conversational systems.

Details of vLLM PagedAttention.

What Are Keys and Values in PagedAttention?

In the context of transformer-based Large Language Models (LLMs), keys (K) and values (V) are components of the attention mechanism used during inference.

Keys (K): Represent encoded representations of previous tokens, used to determine how much attention each token should pay to previous tokens.
Values (V): Contain the actual information used to generate the next token, weighted based on attention scores.

PagedAttention manages the key-value (KV) cache efficiently. The cache stores the key and value vectors of past tokens so the model doesn’t have to recompute them at every step, drastically speeding up inference.

Concrete Example: Key-Value Pairs in Action

Let’s take a simple example where an LLM is generating text based on a prompt.

Example Prompt:

User: "The capital of France is"

Tokenized Version (Using Byte-Pair Encoding or SentencePiece):

["The", "capital", "of", "France", "is"]

Each token gets embedded into a high-dimensional space (e.g., 4096 dimensions for LLaMA-2-7B; LLaMA-2-70B uses 8192). Let’s assume we use 4096-dimension embeddings for simplicity.

Step-by-Step Key-Value Storage

The model encodes each token and stores:
- Key (K): A vector that helps determine how relevant this token is in future attention computations.
- Value (V): The actual contextual representation of the token.

Token	Key (K) (Simplified)	Value (V) (Simplified)
“The”	`[0.1, 0.2, -0.3, ...]`	`[0.5, 0.4, -0.1, ...]`
“capital”	`[0.2, 0.3, 0.1, ...]`	`[0.6, 0.2, -0.3, ...]`
“of”	`[-0.1, 0.2, 0.7, ...]`	`[0.2, 0.1, 0.9, ...]`
“France”	`[0.5, -0.2, 0.1, ...]`	`[0.7, 0.3, -0.2, ...]`
“is”	`[0.3, 0.1, 0.4, ...]`	`[0.8, 0.2, -0.5, ...]`

When generating the next token (“Paris”), the model:
- Computes attention scores between the query (Q) of the latest position (“is”) and the keys (K) of all tokens so far, using a scaled dot product.
- Uses the resulting softmax weights to form a weighted sum of the values (V), which becomes the new representation from which “Paris” is predicted.
Instead of recomputing K and V for every earlier token at each step, PagedAttention retrieves the precomputed (K, V) vectors from memory pages.
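The attention step described above can be sketched in a few lines of plain Python. This is a minimal, single-head illustration and not vLLM's implementation: the vectors are the first three components of the rows in the table above, and the query vector and the K/V entries appended for “Paris” are made-up values chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# KV cache holding the first three components of each row in the table above.
cached_keys = [[0.1, 0.2, -0.3], [0.2, 0.3, 0.1], [-0.1, 0.2, 0.7],
               [0.5, -0.2, 0.1], [0.3, 0.1, 0.4]]
cached_values = [[0.5, 0.4, -0.1], [0.6, 0.2, -0.3], [0.2, 0.1, 0.9],
                 [0.7, 0.3, -0.2], [0.8, 0.2, -0.5]]

# Hypothetical query vector for the latest position ("is").
query = [0.4, 0.0, 0.2]
weights, context = attend(query, cached_keys, cached_values)

# Once a token is generated, only its own K and V are appended to the cache;
# the entries for earlier tokens are reused as-is on the next step.
cached_keys.append([0.2, 0.4, -0.1])   # hypothetical K for "Paris"
cached_values.append([0.3, 0.6, 0.1])  # hypothetical V for "Paris"
```

The point of the cache is visible in the last two lines: each decoding step adds one row per layer instead of recomputing all earlier rows.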

How PagedAttention Optimizes Key-Value Caching

Without PagedAttention: Each request stores its KV pairs in one long, contiguous memory buffer, typically reserved up front for the maximum possible sequence length. If a request finishes early, the unused part of that reservation is wasted.
With PagedAttention: KV pairs are stored in small pages (e.g., chunks of 16 tokens) that are allocated on demand and need not be contiguous, allowing efficient reuse and minimizing fragmentation.
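The paging idea can be illustrated with a toy allocator. This is only a sketch of the bookkeeping, assuming 16-token pages as in the example above; the class name `PagedKVCache` and its methods are hypothetical and do not correspond to vLLM's internal API.

```python
BLOCK_SIZE = 16  # tokens per page, matching the example above

class PagedKVCache:
    """Toy bookkeeping for paged KV storage (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> number of tokens stored so far

    def append_token(self, request_id):
        """Reserve a slot for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1
        # Return the physical block and the offset of the token inside it.
        return table[count // BLOCK_SIZE], count % BLOCK_SIZE

    def free(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("req-A")  # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("req-B")  # 5 tokens occupy 1 block
blocks_in_use = 8 - len(cache.free_blocks)
cache.free("req-A")              # req-A's blocks become reusable immediately
```

At most one partially filled block per request is left unused, which is where the reduction in wasted memory comes from.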

AI Risks Repository from MIT

August 19, 2024 · November 27, 2024

On the topic of governance of AI, here’s a comprehensive listing of AI Risks from MIT with over 700 risks in 7 domains, and extracted from 43 existing frameworks.

https://www.csail.mit.edu/news/global-ai-adoption-outpacing-risk-understanding-warns-mit-csail

https://airisk.mit.edu/

https://sloanreview.mit.edu/article/ai-related-risks-test-the-limits-of-organizational-risk-management/

Statement: Organizations are sufficiently expanding risk management capabilities to address AI-related risks.

Sizing an LLM for GPU memory

July 7, 2024 · October 7, 2024

When choosing the EC2 instance for a Large Language Model, one of the first constraints is whether the model will fit in the GPU memory of an instance.

Given a choice of a model, the decisions roughly follow this path –

Model -> Training/Inferencing -> Technique (choice of optimization) -> Memory requirement -> Instance requirement -> Instance availability -> smaller instance or more optimization or distributed training.

Some extreme optimizations are possible, such as 4-bit quantization (as used in QLoRA) or loading one layer into GPU memory at a time during inference. See the AirLLM blog on fitting a layer in memory at a time: https://huggingface.co/blog/lyogavin/airllm . However, many use cases cannot accept the loss of accuracy or speed that such techniques can bring.

Distributed training, which splits the model across several smaller instances, is another possibility. A discussion is here – https://siboehm.com/articles/22/pipeline-parallel-training
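The "memory requirement" step in the path above can be roughed out with simple arithmetic. The sketch below covers inference only and makes several assumptions: weights dominate, the KV cache is added on top, and a flat 10% overhead (an arbitrary placeholder) stands in for activations and runtime buffers. The model shape used in the example (32 layers, 32 KV heads, head dimension 128) is that of a 7B-parameter LLaMA-2-style model; training needs considerably more memory for gradients and optimizer state and is not modeled here.

```python
GIB = 1024 ** 3

def weight_memory_gib(num_params, bytes_per_param):
    """Memory for the model weights alone."""
    return num_params * bytes_per_param / GIB

def kv_cache_gib(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Memory for the KV cache; the factor 2 accounts for keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value / GIB

def fits(num_params, bytes_per_param, gpu_memory_gib, kv_gib=0.0, overhead_fraction=0.1):
    """Return (fits?, estimated GiB needed). overhead_fraction is a rough placeholder."""
    needed = (weight_memory_gib(num_params, bytes_per_param) + kv_gib) * (1 + overhead_fraction)
    return needed <= gpu_memory_gib, needed

# 7B parameters, 4096 cached tokens in total, 16-bit weights and cache.
kv = kv_cache_gib(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=4096)
fits_24gib, needed_fp16 = fits(7e9, 2, gpu_memory_gib=24, kv_gib=kv)    # e.g. a 24 GiB GPU
fits_16gib, _ = fits(7e9, 2, gpu_memory_gib=16, kv_gib=kv)              # e.g. a 16 GiB GPU
fits_16gib_4bit, needed_4bit = fits(7e9, 0.5, gpu_memory_gib=16, kv_gib=kv)
```

Under these assumptions the 16-bit model needs a 24 GiB GPU, while 4-bit weights bring it within reach of a 16 GiB GPU, which is the trade-off the decision path describes.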

Here’s a listing of different GPU instance types with a column for GPU Memory (GiB) on one page to facilitate instance comparisons.

EC2 G3 Instance Details

Name	GPUs	vCPU	Memory (GiB)	GPU Memory (GiB)	Price/hr* (Linux)	Price/hr* (Windows)	1-yr Reserved Instance Effective Hourly* (Linux)	3-yr Reserved Instance Effective Hourly* (Linux)
g3s.xlarge	1	4	30.5	8	$0.75	$0.93	$0.525	$0.405
g3.4xlarge	1	16	122	8	$1.14	$1.876	$0.741	$0.538
g3.8xlarge	2	32	244	16	$2.28	$3.752	$1.482	$1.076
g3.16xlarge	4	64	488	32	$4.56	$7.504	$2.964	$2.152

EC2 G4 Instance details

	Instance Size	GPU	vCPUs	Memory (GiB)	Instance Storage (GB)	Network Bandwidth (Gbps)	EBS Bandwidth (Gbps)	On-Demand Price/hr*	1-yr Reserved Instance Effective Hourly* (Linux)	3-yr Reserved Instance Effective Hourly* (Linux)
G4dn
Single GPU VMs	g4dn.xlarge	1	4	16	1 x 125 NVMe SSD	Up to 25	Up to 3.5	$0.526	$0.316	$0.210
	g4dn.2xlarge	1	8	32	1 x 225 NVMe SSD	Up to 25	Up to 3.5	$0.752	$0.452	$0.300
	g4dn.4xlarge	1	16	64	1 x 225 NVMe SSD	Up to 25	4.75	$1.204	$0.722	$0.482
	g4dn.8xlarge	1	32	128	1 x 900 NVMe SSD	50	9.5	$2.176	$1.306	$0.870
	g4dn.16xlarge	1	64	256	1 x 900 NVMe SSD	50	9.5	$4.352	$2.612	$1.740

Multi GPU VMs	g4dn.12xlarge	4	48	192	1 x 900 NVMe SSD	50	9.5	$3.912	$2.348	$1.564
Multi GPU VMs	g4dn.metal	8	96	384	2 x 900 NVMe SSD	100	19	$7.824	$4.694	$3.130
G4ad
Single GPU VMs	g4ad.xlarge	1	4	16	1 x 150 NVMe SSD	Up to 10	Up to 3	$0.379	$0.227	$0.178
	g4ad.2xlarge	1	8	32	1 x 300 NVMe SSD	Up to 10	Up to 3	$0.541	$0.325	$0.254
	g4ad.4xlarge	1	16	64	1 x 600 NVMe SSD	Up to 10	Up to 3	$0.867	$0.520	$0.405

Multi GPU VMs	g4ad.8xlarge	2	32	128	1 x 1200 NVMe SSD	15	3	$1.734	$1.040	$0.810
Multi GPU VMs	g4ad.16xlarge	4	64	256	1 x 2400 NVMe SSD	25	6	$3.468	$2.081	$1.619

EC2 G5 instance details

	Instance Size	GPU	GPU Memory (GiB)	vCPUs	Memory (GiB)	Storage (GB)	Network Bandwidth (Gbps)	EBS Bandwidth (Gbps)	On Demand Price/hr*	1-yr ISP Effective Hourly (Linux)	3-yr ISP Effective Hourly (Linux)
Single GPU VMs	g5.xlarge	1	24	4	16	1×250	Up to 10	Up to 3.5	$1.006	$0.604	$0.402
	g5.2xlarge	1	24	8	32	1×450	Up to 10	Up to 3.5	$1.212	$0.727	$0.485
	g5.4xlarge	1	24	16	64	1×600	Up to 25	8	$1.624	$0.974	$0.650
	g5.8xlarge	1	24	32	128	1×900	25	16	$2.448	$1.469	$0.979
	g5.16xlarge	1	24	64	256	1×1900	25	16	$4.096	$2.458	$1.638

Multi GPU VMs	g5.12xlarge	4	96	48	192	1×3800	40	16	$5.672	$3.403	$2.269
	g5.24xlarge	4	96	96	384	1×3800	50	19	$8.144	$4.886	$3.258
	g5.48xlarge	8	192	192	768	2×3800	100	19	$16.288	$9.773	$6.515

EC2 G6 instance details

	Instance Size	GPU	GPU Memory (GB)	vCPUs	Memory (GiB)	Storage (GB)	Network Bandwidth (Gbps)	EBS Bandwidth (Gbps)	On Demand Price/hr*	1-yr ISP Effective Hourly (Linux)	3-yr ISP Effective Hourly (Linux)
Single GPU VMs	g6.xlarge	1	24	4	16	1×250	Up to 10	Up to 5	$0.805	$0.499	$0.342
	g6.2xlarge	1	24	8	32	1×450	Up to 10	Up to 5	$0.978	$0.606	$0.416
	g6.4xlarge	1	24	16	64	1×600	Up to 25	8	$1.323	$0.820	$0.562
	g6.8xlarge	1	24	32	128	2×450	25	16	$2.014	$1.249	$0.856
	g6.16xlarge	1	24	64	256	2×940	25	20	$3.397	$2.106	$1.443
	Gr6 instances with 1:8 vCPU:RAM ratio
	gr6.4xlarge	1	24	16	128	1×600	Up to 25	8	$1.539	$0.954	$0.654
	gr6.8xlarge	1	24	32	256	2×450	25	16	$2.446	$1.517	$1.040

Multi GPU VMs	g6.12xlarge	4	96	48	192	4×940	40	20	$4.602	$2.853	$1.955
	g6.24xlarge	4	96	96	384	4×940	50	30	$6.675	$4.139	$2.837
	g6.48xlarge	8	192	192	768	8×940	100	60	$13.35	$8.277	$5.674

EC2 G6e instances

Instance Size	GPU	GPU Memory (GiB)	vCPUs	Memory(GiB)	Storage (GB)	Network Bandwidth (Gbps)	EBS Bandwidth (Gbps)
g6e.xlarge	1	48	4	32	250	Up to 20	Up to 5
g6e.2xlarge	1	48	8	64	450	Up to 20	Up to 5
g6e.4xlarge	1	48	16	128	600	20	8
g6e.8xlarge	1	48	32	256	900	25	16
g6e.16xlarge	1	48	64	512	1900	35	20
g6e.12xlarge	4	192	48	384	3800	100	20
g6e.24xlarge	4	192	96	768	3800	200	30
g6e.48xlarge	8	384	192	1536	7600	400	60

EC2 P3 instance details

Instance Size	GPUs – Tesla V100	GPU Peer to Peer	GPU Memory (GB)	vCPUs	Memory (GB)	Network Bandwidth	EBS Bandwidth	On-Demand Price/hr*	1-yr Reserved Instance Effective Hourly*	3-yr Reserved Instance Effective Hourly*
p3.2xlarge	1	N/A	16	8	61	Up to 10 Gbps	1.5 Gbps	$3.06	$1.99	$1.05
p3.8xlarge	4	NVLink	64	32	244	10 Gbps	7 Gbps	$12.24	$7.96	$4.19
p3.16xlarge	8	NVLink	128	64	488	25 Gbps	14 Gbps	$24.48	$15.91	$8.39
p3dn.24xlarge	8	NVLink	256	96	768	100 Gbps	19 Gbps	$31.218	$18.30	$9.64

EC2 P4 instance details

Instance Size	vCPUs	Instance Memory (GiB)	GPU – A100	GPU memory	Network Bandwidth (Gbps)	GPUDirect RDMA	GPU Peer to Peer	Instance Storage (GB)	EBS Bandwidth (Gbps)	On-demand Price/hr	1-yr Reserved Instance Effective Hourly *	3-yr Reserved Instance Effective Hourly *
p4d.24xlarge	96	1152	8	320 GB HBM2	400 ENA and EFA	Yes	600 GB/s NVSwitch	8 x 1000 NVMe SSD	19	$32.77	$19.22	$11.57
p4de.24xlarge (preview)	96	1152	8	640 GB HBM2e	400 ENA and EFA	Yes	600 GB/s NVSwitch	8 x 1000 NVMe SSD	19	$40.96	$24.01	$14.46

EC2 P5 instance details

Instance Size	vCPU	Instance Memory (TiB)	GPU – H100	GPU Memory	Network Bandwidth	GPUDirectRDMA	GPU Peer to Peer	Instance Storage (TB)	EBS Bandwidth (Gbps)
p5.48xlarge	192	2	8	640 GB HBM3	3200 Gbps EFAv2	Yes	900 GB/s NVSwitch	8 x 3.84 NVMe SSD	80

EC2 P5e instance details

Instance Size	vCPUs	Instance Memory (TiB)	GPU	GPU memory	Network Bandwidth (Gbps)	GPUDirect RDMA	GPU Peer to Peer	Instance Storage (TB)	EBS Bandwidth (Gbps)
p5e.48xlarge	192	2	8 x NVIDIA H200	1128 GB HBM3e	3200 Gbps EFA	Yes	900 GB/s NVSwitch	8 x 3.84 NVMe SSD	80

Relevant links

P5e and P5en announcement (update Sep’24). https://aws.amazon.com/blogs/machine-learning/amazon-ec2-p5e-instances-are-generally-available/

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units

Use of Triton and NIM to make use of GPU memory across multiple GPUs on an instance –

https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow

https://aws.amazon.com/blogs/hpc/deploying-generative-ai-applications-with-nvidia-nims-on-amazon-eks

FP4 and four bit integer quantization, and QLoRA

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA at https://huggingface.co/blog/4bit-transformers-bitsandbytes

[2305.14152] Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

Note: Performance is not just about GPU memory but also network bandwidth which is needed to load the large models especially for a platform serving multiple models.

When comparing the importance of high memory bandwidth between training and inference for Large Language Models (LLMs), it is generally more critical for training. Here’s why:

1. Training LLMs

Data Movement: Training LLMs involves frequent data movement between the GPU memory and the processing units. Each training iteration requires loading large batches of data, performing extensive matrix multiplications, and updating weights, all of which are memory-intensive operations.
Backward Pass: During the training phase, the backward pass (gradient computation and backpropagation) is highly memory bandwidth-intensive. The gradients of each layer are computed and propagated back through the network, requiring significant memory access.
Parameter Updates: High memory bandwidth is essential to handle the large volume of data being read and written during the parameter updates across multiple layers, especially in very deep models.
Larger Models and Datasets: Training large models like GPT-3 or GPT-4 involves massive datasets and billions of parameters (175 billion for GPT-3), leading to a substantial demand for memory bandwidth.

2. Inference with LLMs

Data Movement: During inference, the primary task is to process input data and generate outputs, which involves reading the model parameters and performing computations. While this still requires good memory bandwidth, the demands are generally lower compared to training.
No Backpropagation: Inference does not involve the backward pass or parameter updates, significantly reducing the need for continuous memory writes. The absence of gradient computations and updates reduces the overall memory bandwidth requirements.
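Smaller Batch Sizes: Inference typically operates on smaller batch sizes than training, which reduces the total volume of data moved per step. Note, however, that token-by-token generation re-reads the model weights for every generated token, so at small batch sizes decoding speed is often limited by memory bandwidth rather than by compute.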
Optimizations: Techniques such as model quantization and optimized inference runtimes (like TensorRT) can reduce the memory bandwidth required during inference by optimizing how data is accessed and processed.
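A back-of-the-envelope calculation shows how memory bandwidth and quantization interact during inference. The sketch assumes that each decoding step must read every weight from GPU memory at least once and ignores KV cache reads, activations, and compute time, so it gives only a lower bound. The 1 TB/s bandwidth figure is a hypothetical round number, not the specification of any particular GPU.

```python
def min_seconds_per_decode_step(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on one decoding step: time to stream all weights once."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s

BANDWIDTH = 1e12  # hypothetical 1 TB/s of GPU memory bandwidth

fp16_step = min_seconds_per_decode_step(7e9, 2, BANDWIDTH)    # 16-bit weights
int4_step = min_seconds_per_decode_step(7e9, 0.5, BANDWIDTH)  # 4-bit weights

# Upper bound on tokens per second for a single sequence (batch size 1).
fp16_tokens_per_s = 1 / fp16_step
int4_tokens_per_s = 1 / int4_step
```

Under these assumptions a 7-billion-parameter model in 16-bit precision cannot decode faster than roughly 70 tokens per second per sequence, and 4-bit weights raise that bound fourfold because four times fewer bytes are read per step.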

SageMaker Hyperpod for Distributed Model Training

June 2, 2024 · August 21, 2024

Amazon SageMaker HyperPod is a new infrastructure designed specifically for distributed training at scale. It offers a purpose-built, high-performance environment that accelerates the training of large machine learning models by optimizing resource allocation, reducing communication overhead, and providing seamless scaling. HyperPod integrates with SageMaker to simplify complex training workflows, making it easier for users to efficiently train foundation models and other large-scale ML workloads. This innovation supports faster iteration and development of AI models. https://aws.amazon.com/sagemaker/hyperpod , https://aws.amazon.com/blogs/machine-learning/introducing-amazon-sagemaker-hyperpod-to-train-foundation-models-at-scale

Perplexity, a generative AI startup, improved its model training speed by 40% using Amazon SageMaker HyperPod on AWS. By leveraging advanced distributed training capabilities and EC2 instances, Perplexity optimized its model training and inference processes. This allowed the company to efficiently handle over 100,000 queries per hour with low latency and high throughput, enhancing user experiences and enabling rapid iteration in AI development. https://aws.amazon.com/solutions/case-studies/perplexity-case-study

https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-cluster-observability.html

LLM optimization – PEFT, LoRA, QLoRA

May 19, 2024 · August 21, 2024

Large Language Models (LLMs) have transformed natural language processing, but their immense size and computational demands pose significant challenges. Optimizing these models is crucial for efficient deployment, particularly in resource-constrained environments. Below, we explore several optimization techniques, including Parameter-Efficient Fine-Tuning (PEFT), Low-Rank Adaptation (LoRA), and Quantized Low-Rank Adaptation (QLoRA), highlighting their unique benefits and differences.

1. Parameter-Efficient Fine-Tuning (PEFT)

PEFT is designed to reduce the computational burden of fine-tuning large models by updating only a small subset of the model’s parameters, rather than the entire model. This approach allows for significant resource savings while maintaining performance, making it particularly useful for adapting LLMs to new tasks with limited data or compute resources.

Key Features:

Selective Parameter Update: Only a fraction of the model’s parameters are fine-tuned.
Efficiency: Reduces the computational cost and memory footprint during fine-tuning.
Flexibility: Can be applied across various LLM architectures.

2. Low-Rank Adaptation (LoRA)

LoRA is a PEFT technique that freezes the pretrained weights and represents the weight update for selected layers as the product of two small low-rank matrices. Only these low-rank matrices are trained, so fine-tuning needs very few additional parameters and the original model’s architecture and weights are preserved.

Key Features:

Low-Rank Update: Learns the weight change as a product of two low-rank matrices while the original weights stay frozen.
Minimal Overhead: Adds only a small number of trainable parameters.
Performance: Maintains or even enhances model performance on specific tasks.
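A minimal sketch of the LoRA forward pass may help here. It follows the formulation in the LoRA paper, h = W x + (alpha / r) · B A x, where W is the frozen pretrained weight and A (r × d_in) and B (d_out × r) are the trainable low-rank matrices. The tiny matrices and the values of `alpha` and `r` below are illustrative choices only.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Number of parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

# Toy example with d_in = d_out = 3 and rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
A = [[0.1, 0.2, 0.3]]                                     # r x d_in
B_zero = [[0.0], [0.0], [0.0]]                            # d_out x r, zero at start
x = [1.0, 2.0, 3.0]

h_start = lora_forward(W, A, B_zero, x, alpha=2, r=1)  # equals W x while B is zero
full = 4096 * 4096                                     # parameters in one full matrix
lora = trainable_params(4096, 4096, r=8)               # parameters LoRA trains instead
```

Because B is initialized to zero, training starts from exactly the pretrained model's behaviour, and for a 4096 × 4096 layer at rank 8 the trainable parameter count drops from about 16.8 million to 65,536.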

3. Quantized Low-Rank Adaptation (QLoRA)

QLoRA combines quantization and LoRA to maximize memory efficiency. The frozen base model is quantized to 4-bit precision, while the LoRA low-rank matrices are kept in higher precision and trained through the quantized weights. This greatly reduces the memory needed for fine-tuning without a significant loss in accuracy.

Key Features:

Quantization: Stores the frozen base model weights in 4-bit precision to lower memory usage.
Memory Efficiency: Significantly decreases the memory required for fine-tuning.
Scalability: Ideal for large-scale deployments where memory is a critical concern.
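To give a feel for what 4-bit quantization does to a block of weights, here is a deliberately simple sketch using symmetric absmax integer quantization. This is a stand-in technique chosen for clarity; it is not the NF4 data type or the double quantization described in the QLoRA paper, and the weight values are made up.

```python
def quantize_4bit(block):
    """Symmetric absmax quantization of floats to integers in [-7, 7]."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid a zero scale for all-zero blocks
    scale = absmax / 7
    quantized = [max(-7, min(7, round(w / scale))) for w in block]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the 4-bit integers."""
    return [q * scale for q in quantized]

weights = [0.7, -0.3, 0.1, 0.0]          # hypothetical block of weights
q, scale = quantize_4bit(weights)        # 4-bit integers plus one scale per block
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a small integer plus one shared scale per block, so the rounding error per weight is bounded by half the scale; real schemes choose the quantization levels more carefully to reduce that error.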

Contrasting PEFT, LoRA, and QLoRA

Parameter Update Strategy:
- PEFT: The umbrella approach; trains only a small set of parameters, either a subset of existing ones or small added modules.
- LoRA: A PEFT method that adds trainable low-rank matrices and leaves the existing weights frozen.
- QLoRA: LoRA applied on top of a 4-bit quantized, frozen base model for extreme memory efficiency.
Memory and Computational Requirements:
- PEFT: Reduces overall fine-tuning costs but may still require substantial memory, since the full model must be held in memory.
- LoRA: Reduces the memory for gradients and optimizer state by minimizing the number of trained parameters.
- QLoRA: Offers the most memory efficiency by also quantizing the frozen base model weights.
Application Scenarios:
- PEFT: Suitable for fine-tuning in environments with limited compute resources.
- LoRA: Ideal for scenarios requiring efficient fine-tuning with minimal parameter overhead.
- QLoRA: Best for large-scale deployments where memory efficiency is paramount.

Direct Preference Optimization (DPO) vs RLHF/PPO (Reinforcement Learning from Human Feedback, Proximal Policy Optimization)

February 25, 2024 · July 8, 2024

The paper “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” introduces Direct Preference Optimization (DPO), an algorithm for fine-tuning language models to align with human preferences without the need for complex reinforcement learning procedures. It simplifies Reinforcement Learning from Human Feedback (RLHF) by removing the separately trained reward model and the reinforcement learning loop. Human preference data is still required, but it is used directly in a supervised-style training objective.

Implicit Reward Function: DPO does not train an explicit reward model. Instead, the reward is defined implicitly by the policy itself, relative to a frozen reference model, and a classification loss aligns the model outputs with human preferences. Rather than relying on reward signals from an environment, it leverages comparisons or preferences between different trajectories to guide the learning process. The training data consists of pairs of trajectories along with a preference indicating which trajectory is preferred, and this preference data is used to train the policy. The task of predicting preferences can be framed as a binary classification problem: for a given pair of trajectories, the model needs to predict which one is preferred. The classification loss then measures the discrepancy between the predicted and actual preferences. A common choice for this kind of binary classification is the binary cross-entropy loss. The overall training objective in DPO involves minimizing the classification loss across all pairs of trajectories in the dataset, which encourages the policy to produce trajectories that align with the observed preferences.

RLHF and Proximal Policy Optimization: RLHF first trains a reward model on preference data labeled by humans, and then uses PPO to optimize the policy (the language model) against that reward model. These RLHF steps are shown in the diagram below, from the RLHF paper. PPO optimizes the policy to maximize the learned reward through sampled interactions, using a reinforcement learning framework. The policy here is a mapping from states to a probability distribution over actions.

So Direct Preference Optimization (DPO) optimizes the policy directly on human preference data, with the reward function defined implicitly. Here is a high-level overview of the equations used:

Preference Model:
- Let $\theta$ be the parameters of the model.
- Let $\tau_1$ and $\tau_2$ be two trajectories (or outputs) being compared.
- The preference model $P(\tau_1 \succ \tau_2 \mid \theta)$ indicates the probability that humans prefer $\tau_1$ over $\tau_2$.
Logistic Function for Preferences:
- The preference probability is modeled using a logistic function: $P(\tau_1 \succ \tau_2 \mid \theta) = \dfrac{\exp(R(\tau_1 \mid \theta))}{\exp(R(\tau_1 \mid \theta)) + \exp(R(\tau_2 \mid \theta))}$
- $R(\tau \mid \theta)$ is the reward function for trajectory $\tau$.
Loss Function:
- The loss function $L(\theta)$ is defined as the negative log-likelihood of the human preferences: $L(\theta) = -\sum_{(\tau_1, \tau_2) \in D} \log P(\tau_1 \succ \tau_2 \mid \theta)$
- $D$ is the dataset of human preference comparisons, with $\tau_1$ the preferred trajectory in each pair.
Optimization:
- The model parameters $\theta$ are optimized by minimizing the loss function $L(\theta)$.
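The equations above translate directly into code. In the sketch below the reward of a trajectory is taken to be the implicit reward from the DPO paper, R(τ∣θ) = β · (log π_θ(τ) − log π_ref(τ)), where π_ref is a frozen reference model and β is a temperature-like hyperparameter. The log-probabilities in the example are made-up numbers, and β = 0.1 is just a commonly used illustrative value.

```python
import math

def preference_prob(r1, r2):
    """P(t1 preferred over t2) = exp(r1) / (exp(r1) + exp(r2)) = sigmoid(r1 - r2)."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(pairs, beta=0.1):
    """Negative log-likelihood of the preferences, summed over the dataset.

    Each pair is (logp_preferred, logp_rejected, ref_logp_preferred, ref_logp_rejected),
    i.e. sequence log-probabilities under the policy and the frozen reference model.
    """
    total = 0.0
    for lp_w, lp_l, ref_w, ref_l in pairs:
        reward_w = beta * (lp_w - ref_w)  # implicit reward of the preferred output
        reward_l = beta * (lp_l - ref_l)  # implicit reward of the rejected output
        total -= log_sigmoid(reward_w - reward_l)
    return total

# Policy identical to the reference: every preference is a coin flip.
untrained = dpo_loss([(-12.0, -11.0, -12.0, -11.0)])
# Policy that has shifted probability toward the preferred output.
improved = dpo_loss([(-9.0, -14.0, -12.0, -11.0)])
```

No reward model or sampling loop appears anywhere: the loss depends only on log-probabilities that the policy and the reference model assign to the two outputs in each pair.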

GPU kernel functions for deep learning

December 17, 2023 · July 4, 2024

This article attempts to outline GPU kernel functions and how they are supported in TensorFlow, PyTorch, and OpenAI Triton. GPU kernel functions are specialized functions executed in parallel on a GPU, such as an NVIDIA Graphics Processing Unit. These functions play a key role in parallel and accelerated computing, such as the tensor and matrix operations used in deep learning.

GPU kernel functions for operations commonly used in deep learning include:

Element-wise operations: TensorFlow provides GPU kernels for element-wise operations such as addition, subtraction, multiplication, and division, enabling efficient computation on arrays or tensors.
Matrix operations: GPU kernels in TensorFlow optimize matrix operations like matrix multiplication, matrix addition, and matrix transpose, which are fundamental in many deep learning models.
Convolutional operations: TensorFlow implements GPU kernels for convolutional operations, which are essential for tasks like image recognition and computer vision.
Reduction operations: TensorFlow provides GPU kernels for reduction operations like summation, mean, maximum, and minimum, allowing efficient computation over large arrays or tensors.
Activation functions: GPU kernels are implemented for common activation functions used in deep learning, such as ReLU (Rectified Linear Unit), sigmoid, and tanh.
Pooling operations: TensorFlow’s GPU kernels optimize pooling operations like max pooling and average pooling, commonly used in convolutional neural networks (CNNs).
Recurrent operations: TensorFlow provides GPU kernels for recurrent operations like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit), which are widely used in sequence-based models.

TensorFlow optimizes the execution of operations within a computation graph. When operations can be executed on a GPU, TensorFlow translates the high-level operations into CUDA calls that invoke the corresponding GPU kernels.
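To make the notion of a kernel concrete, here is a CPU-only Python emulation of how an element-wise addition kernel is launched over a grid of blocks. It is purely illustrative: the function names `add_kernel` and `launch` are hypothetical, the loops run sequentially here whereas a GPU runs the blocks and threads in parallel, and no framework generates code that looks like this.

```python
def add_kernel(block_id, block_size, x, y, out):
    """Work done by one block: it handles block_size consecutive elements."""
    start = block_id * block_size
    for thread in range(block_size):   # on a GPU these threads run in parallel
        i = start + thread
        if i < len(x):                 # bounds check for the last, partial block
            out[i] = x[i] + y[i]

def launch(kernel, n, block_size, *args):
    """Emulate a kernel launch: one kernel invocation per block of the grid."""
    num_blocks = (n + block_size - 1) // block_size  # ceiling division
    for block_id in range(num_blocks):               # on a GPU: spread across cores
        kernel(block_id, block_size, *args)
    return num_blocks

x = list(range(10))
y = [10 * v for v in x]
out = [0] * len(x)
num_blocks = launch(add_kernel, len(x), 4, x, y, out)
```

The two details worth noticing are the index computation from block id and thread id, and the bounds check that keeps the last block from writing past the end of the array. Both appear in essentially every real element-wise kernel.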

PyTorch is another popular open-source deep learning framework that provides a high-level programming interface for building and training machine learning models.

PyTorch differs from TensorFlow in a few ways:

Dynamic Computational Graph: PyTorch uses a dynamic computational graph approach, whereas TensorFlow 1.x used a static computational graph (TensorFlow 2.x runs eagerly by default and builds graphs through tf.function). This means that in PyTorch, the computational graph is constructed and executed on the fly as the code is executed, allowing for more flexibility and dynamic behavior during model training and inference.
Imperative Programming: PyTorch follows an imperative programming style, which allows users to write code that is more intuitive and resembles standard Python programming. This makes it easier to understand and debug the code, as well as experiment with different model architectures and algorithms.
Autograd: PyTorch’s autograd system allows automatic differentiation, which enables computing gradients for model parameters. This makes it easier to implement and train complex models, as users don’t have to manually compute gradients. TensorFlow also provides automatic differentiation: in graph mode gradients are derived from the static graph, and in TensorFlow 2.x they are recorded with tf.GradientTape.
TorchScript: PyTorch provides a feature called TorchScript, which allows models to be serialized and optimized for deployment in production environments. TorchScript enables efficient execution of PyTorch models on various platforms, including GPUs, CPUs, and mobile devices.

Like TensorFlow, PyTorch implements optimized GPU kernel functions for efficient computation on GPUs.

So while both TensorFlow and PyTorch provide GPU kernel function abstractions, their underlying computational graph models and programming styles differ, bringing their own unique advantages and trade-offs.

OpenAI Triton is an open-source, Python-like programming language and compiler developed by OpenAI for writing custom GPU kernels. It is not a model-serving system and should not be confused with NVIDIA Triton Inference Server. Triton does not depend on TensorFlow: kernels are written as Python functions, compiled just in time to GPU code, and operate on blocks of data rather than on individual threads. The compiler handles many low-level details, such as memory coalescing and shared memory management, that a CUDA programmer would otherwise manage by hand. This approach allows developers to write efficient kernels without deep CUDA expertise. PyTorch 2.x uses Triton to generate GPU kernels in torch.compile.

It’s worth noting that Triton is not limited to CUDA. One alternative to CUDA is ROCm (Radeon Open Compute platform), developed by AMD. ROCm is an open-source GPU computing platform that provides support for AMD GPUs. TensorFlow and PyTorch both have ROCm builds, and Triton includes a backend that targets AMD GPUs through ROCm.

TorchScript for Model Optimization and Model Serving

November 12, 2023 · July 2, 2024

TorchScript is an intermediate representation of a PyTorch model that can be optimized and run in a non-Python environment, making the PyTorch model suitable for deployment. It is part of the PyTorch ecosystem (Intro_to_TorchScript_tutorial.html , TorchScript JIT.html ).

Why is TorchScript needed? Python, while excellent for ML model development (interpreted, REPL, simplicity, integration with a number of ML libraries), also has characteristics that make it less suitable for production model deployments. These characteristics include interpretation overheads, complex dependency management, high memory/CPU overheads, and the lack of easy integration with native technologies such as C++ for high performance and for embedded systems. TorchScript provides tools for optimizations such as operator fusion and static graph analysis, which can improve efficiency and performance during inference. Optimizing the models is crucial for embedded systems with limited resources.

PyTorch introduced eager/dynamic execution, which has the advantage of faster user feedback but the disadvantage of allowing fewer optimizations than static graph approaches such as TensorFlow’s.

A blog on key points to grasp about TorchScript – https://medium.com/@hihuaweizhu/key-points-to-grasp-for-torchscript-beginners-c02cf94aaa50 – makes several good points, including that TorchScript is a statically typed subset of Python.

A discussion of eager mode versus script mode at https://towardsdatascience.com/pytorch-jit-and-torchscript-c2a77bac0fff suggests the benefit of TorchScript is more about dev/production (versus training/inference), with the production version requiring performance optimizations and portability. Quote: “With TorchScript, PyTorch aims to create a unified framework from research to production. TorchScript will take your PyTorch modules as input and convert them into a production-friendly format.”

NVIDIA uses TorchScript to facilitate the deployment and optimization of PyTorch models within its ecosystem. TorchScript models can be compiled with Torch-TensorRT to run on TensorRT, NVIDIA’s inference runtime.

The AWS ML software stack, Neuron, supports tracing to TorchScript. https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-trace.html . https://pytorch.org/docs/master/generated/torch.jit.trace.html#torch.jit.trace . An example of a Neuron SDK trace for PyTorch – https://github.com/aws-neuron/aws-neuron-sdk/issues/371 .

PyTorch/XLA is another project that integrates with the Google XLA compiler to enable running PyTorch models on Google TPUs.

GraphCore produces hardware for deep learning called a GraphCore Intelligence Processing Unit (IPU). The primary software framework provided by GraphCore to execute machine learning models on their IPUs is Poplar. It allows running models from TensorFlow and PyTorch. Poplar optimizes computations for the unique architecture of GraphCore’s IPUs. This includes optimizations for memory bandwidth, parallel processing, and other hardware-specific features.

AlphaFold for protein structure prediction with deep learning – how does it work

September 12, 2023August 23, 2024 · Leave a comment ·

AlphaFold is a deep learning model developed by DeepMind that predicts protein structure. At a high level it works in two steps: first, it builds a representation of the protein’s amino acid sequence; then, it refines this representation to predict the 3D structure of the protein. The model is trained on a large database of known protein structures. The first version of AlphaFold relied on convolutional neural networks (CNNs), while AlphaFold 2 relies mainly on attention mechanisms to incorporate information from multiple parts of the protein sequence during the prediction process. It combines advanced machine learning techniques with protein structure data to make accurate predictions about protein folding.

Attention mechanisms are a key component of AlphaFold and play a crucial role in capturing dependencies between different parts of a protein sequence. To understand attention mechanisms, let’s break it down step by step.

Embedding the Protein Sequence:
AlphaFold starts by embedding the amino acid sequence of a protein into a numerical representation. Each amino acid is represented as a vector, and these vectors are combined to form the input sequence matrix, X ∈ ℝ^(L×D), where L is the length of the sequence and D is the dimensionality of each amino acid vector.
Creating Query, Key, and Value Matrices:
AlphaFold then generates three matrices, Query (Q), Key (K), and Value (V), by linearly transforming the input sequence matrix X. This transformation is performed using learnable weight matrices WQ, WK, and WV. The resulting matrices are Q = XWQ, K = XWK, and V = XWV, each having dimensions of L×D (assuming WQ, WK, and WV are D×D).
Calculating Attention Weights:
The attention mechanism computes the similarity between each query vector and key vector by taking their dot products. This similarity is divided by √D, and a softmax function is applied across all keys for a given query to obtain attention weights, so each row of the weight matrix sums to 1. The attention weights determine how much each key contributes to the final output. Let’s denote the attention weights matrix as A ∈ ℝ^(L×L), where each element A_ij represents the attention weight between the i-th query and j-th key. The attention weights are calculated as follows: A_ij = softmax_j((Q_i ⋅ K_j) / √D). Here, Q_i represents the i-th row of the Query matrix, and K_j represents the j-th row of the Key matrix.
Weighted Sum of Values:
The final step is to compute the weighted sum of the Value matrix using the attention weights. This is done by taking the matrix multiplication of attention weights A and the Value matrix V. The resulting matrix C, representing the context or attended representation, is given by: C = AV The context matrix C has dimensions of L×D, where each row represents a weighted sum of the Value vectors based on the attention weights.
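
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration of generic scaled dot-product self-attention, not AlphaFold's actual implementation; the toy sizes, the random weights, and the function names `softmax` and `self_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Step 2: linear projections of the input.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    D = Q.shape[-1]
    # Step 3: scaled dot products, softmax over the keys (each row sums to 1).
    A = softmax(Q @ K.T / np.sqrt(D), axis=-1)
    # Step 4: weighted sum of the value vectors.
    C = A @ V
    return A, C

rng = np.random.default_rng(0)
L, D = 5, 8                                   # toy sequence length and embedding size
X = rng.normal(size=(L, D))                   # step 1: stand-in for the embedded sequence
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))  # random here, learned in practice

A, C = self_attention(X, W_Q, W_K, W_V)
print(A.shape, C.shape)                       # (5, 5) (5, 8)
```

The shapes match the description: A is L×L and C is L×D.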

Attention mechanisms in AlphaFold allow the model to capture the relationships and dependencies between different parts of the protein sequence. By assigning attention weights to relevant amino acids, the model can focus on important regions during the prediction process, enabling accurate protein structure predictions.

The dimensions of the matrices involved are as follows:

Input Sequence Matrix (X): X ∈ ℝ^(L×D), where L is the length of the protein sequence and D is the dimensionality of each amino acid vector.
Query Matrix (Q): Q ∈ ℝ^(L×D), same as the dimensions of X.
Key Matrix (K): K ∈ ℝ^(L×D), same as the dimensions of X.
Value Matrix (V): V ∈ ℝ^(L×D), same as the dimensions of X.
Attention Weights Matrix (A): A ∈ ℝ^(L×L), where each element A_ij represents the attention weight between the i-th query and j-th key.
Context Matrix (C): C ∈ ℝ^(L×D), same as the dimensions of X.

So the matrices Q, K, V, X, and C have dimensions L×D, where L represents the length of the protein sequence and D represents the dimensionality of the amino acid vectors. The attention weights matrix A has dimensions L×L, capturing the attention weights between each query and key pair.

OpenFold is a PyTorch-based reproduction of AlphaFold; a comparison is given in https://wandb.ai/telidavies/ml-news/reports/OpenFold-A-PyTorch-Reproduction-Of-DeepMind-s-AlphaFold–VmlldzoyMjE3MjI5

What are amino acids ? Amino acids are organic compounds that serve as the building blocks of proteins. They contain an amino group (-NH2) and a carboxyl group (-COOH) attached to a central carbon atom, along with a specific side chain (R-group). The side chain varies among different amino acids, giving them unique properties.

There are 20 standard amino acids that are commonly found in proteins. Each amino acid has a unique structure and properties, determined by its specific side chain. They are glycine, alanine, valine, leucine, isoleucine, serine, threonine, cysteine, methionine, aspartic acid, glutamic acid, lysine, arginine, histidine, phenylalanine, tyrosine, tryptophan, asparagine, glutamine, and proline.

Proteins are assembled from amino acids through a process called translation. The genetic information stored in DNA is transcribed into messenger RNA (mRNA). The mRNA is then read by ribosomes, which assemble amino acids in a specific sequence according to the instructions provided by the mRNA. This sequence of amino acids forms a polypeptide chain, which then folds into a functional protein with a specific structure and function. The sequence of amino acids in a protein is determined by the sequence of nucleotides in the corresponding mRNA molecule.

Structure of AlphaFold from Nature paper, and as described here.

https://www.forbes.com/sites/robtoews/2023/07/16/the-next-frontier-for-large-language-models-is-biology/

https://www.nature.com/articles/s41592-023-01924-w A team of researchers led by Peter Kim at Stanford University has performed guided protein evolution using protein language models that were trained on millions of natural protein sequences.

https://aibusiness.com/nlp/meta-lays-off-team-behind-its-protein-folding-model

https://techcrunch.com/2024/06/25/evolutionaryscale-backed-by-amazon-and-nvidia-raises-142m-for-protein-generating-ai/

https://github.com/evolutionaryscale/esm

https://github.com/aws-samples/drug-discovery-workflows

https://github.com/lucidrains/alphafold3-pytorch

https://github.com/google-deepmind/alphafold

https://fold.it/about_foldit

https://build.nvidia.com/explore/biology

Transformer revisited

September 3, 2023December 24, 2025 · Leave a comment ·

A question driven dive into transformer, mainly to checkpoint my own level of understanding.

Q: What are similarities and differences between autocorrelation used on signals in signal processing and attention used on sequences of word vectors in transformers in deep learning ?

A: There are some similarities between the attention mechanism and autocorrelation in signal processing, in the sense that both involve computing a similarity measure between a signal and a shifted version of itself.

In autocorrelation, the similarity measure is typically computed as the dot product between the signal and a shifted version of itself. The resulting autocorrelation function can reveal patterns in the signal that are repeated over time, such as periodicity or other types of regularity.

Similarly, in the attention mechanism, the similarity measure between the query and the keys is computed as the dot product, and the resulting attention scores can reveal patterns in the input that are relevant to the query. These patterns can be used to weight the corresponding values when computing the final attention output.

However, there are also some important differences between the two concepts. In signal processing, autocorrelation is used primarily for analyzing signals, while the attention mechanism is used primarily for modeling relationships between elements in a set or sequence. Additionally, the attention mechanism can be more flexible and powerful than autocorrelation, as it can take into account multiple queries and keys simultaneously, and can be applied to a wide range of tasks in natural language processing, computer vision, and other domains.
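
To make the comparison concrete, here is a small NumPy sketch that computes both quantities side by side. The toy signal, the random projection matrices, and the function names are assumptions for illustration only. Autocorrelation is indexed by lag and has nothing to learn, while attention scores are indexed by pairs of positions and depend on learned projections.

```python
import numpy as np

def autocorrelation(x, max_lag):
    # r[k] = sum_t x[t] * x[t + k]: the signal dotted with a copy of itself shifted by k.
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) for k in range(max_lag + 1)])

def attention_scores(X, W_Q, W_K):
    # scores[i, j] = (query_i . key_j) / sqrt(d): one dot product per pair of positions.
    Q, K = X @ W_Q, X @ W_K
    return Q @ K.T / np.sqrt(Q.shape[-1])

# A signal with period 2: the autocorrelation alternates in sign with the lag.
signal = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
r = autocorrelation(signal, max_lag=3)        # lags 0..3 give 6, -5, 4, -3

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                   # 6 positions, 4-dimensional vectors
W_Q = rng.normal(size=(4, 4))                 # random here, learned in a real model
W_K = rng.normal(size=(4, 4))
scores = attention_scores(X, W_Q, W_K)        # shape (6, 6)
print(r, scores.shape)
```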

Q: How are the Q, K, V matrices derived from the input X ? can you confirm that these are functions of X ?

A: The transformer architecture utilizes matrices Q, K, and V which are derived from the input X. The Q, K, and V matrices in the transformer are functions of the input X, derived by applying learned projections. The queries Q represent a lower-dimensional embedding of X optimized for attention, while keys K and values V represent transformations of X optimized for similarity matching and information retrieval. By deriving Q, K, V from X, the transformer architecture can relate elements of the input to one another through scalable self-attention.

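In standard self-attention, Q is derived from X by applying a learned projection to X, so Q = Projection(X) = XWQ. The description “Q is a projection of a learned latent array with index dimension N ≪ M” appears to refer to a Perceiver-style variant, in which the queries come from a learned latent array rather than directly from X.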

K and V are derived by applying projections directly to the input X. Therefore, K = Projection(X) and V = Projection(X).

Q: How are the Q and K matrices incrementally adjusted in the forward and backward passes of back propagation, during training

A: During training with backpropagation:

Strictly speaking, Q and K are not themselves trainable parameters. They are recomputed from each input X, and what training adjusts are the projection weight matrices WQ and WK, which are initialized randomly before training starts.

In the forward pass, Q = XWQ and K = XWK are computed from the current weights, and the attention weights A and the loss L follow from them.

In the backward pass, the gradients of the loss are propagated back through the network to the weight matrices:

The gradients are computed with the chain rule. The loss L is differentiated w.r.t. the attention weights A, which in turn depend on Q and K, which in turn depend on WQ and WK.

dL/dQ = dL/dA * dA/dQ and
dL/dK = dL/dA * dA/dK

dL/dWQ = X^T * dL/dQ and
dL/dWK = X^T * dL/dK

These gradients are used to update WQ and WK with optimization techniques like stochastic gradient descent or Adam.

So in the forward pass Q and K are computed, and in the backward pass the gradients are computed and WQ and WK are updated to minimize the loss. This incremental adjustment via backpropagation is what allows Q and K to become useful representations.
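
The chain rule above can be checked numerically on a toy example. In this sketch the trainable parameters are the projection weights (called `W_Q` and `W_K` here), and Q and K are recomputed from them on every forward pass. To keep the algebra short, the loss is a hypothetical stand-in, L = sum(G * S) with S = QK^T, where the fixed matrix `G` plays the role of the upstream gradient dL/dS; a real model would also pass S through softmax and further layers.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 6, 3
X = rng.normal(size=(T, d_model))
W_Q = rng.normal(size=(d_model, d_k))     # trainable
W_K = rng.normal(size=(d_model, d_k))     # trainable
G = rng.normal(size=(T, T))               # stand-in for dL/dS

def loss(W_Q, W_K):
    # Forward pass: Q and K are recomputed from the current weights.
    Q, K = X @ W_Q, X @ W_K
    S = Q @ K.T
    return np.sum(G * S)

# Backward pass, written out by hand with the chain rule.
Q, K = X @ W_Q, X @ W_K
dL_dQ = G @ K                             # dL/dQ = dL/dS . dS/dQ
dL_dK = G.T @ Q                           # dL/dK = dL/dS . dS/dK
dL_dW_Q = X.T @ dL_dQ                     # dL/dW_Q = X^T dL/dQ
dL_dW_K = X.T @ dL_dK                     # dL/dW_K = X^T dL/dK

# Finite-difference check on a single weight entry.
eps = 1e-6
W_pert = W_Q.copy()
W_pert[0, 0] += eps
numeric = (loss(W_pert, W_K) - loss(W_Q, W_K)) / eps

# One plain SGD step on the weights (not on Q and K themselves).
lr = 0.01
W_Q_new = W_Q - lr * dL_dW_Q
W_K_new = W_K - lr * dL_dW_K
print(numeric, dL_dW_Q[0, 0])
```

The two printed numbers should agree closely, which confirms the hand-written gradient for that entry.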

Q: in a transformer with Q, K, V matrices where these are projections of the input X, can you confirm that the associated weight matrices for each transform remain fixed for any input X, that is they are a function of the corpus and training and not an individual input X ?

A: The weight matrices used for these projections remain fixed and are not dependent on any individual input X. The weights are learned during pre-training on the corpus and then fixed after training.

Q: In a transformer, Q, K, V matrices are derived from X as Q = XWQ, K = XWK, and V = XWV . what are the exact dimensions of X, WQ, WK, WV here ?

A: In the context of the Transformer architecture (as introduced in the “Attention Is All You Need” paper by Vaswani et al.), the matrices Q, K, and V are derived from the input X using weight matrices W^Q, W^K, and W^V, respectively. The exact dimensions of X, W^Q, W^K, and W^V are determined by the problem setting and the design choices of the transformer model.

However, for clarity, let’s break down the dimensions based on the original Transformer model:

X: The dimension of X is [T, d_model], where T is the sequence length and d_model is the dimension of the embeddings (e.g., 512 in the original Transformer model).
W^Q, W^K, W^V: These weight matrices are used to project X into the corresponding Q, K, and V spaces. In the original Transformer, the attention heads operate in a reduced-dimensional space of size d_k. So, the dimensions of these matrices are:
- W^Q: [d_model, d_k]
- W^K: [d_model, d_k]
- W^V: [d_model, d_v]

In the case of multi-head attention, there are h heads, and d_model = h × d_k. In the original Transformer, d_k = d_v = 64. Hence, for 8 heads (h = 8), d_model = 512.

So, given the above:

Q will have dimensions [T, d_k].
K will have dimensions [T, d_k].
V will have dimensions [T, d_v].

However, keep in mind that in the multi-head attention setting, these projections happen independently for each head, resulting in multiple Q, K, and V matrices, one set for each head.

To decide on the exact dimensions in a specific scenario, one would need to know the exact design choices or configurations of the transformer in question.
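
As a quick sanity check of these shapes, the sketch below builds random projection matrices with the sizes from the original paper (d_model = 512, h = 8, d_k = d_v = 64) and an assumed toy sequence length T = 10. The weights are random placeholders, and each head's output is stood in for by its V matrix, since only the shapes matter here.

```python
import numpy as np

T, d_model, h = 10, 512, 8              # T is a toy value; d_model and h follow the paper
d_k = d_v = d_model // h                # 64

rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))       # one row per token

head_outputs = []
for _ in range(h):
    # Each head has its own projection matrices (random here, learned in practice).
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_v))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    assert Q.shape == (T, d_k) and K.shape == (T, d_k) and V.shape == (T, d_v)
    head_outputs.append(V)              # placeholder with the same shape as the head's output

# Concatenating the h heads restores the model dimension: h * d_v = d_model.
concat = np.concatenate(head_outputs, axis=-1)
print(Q.shape, concat.shape)            # (10, 64) (10, 512)
```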

Q: So there is a difference between d_model and d_k. how is d_k chosen in that paper ?

A: Yes, there is a distinction between d_model and d_k in the Transformer architecture. In the original “Attention Is All You Need” paper by Vaswani et al., the multi-head attention mechanism is introduced. Each “head” of this multi-head attention operates in a reduced-dimensional space defined by d_k for queries and keys, and d_v for values.

In the original Transformer model:

d_model = 512
d_k = d_v = 64

The model uses h = 8 heads, which means:

d_model = h × d_k

So, 512 = 8 × 64.

The choice of d_k = 64 in the paper is a design decision, and it’s likely the result of experimentation to find a balance between computational efficiency and model performance. By having a smaller d_k, the dot products in the attention mechanism are computed in a reduced-dimensional space, which is more computationally efficient. By using multiple heads, the model can still capture different types of attention patterns in parallel.

The paper does not go into explicit detail about how they arrived at d_k = 64 beyond this design rationale, and the choice could be the result of empirical testing and optimization.

Q: How are the multiple different transformer units strung together ? There are multiple Q, K, V matrices in a single neural network as described in the paper, right ?

A: Absolutely. The original “Attention Is All You Need” paper introduced the Transformer model, which consists of multiple layers of transformer units, and each of these units employs the multi-head attention mechanism. I’ll walk you through how these units are strung together and how multiple Q, K, V matrices come into play.

Multi-Head Attention:
Within a single transformer unit (or layer), there's the concept of multi-head attention. This means the attention mechanism is applied multiple times (with different weight matrices) in parallel. For the original Transformer, this number is h = 8.
For each head, separate Q, K, and V matrices are computed, resulting in h sets of Q, K, V matrices. These are then used to compute h different attention outputs.
These h attention outputs are concatenated and linearly transformed to produce a single output for that transformer layer.

Stacked Transformer Layers:
The Transformer model consists of several stacked layers of the transformer units. In the original paper, they used 6 layers for both the encoder and the decoder. Each of these layers has its own set of weight matrices for computing Q, K, and V, and hence, its own multi-head attention mechanism.
The output from one layer is used as the input to the next layer, allowing for increasingly abstract representations as you move up the layers.

Encoder-Decoder Structure:
The original Transformer model has an encoder-decoder structure. Each of these consists of multiple transformer layers.
The encoder takes in the source sequence and produces a representation. This representation is then used by the decoder (alongside the target sequence) to produce the final output.
In the decoder, there are actually two attention mechanisms in each layer: one that attends to the decoder's own previous outputs (self-attention, like in the encoder) and another that attends to the encoder's output (cross-attention).

Residual Connections & Feed-Forward Networks:
Besides the attention mechanisms, each transformer layer also contains a position-wise feed-forward network and residual connections. The output from the multi-head attention is passed through this feed-forward network before being sent to the next layer.
Residual connections help in training deep networks by bypassing layers with the identity function. This is a crucial aspect of the Transformer's architecture.

To visualize, imagine the encoder as a vertical stack of blocks (transformer layers) where each block has 8 separate “channels” or “pipes” (multi-head attention) running through it. The same visualization applies to the decoder, but with an added complexity of having two attention mechanisms in each layer.

Yes, due to the multi-head attention and multiple layers, there are indeed multiple Q, K, and V matrices in the Transformer model as described in the paper.

Q: Can you count exactly how many Q,K, V matrices in total are in the above description ? think step by step

A: Certainly! Let’s break it down step by step for both the encoder and the decoder:
Encoder:

Multi-Head Attention within a Single Layer:
    Each transformer layer in the encoder has a multi-head attention mechanism with h = 8 heads.
    Therefore, for each layer, there are:
        8 Q matrices
        8 K matrices
        8 V matrices

Stacked Transformer Layers:
    The encoder consists of 6 stacked transformer layers.
    For all layers in the encoder:
        Q: 6 × 8 = 48
        K: 6 × 8 = 48
        V: 6 × 8 = 48

Decoder:

Self-Attention within a Single Layer:
    Similar to the encoder, each transformer layer in the decoder has a self-attention mechanism with h = 8 heads.
    For each layer:
        8 Q matrices for self-attention
        8 K matrices for self-attention
        8 V matrices for self-attention

Cross-Attention to Encoder Output within a Single Layer:
    Each transformer layer in the decoder also has a cross-attention mechanism that attends to the encoder's output. This also has h = 8 heads.
    For each layer:
        8 Q matrices for cross-attention (from the decoder)
        8 K matrices for cross-attention (from the encoder)
        8 V matrices for cross-attention (from the encoder)

Stacked Transformer Layers in the Decoder:
    The decoder also consists of 6 stacked transformer layers.
    For all layers in the decoder (considering both self-attention and cross-attention):
        Q: 6 × 8 + 6 × 8 = 96 (48 from self-attention and 48 from cross-attention)
        K: 6 × 8 + 6 × 8 = 96
        V: 6 × 8 + 6 × 8 = 96

Total Count:

Adding up both the encoder and the decoder:

Total Q matrices: 48 (from encoder) + 96 (from decoder) = 144
Total K matrices: 48 + 96 = 144
Total V matrices: 48 + 96 = 144

So, in total, across the entire Transformer model (encoder and decoder combined) as described in the original paper, there are:

144 Q matrices
144 K matrices
144 V matrices
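
The same count can be reproduced with a few lines of arithmetic. The variable names are made up for this sketch; the numbers (8 heads, 6 encoder layers, 6 decoder layers, two attention blocks per decoder layer) come from the description above.

```python
h = 8                          # attention heads per attention block
n_encoder_layers = 6
n_decoder_layers = 6
encoder_blocks_per_layer = 1   # self-attention only
decoder_blocks_per_layer = 2   # self-attention plus cross-attention

encoder_count = n_encoder_layers * encoder_blocks_per_layer * h
decoder_count = n_decoder_layers * decoder_blocks_per_layer * h
total_per_kind = encoder_count + decoder_count

# The same total applies separately to Q, to K, and to V.
print(encoder_count, decoder_count, total_per_kind)   # 48 96 144
```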

Q: Why do they use XWQ instead of the more typical linear transform Wq.X ?

A:
The notation “XW^Q” versus “W^QX” is primarily a matter of convention: the two forms describe the same linear map and are related by a transpose, since (XW^Q)^T = (W^Q)^T X^T. Both notations represent matrix multiplication, but the order of the operands is swapped and the data is laid out as rows in one case and as columns in the other. The choice of one over the other can be influenced by multiple factors, such as:

Data Representation: In some frameworks or notations, data instances are represented as rows in a matrix (often called "batch-first" convention). In others, data instances are columns (often called "sequence-first" or "time-first" convention). The choice of convention can dictate the order of the matrix multiplication to ensure the dimensions align correctly.

Historical and Cultural Preferences: Some fields or communities have preferred conventions based on historical precedent, educational materials, or influential publications.

Computational Efficiency: Depending on the data storage, memory access patterns, and the specific operations used, one form might be more efficient than the other. This is more of a concern when optimizing at a low level or for specific hardware.

In the “Attention Is All You Need” paper and many deep learning publications, the “batch-first” convention is used, which means input data X is represented with batches as rows. In this format, XW^Q makes sense because each row (data instance) of X is being linearly transformed by W^Q.

On the other hand, if you’re coming from a background where column vectors represent data instances (e.g., traditional signal processing or control systems), the notation W^QX might feel more natural.

It’s essential to be flexible in understanding both notations and to always check the context in which they’re used.
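
A short NumPy check makes the relationship between the two conventions explicit. The sizes are arbitrary toy values and `W` stands in for W^Q; the point is only that the row form and the column form contain the same numbers, transposed.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k = 3, 4, 2
X = rng.normal(size=(T, d_model))     # tokens as rows
W = rng.normal(size=(d_model, d_k))   # stand-in for W^Q

row_form = X @ W                      # "X W" convention, shape (T, d_k)
col_form = W.T @ X.T                  # "W x" convention with tokens as columns, shape (d_k, T)

print(np.allclose(row_form, col_form.T))   # True
```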

Q: what do you mean by input data X is represented with batches as rows. is each row an embedding vector representing a word ?

A: Yes, I apologize for not being clear. You’re correct. In the context of the Transformer architecture and many deep learning models, when we talk about data X, it often refers to a matrix where:

Each row corresponds to an embedding vector of a token (e.g., a word or subword).
The number of rows represents the sequence length (i.e., the number of tokens in a given sequence).

For instance, let’s say you’re processing a sentence with 10 words, and you’re using an embedding dimension of 512. Your input X for that sentence would be a 10×512 matrix. Here, each of the 10 rows is a 512-dimensional embedding vector corresponding to each of the 10 words.

Now, when we introduce batching into the mix:

A batch consists of multiple such sequences (sentences) processed together for computational efficiency.
If you have a batch of N sentences, and each sentence (sequence) has been padded or truncated to have a length of T tokens, your batched input data would be an N×T×512 tensor.

However, when dealing with linear transformations in the context of the Transformer’s attention mechanism, we’re often focused on one sequence at a time (or a batch of sequences but processed in parallel), so we can think of X as a T×512 matrix, with T being the sequence length and 512 being the embedding dimension.

The notation “XW^Q” then makes sense in this context, as each row (token embedding) in X is transformed by the matrix W^Q to produce the query matrix Q.
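
In NumPy the batched case needs no extra code, because matrix multiplication broadcasts over the leading batch dimension. The batch size N = 2, the sequence length T = 10, and the random weights below are assumptions for the sketch; 512 and 64 are the sizes discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d_model, d_k = 2, 10, 512, 64

X_batch = rng.normal(size=(N, T, d_model))   # N padded sequences of T token embeddings
W_Q = rng.normal(size=(d_model, d_k))        # shared by every sequence in the batch

# The same W_Q is applied to every row of every sequence.
Q_batch = X_batch @ W_Q                      # shape (N, T, d_k)

# Processing one sequence on its own gives the same rows.
Q_first = X_batch[0] @ W_Q                   # shape (T, d_k)
print(Q_batch.shape, np.allclose(Q_batch[0], Q_first))
```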

Q: Can you give some reference papers for applications of group theory in deep learning

A: Absolutely! Here’s a list of notable papers that touch upon the intersection of group theory and deep learning:

Equivariant Neural Networks:
- “Group Equivariant Convolutional Networks” by Taco S. Cohen and Max Welling. This paper is foundational for introducing group equivariance into CNNs.
  - Link to paper
- “Steerable CNNs” by Taco S. Cohen et al. This work dives into orientation and scale-equivariant CNNs using steerable filters.
  - Link to paper
Invariant Neural Networks:
- “Deep Sets” by Manzil Zaheer et al. This paper isn’t about group theory per se, but it deals with the idea of permutation invariance in neural networks, which relates to certain group symmetries.
  - Link to paper
Transformers and Group Theory:
- While the application of group theory to transformers isn’t as well-established as with CNNs, the AlphaFold 2 paper (as we discussed) can serve as a reference for equivariant attention mechanisms.
  - Link to AlphaFold 2 paper
General Understanding and Overview:
- “A General Theory of Equivariant CNNs on Homogeneous Spaces” by Taco S. Cohen et al. provides a broader understanding of the topic.
  - Link to paper
Applications in Molecular Biology:
- “Equivariant Networks for Molecular Property Prediction” by Bowen Cui, Ivan Zhang, and Zachary Teed explores the applications of group equivariance in predicting molecular properties.
  - Link to paper
For More Advanced Topics and Generalizations:
- “Gauge Equivariant Convolutional Networks and the Icosahedral CNN” by Taco S. Cohen et al. explores more advanced structures and symmetries.
  - Link to paper

While this list provides a good starting point, it’s essential to note that the intersection of group theory and deep learning is an active research area, with new contributions regularly. Keeping an eye on major AI conferences and journals will reveal the latest advancements in this domain.

Deep Reinforcement Learning key papers

July 2, 2023July 2, 2024 · Leave a comment ·

Reinforcement Learning (RL) combined with Deep Learning has been termed Deep Reinforcement Learning (DRL). Deep learning provides function approximation techniques that can handle large and complex state and/or action spaces, making it possible to tackle problems that were infeasible with traditional RL techniques. Methods from this line of research, notably PPO, were later used to fine-tune LLMs through reinforcement learning from human feedback (RLHF). Here’s a brief timeline of key insights and breakthroughs in Deep Reinforcement Learning over the past decade:

1. 2013 – Playing Atari with Deep Reinforcement Learning:

Organization: DeepMind
Breakthrough: This was perhaps the first major work that combined deep learning with Q-learning, resulting in a Deep Q-Network (DQN). The DQN was able to play several Atari 2600 games directly from pixels, surpassing a human expert on some of them.
Key Insights: Experience replay was used to stabilize learning by breaking the temporal correlations between consecutive samples. Fixed Q-targets (a separate, periodically updated target network), which reduce the moving target problem in Q-learning, were added in the 2015 follow-up.
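
Experience replay can be sketched as a fixed-size buffer that stores transitions and returns random minibatches. This is a minimal illustration under assumed names (`ReplayBuffer`, `add`, `sample`) and toy integer states, not DeepMind's implementation. Sampling at random is what breaks the temporal correlation between consecutive transitions.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=50)
for step in range(100):
    # Toy transition: the state is just the step index.
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=8)
print(len(buffer), len(batch))   # 50 8
```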

2. 2015 – Human-level control through deep reinforcement learning:

Organization: DeepMind
Breakthrough: An extension of the 2013 DQN work, this presented a more robust DQN that achieved human-level performance across a broad range of Atari games.
Key Insights: Further stabilization and scaling of DQNs, including a periodically updated target network.

3. 2015 – Continuous control with deep reinforcement learning (DDPG):

Organization: DeepMind
Breakthrough: Introduced the Deep Deterministic Policy Gradient (DDPG) algorithm for continuous action spaces.
Key Insights: It utilized actor-critic architecture where the actor produces a deterministic policy, and the critic evaluates it. The Ornstein-Uhlenbeck process was used to add exploration noise.

4. 2016 – Asynchronous Methods for Deep Reinforcement Learning (A3C):

Organization: DeepMind
Breakthrough: Introduced the Asynchronous Advantage Actor-Critic (A3C) algorithm which combined the actor-critic approach with asynchronous updates.
Key Insights: Multiple agents, each with its own copy of the model parameters, explored different parts of the environment simultaneously and asynchronously updated a shared global model, leading to faster and more robust policy learning. The asynchronous nature also helped in stabilizing learning.

5. 2017 – Proximal Policy Optimization (PPO):

Organization: OpenAI
Breakthrough: Introduced a simpler and more robust method for policy gradient optimization, making training more stable.
Key Insights: PPO constrains the policy updates to ensure the new policy isn’t too different from the old policy, thereby avoiding extreme policy updates that can destabilize training. PPO balances the benefits of both Policy Gradient methods and Trust Region Policy Optimization (TRPO) methods. PPO achieves this by using a clipped surrogate objective that prevents large updates during training, enhancing stability and performance. In the context of PPO, the term “surrogate objective” refers to an approximation used in place of the actual objective function during optimization. This surrogate function is easier to optimize and ensures more stable and reliable updates to the policy. The clip function ensures that the probability ratio between the new and old policies does not deviate too far from 1 by clipping it to the range [1−ϵ, 1+ϵ]. This prevents excessively large policy updates.
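
The clipping can be illustrated with a few lines of NumPy. This sketch computes only the per-sample clipped surrogate term, min(r·A, clip(r, 1−ϵ, 1+ϵ)·A), for hand-picked probability ratios r and advantages A. The function name and the value ϵ = 0.2 are assumptions for the example, and a full PPO implementation would add value and entropy terms and average over a batch.

```python
import numpy as np

def ppo_clipped_terms(ratio, advantage, eps=0.2):
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return np.minimum(unclipped, clipped)

ratios = [0.5, 1.0, 1.5]
positive = ppo_clipped_terms(ratios, [1.0, 1.0, 1.0])      # [0.5, 1.0, 1.2]
negative = ppo_clipped_terms(ratios, [-1.0, -1.0, -1.0])   # [-0.8, -1.0, -1.5]
print(positive, negative)
```

With a positive advantage the gain from raising the ratio is capped at 1+ϵ, and with a negative advantage the gain from lowering it is capped at 1−ϵ.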

6. 2018 – Soft Actor-Critic (SAC):

Organization: UC Berkeley
Breakthrough: SAC is an off-policy actor-critic deep RL algorithm based on the maximum entropy RL framework.
Key Insights: SAC seeks policies that maximize both expected return and entropy, leading to more exploration, smoother policy updates, and generally better performance on continuous control tasks.

7. 2019 and beyond:

Subsequent years have seen the evolution of these methods and the introduction of new algorithms, improvements in sample efficiency, stability, and scalability. Also, there has been a focus on:

Transfer Learning: Using pre-trained models to improve sample efficiency in RL.
Meta-learning: Training agents that can quickly adapt to new tasks.
Model-based RL: Incorporating learned models of the environment dynamics to improve sample efficiency and policy learning.

LLM evolution – Anthropic , AI21, Cohere, GPT-4

May 14, 2023June 21, 2023 · Leave a comment ·

https://github.com/Mooler0410/LLMsPracticalGuide

Source paper – Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Pink branch is encoder only. Green branch is encoder-decoder. Blue branch is decoder-only.

This is consistent with the Generative aspect of the blue branch. But it does not explain the emergent properties at the top of the blue tree.

LLM leaderboard – https://chat.lmsys.org/?leaderboard

Stanford HELM (holistic evaluation of LMs) – https://crfm.stanford.edu/helm/latest/?models=1

Constitutional AI paper from Anthropic – https://arxiv.org/abs/2212.08073

More on emergent properties in links below.

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1

https://openai.com/research/solving-math-word-problems : Autoregressive models, which generate each solution token by token, have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable, as can be seen in the examples provided. We address this problem by training verifiers to evaluate the correctness of model-generated solutions. Verifiers are given many possible solutions, all written by the model itself, and they are trained to decide which ones, if any, are correct.

Language Models are Few-Shot Learners – https://openai.com/research/language-models-are-few-shot-learners

LLM inferencing tools/techniques were discussed here.

LLM Inferencing is hard – tools and techniques

April 29, 2023February 12, 2024 · 1 Comment ·

Large Language Models take up a lot of GPU memory, with the larger ones exceeding the memory of a single GPU. Space is taken up by the model weights as well as by in-memory, query-specific tensor calculations. Model parallelism to store an LLM across multiple GPUs is both expensive and hard. This makes it important to look at techniques to fit an LLM in a single GPU.

Let’s say the foundation models are available such that no further training is needed and that one (just) wants to run inference against them. Inferencing is not a small challenge, and a number of techniques have been explored. Here’s a link, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/, which discusses

student-teacher knowledge distillation training, leading to DistilBERT
quantization, quantization-aware training, post-training quantization
pruning
architectural optimization, efficient transformers

A GoPenAI blog post on speeding up and scaling LLMs to 100k context windows: https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c

High-throughput Generative Inference of Large Language Models with a Single GPU https://arxiv.org/pdf/2303.06865.pdf, discusses 3 strategies with a focus on a single GPU.

model compression
collaborative inference
offloading to utilize memory from CPU and disk

They then show 3 contributions

definition of the optimization search space for offloading, including weights, activations, KV cache, and an algorithm to get an optimal offloading strategy within the search space
quantization of the parameters to 4 bits with small loss of accuracy
running an OPT-175B model on a single T4 GPU with 16GB memory (!)
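To make the 4-bit idea concrete, here is a minimal sketch of min-max quantization applied to one small group of weights. It is not the paper's implementation; the function names and the sample weights are made up for illustration. Each value is mapped to one of 16 integer codes, and only the codes plus a per-group offset and scale need to be stored.

```python
def quantize_group(values, bits=4):
    """Min-max quantize a list of floats to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_group(codes, lo, scale):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


weights = [-0.8, -0.1, 0.0, 0.25, 0.4, 0.7]  # hypothetical weight group
codes, lo, scale = quantize_group(weights)
restored = dequantize_group(codes, lo, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

The reconstruction error of each value is bounded by half the step size (`scale / 2`), which is why quantizing in small groups, each with its own range, keeps the accuracy loss small.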

PEFT – Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning – https://arxiv.org/pdf/2303.15647.pdf says “expanding the context size leads to a quadratic increase in inference costs”

There are three main classes of PEFT methods:

Addition-based (within additive methods, the paper distinguishes two large groups: adapter-like methods and soft prompts),
Selection-based, and
Reparametrization-based.

General strategies for inference concurrency, courtesy of ChatGPT:

To process multiple concurrent inference requests without interference between them, a model can use techniques such as parallelization and batching.

Parallelization involves splitting the workload across multiple processing units, such as CPUs or GPUs, so that multiple requests can be processed simultaneously without interfering with each other. This can be achieved using frameworks such as TensorFlow or PyTorch, which provide support for parallel processing.

Batching involves grouping multiple requests together and processing them as a single batch. This can increase the efficiency of the model by reducing the overhead associated with processing each request individually. Batching can be particularly effective for models that are optimized for throughput rather than latency.
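A minimal sketch of static batching follows. The `run_model_on_batch` function is a hypothetical stand-in for a real forward pass; the point is only that the per-call overhead is paid once per batch instead of once per request.

```python
def make_batches(requests, max_batch_size):
    """Split a list of pending requests into consecutive batches."""
    return [
        requests[i:i + max_batch_size]
        for i in range(0, len(requests), max_batch_size)
    ]


def run_model_on_batch(batch):
    """Hypothetical stand-in for one forward pass over a whole batch."""
    return [f"echo: {prompt}" for prompt in batch]


def serve(requests, max_batch_size=4):
    """Process all requests batch by batch, preserving request order."""
    results = []
    for batch in make_batches(requests, max_batch_size):
        results.extend(run_model_on_batch(batch))
    return results


print(make_batches(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(serve(["a", "b", "c"], max_batch_size=2))
```

The trade-off is visible even in this toy: a larger `max_batch_size` means fewer model calls (higher throughput), but each request waits for its whole batch to finish (higher latency).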

Another technique that can be used is dynamic scheduling, which involves assigning resources to requests based on their priority and the availability of resources at a given time. This can help ensure that high-priority requests are processed quickly without interfering with lower-priority requests.
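Priority-based scheduling can be sketched with a heap from the Python standard library. The `PriorityScheduler` class and its method names are illustrative only; a real serving system would also track resource availability, which is left out here.

```python
import heapq
import itertools


class PriorityScheduler:
    """Hands out pending requests in priority order (lower number first)."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority requests stay first-in, first-out.
        self._counter = itertools.count()

    def submit(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self):
        """Return the most urgent pending request, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request


scheduler = PriorityScheduler()
scheduler.submit("batch report", priority=2)
scheduler.submit("interactive chat", priority=0)
scheduler.submit("nightly summary", priority=2)
scheduler.submit("api call", priority=1)

order = []
while (req := scheduler.next_request()) is not None:
    order.append(req)
print(order)  # ['interactive chat', 'api call', 'batch report', 'nightly summary']
```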

Efficiently Scaling Transformer Inference is a paper from Google (Nov ’22) discussing partitioning of weights and activations across multiple heads and multiple chips.

Feature Vectors, Embeddings, Vector Databases, Feature Stores

April 8, 2023 · June 19, 2023

An ML model consists of a set of weights (or a set of numerical values) that transform inputs to outputs (along with a nonlinear transform such as a sigmoid function). The weights are often organized as vectors or matrices. Consider neural networks, decision trees and support vector machines as types of ML models for this discussion.

The weights representing features of the data (input or intermediate data) are also called feature vectors or vectors. They are also called embeddings, that is, embeddings of the data in a vector space. We discussed such vectors in https://securemachinery.com/2019/05/24/transformer-gpt-2/.

The term “embedding” comes from the idea that the vectors “embed” the original data into a lower-dimensional space. The embedding process involves a combination of statistical and computational techniques, such as factorization and neural networks, that learn to map the input data into the vector space in a way that preserves the relevant properties of the original data.

The use of dense vectors to represent words became widespread in machine learning research in 2013 with the publication of the paper “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov et al. This paper popularized the word2vec algorithm, which generates dense vector representations of words based on their distributional properties in a large corpus of text. The size of the vector or embedding in a word embedding model is a hyperparameter that needs to be determined before training the model. It is typically chosen based on the size of the vocabulary and the complexity of the task at hand. In practice, the vector size is often set to be between 100 and 300 dimensions, but this can vary depending on the specific application and the available computational resources. The optimal vector size can be determined through experimentation and tuning of hyperparameters.

One difference between embeddings and feature vectors is that embeddings are typically learned automatically from the data, while feature vectors are typically chosen based on domain knowledge or feature engineering. However, these two terms are often used interchangeably. Here is a video going over how the embeddings are obtained from words in a sentence with a bag-of-words approach – https://www.youtube.com/watch?v=viZrOnJclY0 .

Pinecone, Milvus, and Google Vertex AI Matching Engine are examples of vector databases; Facebook AI Similarity Search (FAISS) is a library for vector similarity search that is often used in the same role.

The challenge in implementing a vector database is that traditional databases are not optimized for handling high-dimensional vector data, which is often used in machine learning and data science applications.

Vector data is typically represented as arrays of numbers, where each number represents a feature or attribute of the data. For example, an image might be represented as a high-dimensional vector where each dimension represents the color value of a specific pixel. In contrast to traditional databases, where each record consists of a set of fields or columns, vector databases need to store and index large volumes of high-dimensional data in a way that supports efficient similarity search.

In traditional databases, queries are typically based on simple comparisons of scalar values, such as equality or range queries. However, in vector databases, similarity search is the primary operation, which requires specialized algorithms and data structures to efficiently compute the similarity between vectors. These algorithms are designed to handle high-dimensional data and minimize the amount of computation needed to compare vectors, which can be computationally expensive.

There are several distance measures and indexing techniques that are commonly used in vector databases to support efficient similarity search. Here are some examples:

Euclidean Distance: This is a distance metric that measures the straight-line distance between two points in Euclidean space. It is commonly used in vector databases to compute the distance or similarity between vectors.
Cosine Similarity: This is a similarity metric that measures the cosine of the angle between two vectors. It is commonly used in text-based applications to measure the similarity between documents or word embeddings.
Locality-Sensitive Hashing (LSH): This is a technique used to hash high-dimensional vectors into lower-dimensional buckets based on their similarity. It is commonly used in vector databases to speed up similarity search by reducing the number of comparisons needed to find similar vectors.
Product Quantization: This is a technique used to divide high-dimensional vectors into smaller subvectors and quantize them separately. It is commonly used in vector databases to compress the vectors into short codes, which reduces memory use and speeds up similarity search.
Inverted Indexing: This is a technique used to index the vectors based on the values of their individual dimensions. It is commonly used in text-based applications to speed up search queries by indexing the terms in the document.
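The first three items in the list above can be sketched with the Python standard library. This is a toy illustration, not how any particular vector database implements them: the LSH variant shown is the random-hyperplane scheme for cosine similarity, and the function names are made up for this example.

```python
import math
import random


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random hyperplane (normal vector) per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """Bit i records which side of hyperplane i the vector falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


planes = make_hyperplanes(num_bits=8, dim=3)
v = [0.2, -1.0, 0.5]
print(euclidean_distance([0, 0], [3, 4]))      # 5.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(lsh_signature(v, planes) == lsh_signature([2 * x for x in v], planes))  # True
```

Vectors that point in similar directions tend to share most signature bits, so a search only needs to compare the query against vectors in the same or nearby buckets instead of against every stored vector.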

Pinecone is a managed service that performs approximate nearest neighbor search and handles index construction and tuning internally, based on the properties of the data and the search requirements. The main choice exposed to the user is the distance metric (for example cosine, Euclidean, or dot product), which is set through the metric parameter when an index is created rather than on each query.

While OpenSearch is not specifically designed as a vector database like Pinecone, it provides vector search capabilities through its k-NN plugin. K-nearest neighbor (k-NN) search finds the k vectors closest to a query vector in a high-dimensional space. OpenSearch supports approximate nearest neighbor search using engines such as NMSLIB, Faiss, and Lucene, which build HNSW or IVF indexes. To use vector search in OpenSearch, you first need to index your vector data in a field of type knn_vector with the appropriate dimension. You can then perform a nearest neighbor search by specifying the query vector and the number of nearest neighbors to return. OpenSearch also provides support for vector scoring, which allows you to rank, boost, or filter search results based on their similarity to a query vector.

What kind of vectorization schemes are useful for log processing?

When processing log data, the goal is typically to extract useful information from the log entries and transform them into a format that can be easily analyzed and searched. Vectorization is a common technique used for this purpose, and there are several vectorization schemes that are applicable to log processing. Here are some examples:

Bag-of-words: This is a vectorization scheme that represents a document as a bag of words, where each word is represented by a dimension in the vector and the value of the dimension is the frequency of the word in the document. Bag-of-words can be used to represent log entries as a vector of words, which can be used for tasks such as text classification and anomaly detection.
TF-IDF: This is a vectorization scheme that represents a document as a weighted combination of its term frequency and inverse document frequency. TF-IDF can be used to represent log entries as a vector of weighted words, which can be used for tasks such as information retrieval and text mining.
Word embeddings: This is a vectorization scheme that represents words as dense vectors in a high-dimensional space, where the distance between vectors reflects the semantic similarity between the words. Word embeddings can be used to represent log entries as a vector of word embeddings, which can be used for tasks such as text classification and entity recognition.
Sequence embeddings: This is a vectorization scheme that represents a sequence of words as a dense vector in a high-dimensional space, where the distance between vectors reflects the similarity between the sequences. Sequence embeddings can be used to represent log entries as a vector of sequence embeddings, which can be used for tasks such as sequence classification and anomaly detection.
One-hot encoding: This is a vectorization scheme that represents categorical data as binary vectors, where each dimension corresponds to a possible category and the value of the dimension is 1 if the data belongs to that category and 0 otherwise. One-hot encoding can be used to represent log entries as a vector of categorical features, which can be used for tasks such as classification and clustering.
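Three of the schemes above (bag-of-words, TF-IDF, and one-hot encoding) are simple enough to sketch with the standard library. The log lines below are invented for illustration, the tokenizer is a plain whitespace split, and the TF-IDF variant shown (term frequency times log of inverse document frequency, with no smoothing) is one of several common definitions.

```python
import math
from collections import Counter

logs = [  # hypothetical log entries
    "ERROR disk full on node1",
    "INFO job started on node2",
    "ERROR disk failure on node2",
]


def tokenize(line):
    return line.lower().split()


docs = [tokenize(line) for line in logs]
vocab = sorted({token for doc in docs for token in doc})


def bag_of_words(tokens, vocab):
    """One dimension per vocabulary word; the value is the word count."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tf_idf(tokens, vocab, docs):
    """Down-weight words that appear in many log entries."""
    counts = Counter(tokens)
    vector = []
    for term in vocab:
        tf = counts[term] / len(tokens)
        df = sum(1 for doc in docs if term in doc)
        idf = math.log(len(docs) / df) if df else 0.0
        vector.append(tf * idf)
    return vector


def one_hot(value, categories):
    """Binary vector with a 1 in the position of the matching category."""
    return [1 if value == category else 0 for category in categories]


print(vocab)
print(bag_of_words(docs[0], vocab))
print(tf_idf(docs[0], vocab, docs))
print(one_hot("ERROR", ["ERROR", "INFO", "WARN"]))  # [1, 0, 0]
```

In the TF-IDF output, the word "on" gets a weight of zero because it occurs in every entry, while a rare word such as "full" gets the largest weight. That is the property that makes TF-IDF useful for spotting unusual log lines.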

By using a suitable vectorization scheme, log data can be transformed into a format that can be easily analyzed and searched, enabling tasks such as anomaly detection, root cause analysis, and performance optimization.

Vector database versus Feature store – what’s the difference?

Both vector databases and feature stores are used to manage and serve high-dimensional data, such as embeddings, vectors, and other numerical representations, but there are some key differences between the two.

A vector database is a database optimized for storing and querying high-dimensional vector data. It provides efficient indexing and search algorithms, such as approximate nearest neighbor search, that allow for fast and scalable similarity search. Vector databases are commonly used in machine learning applications, such as recommendation systems and natural language processing, where the goal is to find similar items or entities based on their vector representations.

A feature store, on the other hand, is a centralized repository for machine learning features that provides a way to store, manage, and share feature data across different applications and teams. It is designed to help data scientists and machine learning engineers build, test, and deploy machine learning models more efficiently by providing a unified interface for accessing and managing features.

While both vector databases and feature stores can store and serve high-dimensional data, the main difference is their focus and use case. Vector databases are designed for efficient similarity search, while feature stores are designed for feature management and sharing across different applications and teams. In practice, they can complement each other in many machine learning workflows, with the vector database providing the efficient similarity search capabilities and the feature store providing a centralized and standardized way to manage and share feature data.

Comparison of Milvus Pinecone Vespa Weaviate Vald GSI Qdrant – https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696

Anyscale – Using an embeddings database to train an LLM using Ray – https://www.anyscale.com/blog/llm-open-source-search-engine-langchain-ray

OpenAI embeddings example – https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

HuggingFace sentence embeddings article – https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a

AWS – https://medium.com/@shankar.arunp/augmenting-large-language-models-with-verified-information-sources-leveraging-aws-sagemaker-and-f6be17fb10a8