Category: gpu

LLM Inferencing is hard – tools and techniques

Large Language Models take up a lot of GPU memory with the larger ones exceeding GPU memory sizes. Space is taken up my the model weights as well as by in-memory query specific tensor calculations. Model parallelism to store an LLM across multiple GPUs is both expensive and hard. This makes it important to look at techniques to fit an LLM in a single GPU.

Let’s say the foundation models are available such that no further training is needed and and that one (just) wants to inference against them. Inferencing is not a small challenge, and a number of techniques have been explored. Here’s a link – https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ which discusses

  • student-teacher knowledge distillation training, leading to DistilBert
  • quantization, quantization-aware training, post-training quantization
  • pruning
  • architectural optimization, efficient transformers

OpenAI link on speeding and scaling LLMs to 100k context windows – https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c

High-throughput Generative Inference of Large Language Models with a Single GPU https://arxiv.org/pdf/2303.06865.pdf, discusses 3 strategies with a focus on a single GPU.

  • model compression
  • collaborative inference
  • offloading to utilize memory from CPU and disk

They then show 3 contributions

  • definition of the optimization search space for offloading, including weights, activations, KV cache, and an algorithm to get an optimal offloading strategy within the search space
  • quantization of the parameters to 4 bits with small loss of accuracy
  • run a OPT-175B model on a single T4 GPU with 16GB memory (!)

PEFT – Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning – https://arxiv.org/pdf/2303.15647.pdf says ”expanding the context size leads to a quadratic increase in inference costs

There are three main classes of PEFT methods:

  • Addition-based, ( Within additive methods, we distinguish two large included groups: Adapter-like methods and Soft prompts)
  • Selection-based, and
  • Reparametrization-based.

General strategies for inference concurrency, courtesy chatgpt:

To process multiple concurrent inference requests without interference between them, a model can use techniques such as parallelization and batching.

Parallelization involves splitting the workload across multiple processing units, such as CPUs or GPUs, so that multiple requests can be processed simultaneously without interfering with each other. This can be achieved using frameworks such as TensorFlow or PyTorch, which provide support for parallel processing.

Batching involves grouping multiple requests together and processing them as a single batch. This can increase the efficiency of the model by reducing the overhead associated with processing each request individually. Batching can be particularly effective for models that are optimized for throughput rather than latency.

Another technique that can be used is dynamic scheduling, which involves assigning resources to requests based on their priority and the availability of resources at a given time. This can help ensure that high-priority requests are processed quickly without interfering with lower-priority requests.

Efficiently scaling transformer inference – link is a paper from Google discussing partitioning of weights and activations across multiple heads and multiple chips (Nov’22).

EC2 P5 UltraClusters

Each P5 EC2 instances has

  • eight NVIDIA H100 GPUs capable of 16 petaFLOPs of mixed-precision performance
  • 640 GB of high-bandwidth memory, 80GB in each GPU
  • 3,200 Gbps networking connectivity (8x more than the previous generation)

The increased performance of P5 instances accelerates the time-to-train machine learning (ML) models by up to 6x (reducing training time from days to hours), and the additional GPU memory helps customers train larger, more complex models.

P5 instances are expected to lower the cost to train ML models by up to 40% over the previous generation, providing customers greater efficiency over less flexible cloud offerings or expensive on-premises systems.

https://nvidianews.nvidia.com/news/aws-and-nvidia-collaborate-on-next-generation-infrastructure-for-training-large-machine-learning-models-and-building-generative-ai-applications

Nvidia H100 GPU overview and data sheet – https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

Diagram of P4d UltraClusters

P4d consists of 8 A100 GPUs, with 40GB GPU Memory each

P4de consists of 8 A100 80GB GPUs, with 80GB GPU memory each

Nvidia blog on HGX baseboard supporting 8 A100 GPUs – https://developer.nvidia.com/blog/introducing-hgx-a100-most-powerful-accelerated-server-platform-for-ai-hpc/

A100 80GB data sheet – https://www.nvidia.com/en-us/data-center/a100/

MIG support in A100 – https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/ and MIG user guide – https://docs.nvidia.com/datacenter/tesla/mig-user-guide

MIG support in AWS EC2 instance type P4d and in AWS EKS – https://developer.nvidia.com/blog/amazon-elastic-kubernetes-services-now-offers-native-support-for-nvidia-a100-multi-instance-gpus/

GCP A2 adds 16 A100 GPUs to a node – https://cloud.google.com/blog/products/compute/announcing-google-cloud-a2-vm-family-based-on-nvidia-a100-gpu

https://cloud.google.com/blog/products/containers-kubernetes/gke-now-supports-multi-instance-gpus

Running more pods/gpu on EKS with MIG – https://medium.com/itnext/run-more-pods-per-gpu-with-nvidia-multi-instance-gpu-d4f7fb07c9b5

Nvidia Embraces The CPU World With “Grace” Arm Server Chip

EC2 Trainium UltraClusters

Each EC2 Trn1 instance has

  • up to 16 AWS Trainium accelerators purpose built to accelerate DL training and deliver up to 3.4 petaflops of FP16/BF16 compute power. Each accelerator includes two second-generation NeuronCores
  • 512 GB of shared accelerator memory (HBM) with 9.8 TB/s of total memory bandwidth
  • 1600 Gbps of Elastic Fabric Adapter (EFAv2)

An EC2 Trn1 UltraCluster, consists of densely packed, co-located racks of Trn1 compute instances interconnected by non-blocking petabyte scale networking. It is our largest UltraCluster to date, offering 6 exaflops of compute power on demand with up to 30,000 Trainium chips.

https://aws.amazon.com/blogs/machine-learning/scaling-large-language-model-llm-training-with-amazon-ec2-trn1-ultraclusters/

Processors for Deep Learning: Nvidia Ampere GPU, Tesla Dojo, AWS Inferentia, Cerebras

The NVidia Volta-100 GPU released in Dec 2017 was the first microprocessor with dedicated cores purely for matrix computations called Tensor Cores. The Ampere-100 GPU released May’20 is its successor. Ampere has 84 Streaming Multiprocessors (SMs) with 4 Tensor Cores (TCs) each for a total of 336 TCs. Tensor Cores reduce the cycle time for matrix multiplications, operating on 4×4 matrices of 16bit floating point numbers. These GPUs are aimed at Deep Learning use cases which consist of a pipeline of matrix operations.

Here’s an article on choosing the right EC2 instance type for DL – https://towardsdatascience.com/choosing-the-right-gpu-for-deep-learning-on-aws-d69c157d8c86 (G4 for inferencing, P4 for training).

How did the need for specialized DL chips arise, and why are Tensors important in DL ? In math, we have Scalars and Vectors. Scalars are used for magnitude and Vectors encode magnitude and direction. To transform Vectors, one applies Linear Transformations in the form of Matrices. Matrices for Linear Transformations have EigenVectors and EigenValues which describe the invariants of the transformation. A Tensor in math and physics is a concept that exhibits certain types invariance during transformations. In 3 dimensions, a Stress Tensor has 9 components, which can be representated as a 3×3 matrix; under a change of basis the components of the tensor change however the tensor itself does not.

In Deep Learning applications a Tensor is basically a Matrix. The Generalized Matrix Multiplication (GEMM) operation, D=AxB+C, is at the heart of Deep Learning, and Tensor Cores are designed to speed these up.

In Deep Learning, multilinear maps are interleaved with non-linear transforms to model arbitrary transforms of input to output and a specific model is arrived by a process of error reduction on training of actual data. This PyTorch Deep Learning page is an excellent resource to transition from traditional linear algebra to deep learning software – https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html .

Tesla Dojo is planned to build a processor/computer dedicated for Deep Learning to train on vast amounts of video data. Launched on Tesla AI Day, Aug’20 2021, a video at https://www.youtube.com/watch?v=DSw3IwsgNnc

AWS Inferentia is a chip for deep learning inferencing, with its four Neuron Cores.

AWS Trainium is an ML chip for training.

Generally speaking the desire in deep learning community is to have simpler processing units in larger numbers.

Updates: Cerebras announced a chip which can handle neural networks with 120 trillion parameters, with 850,000 AI optimized cores per chip.

SambaNova, Anton, Cerebras and Graphcore presentations are at https://www.anandtech.com/show/16908/hot-chips-2021-live-blog-machine-learning-graphcore-cerebras-sambanova-anton

SambaNova is building 400,000 AI cores per chip.

NVIDIA GPUAWS InstanceAzure Instance
M60G3
T4G4NVv4
V100P3NCv4
A100P4, P4dNDv4

https://lambdalabs.com/blog/nvidia-a100-vs-v100-benchmarks

Omnisci GPU based columnar database

Omnisci is a columnar database that reads a column into GPU memory, in compressed form, allowing for interactive queries on the data. A single gpu can load 10million to 50million rows of data and allows interactive querying without indexing. A demo was shown at the GTC keynote this year, by Aaron Williams. He gave a talk on vehicle analytics that I attended last month.

In the vehicle telemetry demo, they obtain vehicle telemetry data from an F1 game that has data output as UDP, 10s of thousands of packets a second – take the binary data off of UDP, and convert it to json and use it as a proxy for real telemetry data. The webserver refreshes every 3-4 seconds. The use case is analysis of increasing amounts of vehicle sensor data as discussed in this video and described in the detailed Omnisci blog post here.

Programming of the blocks is done through the UX with StreamSets which is an open source tool to build data pipelines –  a cool demo of streamsets for creating pipelines including kafka is here.

The vehicle analytics demo pipeline consisted of UDP to Kafka, Kafka to JSON, then JSON to OmniSci via pymapd .  Kafka serves as a message broker and also for playback of data.

Based on the the GPU loaded data, the database allows queries and stats on different vehicles that are running.

The entire system runs in the cloud on a VM supporting Nvidia GPUs, and can also be run on a local GPU box.