Large Language Models take up a lot of GPU memory, with the larger ones exceeding the memory of a single GPU. Space is taken up by the model weights as well as by the query-specific tensors computed in memory during inference. Model parallelism, i.e. storing an LLM across multiple GPUs, is both expensive and hard, which makes techniques for fitting an LLM on a single GPU important.
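As a rough back-of-the-envelope sketch of why this matters (the parameter count, layer count, hidden size, and context length below are illustrative assumptions, not any specific model's numbers): weight memory is parameters × bytes per parameter, and the KV cache adds a per-token cost on top.

```python
# Rough GPU-memory estimate for serving an LLM: weights + KV cache.
# All model sizes below are illustrative assumptions.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights, e.g. fp16/bf16 = 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, hidden_size: int, seq_len: int,
                batch_size: int = 1, bytes_per_value: int = 2) -> float:
    """KV cache stores one key and one value vector per layer per token."""
    return 2 * n_layers * hidden_size * seq_len * batch_size * bytes_per_value / 1e9

# Example: a hypothetical 13B-parameter model with 40 layers and hidden size 5120.
print(f"weights:  {weight_memory_gb(13e9):.1f} GB")                 # ~26 GB in fp16
print(f"KV cache: {kv_cache_gb(40, 5120, seq_len=4096):.1f} GB per 4k-token sequence")
```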
Let’s say the foundation models are available such that no further training is needed and one (just) wants to run inference against them. Inference is not a small challenge, and a number of techniques have been explored. Here’s a link – https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ – which discusses:
- student-teacher knowledge distillation training, leading to DistilBERT
- quantization, quantization-aware training, post-training quantization (a toy post-training quantization sketch follows this list)
- pruning
- architectural optimization, efficient transformers
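As a toy illustration of post-training quantization (my own sketch of symmetric per-tensor int8 quantization, not the exact scheme used by any particular library or the post above):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor post-training quantization to int8."""
    scale = w.abs().max() / 127.0                       # map the largest magnitude to 127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor for use in matmuls."""
    return q.float() * scale

w = torch.randn(4096, 4096)                             # a pretend weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())
print("memory: fp32", w.numel() * 4 / 1e6, "MB -> int8", q.numel() / 1e6, "MB")
```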
A GoPenAI blog post on speeding up LLMs and using 100k context windows (“all tricks in one place”) – https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c
High-throughput Generative Inference of Large Language Models with a Single GPU – https://arxiv.org/pdf/2303.06865.pdf – discusses three strategies, with a focus on running on a single GPU:
- model compression
- collaborative inference
- offloading to utilize memory from CPU and disk
They then present three contributions:
- definition of the optimization search space for offloading (weights, activations, KV cache) and an algorithm to find an optimal offloading strategy within that search space (a minimal offloading sketch follows this list)
- quantization of the parameters to 4 bits with small loss of accuracy
- running an OPT-175B model on a single T4 GPU with 16 GB of memory (!)
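A minimal sketch of the weight-offloading idea in PyTorch (my own illustration, not FlexGen's actual policy or scheduler): keep layer weights in CPU RAM and move each layer onto the GPU only for the duration of its forward pass.

```python
import torch
import torch.nn as nn

class OffloadedStack(nn.Module):
    """Toy layer-by-layer weight offloading: weights live on the CPU and are
    copied to the GPU one layer at a time during the forward pass."""
    def __init__(self, n_layers: int = 8, d: int = 1024):
        super().__init__()
        # Layers are constructed (and kept) on the CPU.
        self.layers = nn.ModuleList(nn.Linear(d, d) for _ in range(n_layers))

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        device = x.device
        for layer in self.layers:
            layer.to(device)            # stream this layer's weights onto the GPU
            x = torch.relu(layer(x))
            layer.to("cpu")             # evict it to make room for the next layer
        return x

if torch.cuda.is_available():
    model = OffloadedStack()
    out = model(torch.randn(2, 1024, device="cuda"))
    print(out.shape)
```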
PEFT – Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning – https://arxiv.org/pdf/2303.15647.pdf – says “expanding the context size leads to a quadratic increase in inference costs”.
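The quadratic term comes from self-attention's n × n score matrix per head. A back-of-the-envelope sketch (the head count, head dimension, and layer count are assumed values): each doubling of the context length roughly quadruples this part of the cost.

```python
def attention_score_flops(seq_len: int, n_heads: int = 32, head_dim: int = 128,
                          n_layers: int = 40) -> float:
    """FLOPs for the QK^T score matrices alone: n_layers * n_heads * n^2 * head_dim."""
    return n_layers * n_heads * seq_len**2 * head_dim

for n in (2_000, 4_000, 8_000):
    print(n, f"{attention_score_flops(n):.2e} FLOPs")
# Each doubling of seq_len multiplies this term by ~4x.
```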
There are three main classes of PEFT methods:
- Addition-based (within additive methods the survey distinguishes two large subgroups: adapter-like methods and soft prompts; a toy adapter sketch follows this list),
- Selection-based, and
- Reparametrization-based.
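As a toy example of the addition-based class (my own sketch of a bottleneck adapter, not the survey's exact formulation): a small trainable module is inserted after a frozen pretrained layer, so only a tiny fraction of the parameters is updated.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class AdaptedBlock(nn.Module):
    """A frozen pretrained layer with a small trainable adapter bolted on."""
    def __init__(self, pretrained_layer: nn.Module, d_model: int):
        super().__init__()
        self.frozen = pretrained_layer
        for p in self.frozen.parameters():
            p.requires_grad = False       # only the adapter is trained
        self.adapter = Adapter(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.frozen(x))

block = AdaptedBlock(nn.Linear(768, 768), d_model=768)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable params: {trainable} / {total}")
```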
General strategies for inference concurrency, courtesy ChatGPT:
To process multiple concurrent inference requests without interference between them, a model can use techniques such as parallelization and batching.
Parallelization involves splitting the workload across multiple processing units, such as CPUs or GPUs, so that multiple requests can be processed simultaneously without interfering with each other. This can be achieved using frameworks such as TensorFlow or PyTorch, which provide support for parallel processing.
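A minimal sketch of this (assuming a small model that fits on each device, and simulating the serving layer with a thread pool): keep one replica of the model per GPU and dispatch incoming requests to replicas round-robin.

```python
from concurrent.futures import ThreadPoolExecutor

import torch
import torch.nn as nn

# One small replica per available device (falls back to CPU so the sketch still runs).
devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
replicas = [nn.Linear(512, 512).to(d).eval() for d in devices]

@torch.no_grad()
def handle_request(args):
    idx, x = args
    i = idx % len(replicas)                  # round-robin assignment to a replica
    return replicas[i](x.to(devices[i])).cpu()

# Concurrent requests run in parallel threads, each against its own replica.
incoming = [torch.randn(1, 512) for _ in range(8)]
with ThreadPoolExecutor(max_workers=2 * len(replicas)) as pool:
    results = list(pool.map(handle_request, enumerate(incoming)))
print(len(results), results[0].shape)
```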
Batching involves grouping multiple requests together and processing them as a single batch. This can increase the efficiency of the model by reducing the overhead associated with processing each request individually. Batching can be particularly effective for models that are optimized for throughput rather than latency.
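A minimal dynamic-batching sketch (my own illustration, not any particular serving framework): requests that arrive within a short window are stacked into one tensor and served with a single forward pass.

```python
import queue
import threading

import torch
import torch.nn as nn

model = nn.Linear(512, 512).eval()
requests: "queue.Queue" = queue.Queue()      # holds (input tensor, per-request reply queue)

@torch.no_grad()
def batch_worker(max_batch: int = 8, timeout_s: float = 0.01):
    while True:
        batch = [requests.get()]                         # block for the first request
        while len(batch) < max_batch:
            try:
                batch.append(requests.get(timeout=timeout_s))
            except queue.Empty:
                break                                    # window closed, run what we have
        xs, reply_queues = zip(*batch)
        outputs = model(torch.stack(xs))                 # one forward pass for the whole batch
        for out, rq in zip(outputs, reply_queues):
            rq.put(out)

threading.Thread(target=batch_worker, daemon=True).start()

def infer(x: torch.Tensor) -> torch.Tensor:
    """Client-side call: enqueue the request and wait for its slice of the batch."""
    rq: queue.Queue = queue.Queue(maxsize=1)
    requests.put((x, rq))
    return rq.get()

print(infer(torch.randn(512)).shape)
```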
Another technique that can be used is dynamic scheduling, which involves assigning resources to requests based on their priority and the availability of resources at a given time. This can help ensure that high-priority requests are processed quickly without interfering with lower-priority requests.
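A sketch of priority-based scheduling (illustrative only; the priority levels and request IDs are made up): requests carry a priority, and a worker always dequeues the highest-priority pending request when capacity frees up.

```python
import heapq
import itertools
import threading
from typing import Optional

_counter = itertools.count()                 # tie-breaker so equal priorities stay FIFO
_lock = threading.Lock()
_pending: list = []                          # heap of (priority, arrival order, request id)

def submit(request_id: str, priority: int) -> None:
    """Lower number = higher priority (e.g. 0 = interactive, 9 = batch/background)."""
    with _lock:
        heapq.heappush(_pending, (priority, next(_counter), request_id))

def next_request() -> Optional[str]:
    """Called by a worker when GPU capacity frees up."""
    with _lock:
        return heapq.heappop(_pending)[2] if _pending else None

submit("chat-42", priority=0)
submit("nightly-report-7", priority=9)
submit("chat-43", priority=0)
print(next_request(), next_request(), next_request())   # chat-42 chat-43 nightly-report-7
```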
Efficiently Scaling Transformer Inference (link) – a paper from Google (Nov ’22) discussing partitioning of weights and activations across multiple heads and multiple chips.
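The partitioning idea can be sketched with a column-parallel linear layer (a toy version of tensor parallelism, not the paper's specific partitioning layouts): each device holds a slice of the weight matrix and computes a slice of the output, and the partial outputs are concatenated.

```python
import torch

def column_parallel_linear(x: torch.Tensor, w: torch.Tensor, n_shards: int = 2):
    """Split the weight matrix column-wise into shards, compute each shard's slice
    of the output (in a real system each shard would live on a different chip),
    then concatenate the partial outputs."""
    shards = torch.chunk(w, n_shards, dim=1)           # each shard: (d_in, d_out / n_shards)
    partial_outputs = [x @ shard for shard in shards]  # one matmul per device
    return torch.cat(partial_outputs, dim=-1)

x = torch.randn(4, 1024)
w = torch.randn(1024, 4096)
assert torch.allclose(column_parallel_linear(x, w), x @ w, atol=1e-5)
```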

