Large Language Models are big, with the largest far exceeding the memory of a single GPU, and model parallelism is hard.
Let’s say the foundation models are already available, so no training is needed and one just wants to run inference against them. This is still no small challenge, and a number of techniques have been explored:
https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- student-teacher knowledge distillation training, leading to DistilBERT
- quantization: quantization-aware training and post-training quantization (see the sketch after this list)
- pruning
- architectural optimization, efficient transformers
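As a concrete example of post-training quantization, here is a minimal PyTorch sketch using the built-in dynamic-quantization API; the tiny Linear stack is just a stand-in for a transformer block, not any particular model.

```python
# Minimal sketch of post-training (dynamic) quantization with PyTorch.
# A small Linear stack stands in for a transformer block; the same call
# applies to models whose heavy layers are nn.Linear.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Quantize the weights of all Linear layers to int8; activations are
# quantized dynamically at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 768])
```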
High-throughput Generative Inference of Large Language Models with a Single GPU (https://arxiv.org/pdf/2303.06865.pdf) discusses three strategies, with a focus on the third for fitting inference onto a single GPU:
- model compression
- collaborative inference
- offloading to utilize memory from CPU and disk (sketched below)
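The paper's system (FlexGen) overlaps transfers with compute and also offloads activations and the KV cache; the following is only a minimal sketch of the basic weight-streaming idea, with made-up layer sizes, not the paper's implementation.

```python
# Sketch of offloading: keep layer weights in CPU RAM and stream each
# layer onto the GPU only while it is needed, then move it back.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A stack of layers that (pretend) would not all fit on the GPU at once.
layers = [nn.Linear(1024, 1024) for _ in range(8)]

def offloaded_forward(x, layers):
    x = x.to(device)
    for layer in layers:
        layer.to(device)          # stream weights in
        with torch.no_grad():
            x = layer(x)
        layer.to("cpu")           # free GPU memory for the next layer
    return x

out = offloaded_forward(torch.randn(4, 1024), layers)
print(out.shape)  # torch.Size([4, 1024])
```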
They then present three contributions:
- definition of the optimization search space for offloading, including weights, activations, KV cache, and an algorithm to get an optimal offloading strategy within the search space
- quantization of the parameters to 4 bits with a small loss of accuracy (see the sketch after this list)
- running an OPT-175B model on a single T4 GPU with 16 GB of memory (!)
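The paper compresses weights and the KV cache with group-wise quantization; below is a rough sketch of group-wise 4-bit quantization and dequantization, with the group size and storage format (uint8, since PyTorch has no native 4-bit tensor type) chosen for illustration rather than taken from the paper.

```python
# Sketch of group-wise 4-bit quantization: the tensor is split into small
# groups, and each group is mapped onto 16 levels with its own scale and
# minimum (zero point).
import torch

def quantize_4bit(w: torch.Tensor, group_size: int = 64):
    flat = w.reshape(-1, group_size)
    wmin = flat.min(dim=1, keepdim=True).values
    wmax = flat.max(dim=1, keepdim=True).values
    scale = (wmax - wmin) / 15.0              # 4 bits -> 16 levels
    q = torch.round((flat - wmin) / scale).clamp(0, 15).to(torch.uint8)
    return q, scale, wmin

def dequantize_4bit(q, scale, wmin, shape):
    return (q.float() * scale + wmin).reshape(shape)

w = torch.randn(1024, 1024)
q, scale, wmin = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, wmin, w.shape)
print((w - w_hat).abs().max())  # small reconstruction error
```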
PEFT – Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning – https://arxiv.org/pdf/2303.15647.pdf
“expanding the context size leads to a quadratic increase in inference costs”
The authors identify three main classes of PEFT methods:
- Addition-based (within additive methods, the paper distinguishes two large subgroups: adapter-like methods and soft prompts),
- Selection-based, and
- Reparametrization-based (see the LoRA-style sketch below).
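As a concrete example of the reparametrization-based class, here is a rough LoRA-style sketch in PyTorch; the class name, rank, and scaling are illustrative and not taken from the survey.

```python
# Sketch of a reparametrization-based PEFT method (LoRA-style): the frozen
# weight W is augmented with a low-rank update B @ A, and only A and B
# are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank A and B are trainable
```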
General strategies for inference concurrency, courtesy of ChatGPT:
To process multiple concurrent inference requests without interference between them, a serving system can use techniques such as parallelization and batching.
Parallelization involves splitting the workload across multiple processing units, such as CPUs or GPUs, so that multiple requests can be processed simultaneously without interfering with each other. This can be achieved using frameworks such as TensorFlow or PyTorch, which provide support for parallel processing.
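A minimal sketch of request-level parallelism in PyTorch, assuming one full model replica fits on each device; the tiny Linear model and round-robin dispatch are placeholders for a real serving setup.

```python
# One model replica per device; a thread pool dispatches incoming
# requests round-robin across the replicas.
from concurrent.futures import ThreadPoolExecutor
import torch
import torch.nn as nn

devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
replicas = [nn.Linear(128, 128).to(d).eval() for d in devices]

def serve(request_id: int, x: torch.Tensor):
    model = replicas[request_id % len(replicas)]
    device = next(model.parameters()).device
    with torch.no_grad():
        return model(x.to(device)).cpu()

with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
    futures = [pool.submit(serve, i, torch.randn(1, 128)) for i in range(8)]
    results = [f.result() for f in futures]
print(len(results))  # 8 independent requests served
```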
Batching involves grouping multiple requests together and processing them as a single batch. This can increase the efficiency of the model by reducing the overhead associated with processing each request individually. Batching can be particularly effective for models that are optimized for throughput rather than latency.
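A minimal sketch of batching in PyTorch: several variable-length requests are padded into a single batch and served by one forward pass; the embedding model is only a stand-in for a real language model.

```python
# Pad individual requests into one rectangular batch and run a single
# forward pass, amortizing per-call overhead across requests.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

model = nn.Embedding(1000, 64).eval()  # stand-in for a real LM

# Three "requests" of different lengths (token id sequences).
requests = [torch.randint(0, 1000, (n,)) for n in (5, 9, 3)]

batch = pad_sequence(requests, batch_first=True, padding_value=0)
with torch.no_grad():
    out = model(batch)
print(out.shape)  # (3, 9, 64): one forward pass serves all three requests
```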
Another technique that can be used is dynamic scheduling, which involves assigning resources to requests based on their priority and the availability of resources at a given time. This can help ensure that high-priority requests are processed quickly without interfering with lower-priority requests.
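A minimal sketch of priority-based scheduling using a plain Python heap; the priorities and handler are placeholders, and a real scheduler would also track device load and resource availability.

```python
# Requests carry a priority; a worker always pulls the most urgent one
# next. Lower number = higher priority; a counter breaks ties FIFO.
import heapq
import itertools

queue, counter = [], itertools.count()

def submit(priority: int, request: str):
    heapq.heappush(queue, (priority, next(counter), request))

def run_next():
    priority, _, request = heapq.heappop(queue)
    print(f"handling {request!r} (priority {priority})")

submit(5, "batch analytics prompt")
submit(1, "interactive chat turn")
submit(3, "background summarization")

while queue:
    run_next()  # chat turn first, then summarization, then analytics
```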