Tag: ml

Distributed Training

Distributed training aims to reduce the training time of a model in machine learning, by splitting the training workload across multiple nodes. As both the data and the model sizes have grown, distributed training has become an area of focus in ML. Training consists of iteratively minimizing an objective function by running the data through a model and determining a) the error and the gradients with which to adjust the model parameters (forward path) and b) the updated model parameters using calculated gradients (reverse path). The latter step always requires synchronization between the nodes, in some cases the first also requires communication.

There are three approaches to distributed training – data parallelism, model parallelism and data-model parallelism. Data parallelism is more common and preferred if the model fits in GPU memory.

In data parallelism, we partition the data on to different GPUs and and run the same model on these data partitions. The same model is present in all GPU nodes and no communication between nodes is needed on the forward path. The calculated parameters are sent to a parameter server, which averages them, and updated parameters are retrieved back by all the nodes to update their models to the same incrementally updated model.

In model parallelism, we partition the model itself into parts and run these on different GPUs.

To communicate the intermediate results between nodes the MPI primitives are leveraged, including AllReduce.

The amount of training data for BERT is ~600GB. BERT-Tiny model is 17MB, BERT-Base model is ~400MB. During training a 16GB memory GPU sees an OOM error.

Some links to resources –

https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/

https://github.com/horovod/horovod/blob/master/docs/concepts.rst

https://medium.com/pytorch/how-lyft-uses-pytorch-to-power-machine-learning-for-their-self-driving-cars-80642bc2d0ae

https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

https://aws.amazon.com/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/

https://openai.com/blog/scaling-kubernetes-to-2500-nodes/

https://towardsdatascience.com/distributed-deep-learning-training-with-horovod-on-kubernetes-6b28ac1d6b5d

https://mccormickml.com/2019/11/05/GLUE/ Origin of General Language Understanding Evaluation.

https://github.com/google-research/bert

Horovod core principles are based on the MPI concepts size, rank, local rank, allreduce, allgather, and broadcast. These are best explained by example. Say we launched a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:

  • Size would be the number of processes, in this case, 16.
  • Rank would be the unique process ID from 0 to 15 (size – 1).
  • Local rank would be the unique process ID within the server from 0 to 3.
  • Allreduce is an operation that aggregates data among multiple processes and distributes results back to them. Allreduce is used to average dense tensors. Here’s an illustration from the MPI Tutorial:
Allreduce Illustration
  • Allgather is an operation that gathers data from all processes in a group then sends data back to every process. Allgather is used to collect values of sparse tensors. Here’s an illustration from the MPI Tutorial:
Allgather Illustration
  • Broadcast is an operation that broadcasts data from one process, identified by root rank, onto every other process. Here’s an illustration from the MPI Tutorial:

Processors for Deep Learning: Nvidia A100 GPU, Tesla Dojo, AWS Inferentia

The NVidia Volta-100 GPU released in Dec 2017 was the first microprocessor with dedicated cores purely for matrix computations called Tensor Cores. The Ampere-100 GPU released May’20 is its successor. A comparison is provided here. Tensor Cores reduce the cycle time for matrix multiplications, operating on 4×4 matrices of 16bit floating point numbers. These GPUs are aimed at Deep Learning use cases which consist of a pipeline of matrix operations.

In math, we have Scalars and Vectors. Scalars are used for magnitude and Vectors encode magnitude and direction. To transform Vectors, one applies Linear Transformations in the form of Matrices. Matrices for Linear Transformations have EigenVectors and EigenValues which describe the invariants of the transformation. A Tensor in math and physics is a concept that exhibits certain types invariance during transformations. In 3 dimensions, a Stress Tensor has 9 components, which can be representated as a 3×3 matrix.

In Deep Learning applications a Tensor is basically a Matrix. The Generalized Matrix Multiplication (GEMM) operation, D=AxB+C, is at the heart of Deep Learning, and Tensor Cores are designed to speed these up.

Tesla Dojo is an advertised attempt to build a processor/computer dedicated for Deep Learning to process vast amounts of video data.

AWS Inferentia is a chip for deep learning inferencing, with its four Neuron Cores .

Generally speaking the desire in deep learning community is to have simpler processing units in larger numbers.

ML Transformer and GPT-2

Transformer architecture
AI meetup, GPT-2 demo and discussion. Attention!

“The attention mechanism allows the model to create the context vector as a weighted sum of the hidden states of the encoder RNN at each previous timestamp.”

“Transformer is a type of model based entirely on attention, and does not require recurrent or convolutional layers”

Context vector is the output of the Encoder in an Encoder-Decoder network (EDN). EDNs struggle to retain all the required information for the decoder to accurately decode. Attention is a mechanism to solve this problem.

“Attention mechanisms let a model directly look at, and draw from, the state at any earlier point in the sentence. The attention layer can access all previous states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens.”

GPT: Generative Pre-Trained Transformer. Unlike BERT, it is generative and not geared to comprehension/translation/summarization tasks, but writing/generative tasks. BERT is a response to GPT and GPT-2 is in turn a response to BERT. GPT-2 was released Feb’2019 and is trained on 40Gb of text

https://en.m.wikipedia.org/wiki/Transformer_(machine_learning_model)

This attention concept looks akin to a fourier or laplace transform which encodes the entire input signal in a lossless manner – just my observation. Although implemented differently it’s a way to keep track of and refer to global state.

AutoML and Transformer – http://ai.googleblog.com/2019/06/applying-automl-to-transformer.html

BERT and GPT are both based on the Transformer ideas. BERT is bidirectional and better at ccomprehending meaning from the whole sentence/phrase whereas GPT is better at generating text.

https://jalammar.github.io/illustrated-transformer/

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

Bahdanau, 2014 https://arxiv.org/abs/1409.0473

“The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.”