
AlphaFold for protein structure prediction with deep learning – how does it use attention mechanisms

AlphaFold is a deep learning model developed by DeepMind that predicts protein structure. It works in two broad stages: first, it builds a representation of the protein’s amino acid sequence; then, it refines this representation to predict the 3D structure of the protein. The model is trained on a large database of known protein structures. The original AlphaFold (2018) relied on convolutional residual networks, whereas AlphaFold 2 replaced them with an attention-based architecture (the Evoformer, followed by a structure module). Attention lets the model incorporate information from many parts of the protein sequence during prediction, which is central to its accuracy.

Attention mechanisms are a key component of AlphaFold and play a crucial role in capturing dependencies between different parts of a protein sequence.

This is an interesting application of attention because it operates on a sequence of amino acids rather than a sequence of word vectors. Understanding it should deepen one’s appreciation of neural networks and attention mechanisms as general techniques that are not limited to large language models.

To understand attention mechanisms, let’s break them down step by step.

  1. Embedding the Protein Sequence:
    AlphaFold starts by embedding the amino acid sequence of a protein into a numerical representation. Each amino acid is represented as a vector, and these vectors are combined to form the input sequence matrix, X ∈ ℝ^(L×D), where L is the length of the sequence and D is the dimensionality of each amino acid vector.
  2. Creating Query, Key, and Value Matrices:
    AlphaFold then generates three matrices – Query (Q), Key (K), and Value (V) – by linearly transforming the input sequence matrix X. This transformation is performed using learnable weight matrices WQ, WK, and WV. The resulting matrices are Q = XWQ, K = XWK, and V = XWV, each having dimensions of L×D.
  3. Calculating Attention Weights:
    The attention mechanism computes the similarity between each query vector and each key vector by taking their dot product. This similarity is divided by √D, and a softmax function is applied across the keys to obtain attention weights. The attention weights determine how much each key’s value contributes to the final output. Let’s denote the attention weights matrix as A ∈ ℝ^(L×L), where each element A_ij represents the attention weight between the i-th query and j-th key. The attention weights are calculated as follows: A_ij = exp(Q_i ⋅ K_j / √D) / Σ_k exp(Q_i ⋅ K_k / √D). Here, Q_i represents the i-th row of the Query matrix, and K_j represents the j-th row of the Key matrix. Equivalently, A is the row-wise softmax of QKᵀ / √D, so each row of A sums to 1.
  4. Weighted Sum of Values:
    The final step is to compute the weighted sum of the Value matrix using the attention weights. This is done by multiplying the attention weights matrix A by the Value matrix V. The resulting matrix C, representing the context or attended representation, is given by C = AV. The context matrix C has dimensions of L×D, where each row is a weighted sum of the Value vectors based on the attention weights.
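
The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not AlphaFold’s actual code: the sizes L = 6 and D = 4 are arbitrary, the input X and the weight matrices are random stand-ins for learned values, and the names `softmax_rows` and `self_attention` are hypothetical.

```python
import numpy as np

def softmax_rows(scores):
    """Softmax applied independently to each row (shifted for numerical stability)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # step 2: each is L x D
    D = Q.shape[1]
    A = softmax_rows(Q @ K.T / np.sqrt(D))   # step 3: L x L attention weights
    C = A @ V                                # step 4: L x D context matrix
    return A, C

rng = np.random.default_rng(0)
L, D = 6, 4                                  # assumed toy sizes
X = rng.normal(size=(L, D))                  # step 1: stand-in for embedded residues
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

A, C = self_attention(X, W_Q, W_K, W_V)
print(A.shape, C.shape)                      # (6, 6) (6, 4)
```

Each row of `A` sums to 1, so each row of `C` is a convex combination of the rows of `V`.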

Attention mechanisms in AlphaFold allow the model to capture the relationships and dependencies between different parts of the protein sequence. By assigning attention weights to relevant amino acids, the model can focus on important regions during the prediction process, enabling accurate protein structure predictions.

The dimensions of the matrices involved are as follows (simplified):

  • Input Sequence Matrix (X): X ∈ ℝ^(L×D), where L is the length of the protein sequence and D is the dimensionality of each amino acid vector.
  • Query Matrix (Q): Q ∈ ℝ^(L×D), same as the dimensions of X.
  • Key Matrix (K): K ∈ ℝ^(L×D), same as the dimensions of X.
  • Value Matrix (V): V ∈ ℝ^(L×D), same as the dimensions of X.
  • Attention Weights Matrix (A): A ∈ ℝ^(L×L), where each element A_ij represents the attention weight between the i-th query and j-th key.
  • Context Matrix (C): C ∈ ℝ^(L×D), same as the dimensions of X.

So the matrices Q, K, V, X, and C have dimensions L×D, where L represents the length of the protein sequence and D represents the dimensionality of the amino acid vectors. The attention weights matrix A has dimensions L×L, capturing the attention weights between each query and key pair.

OpenFold is a PyTorch-based reproduction of AlphaFold; a comparison is given in https://wandb.ai/telidavies/ml-news/reports/OpenFold-A-PyTorch-Reproduction-Of-DeepMind-s-AlphaFold–VmlldzoyMjE3MjI5

The AlphaFold 2 paper is at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8371605/

Equivariant attention is a special form of attention mechanism that has been adapted to respect the symmetries inherent in the 3D structure of proteins.

To understand the motivation behind this, let’s first recognize a key property of proteins: their function is determined by their 3D structure. However, the position and orientation of a protein in 3D space don’t matter for its function. That is, if you rotate or translate a protein in space, its function remains unchanged. This is a symmetry, and in mathematical terms, the group of these transformations is known as the special Euclidean group SE(3) (comprising rotations and translations in 3D space).

Equivariance, in general, refers to the idea that if you change the input in a certain way (e.g., rotate it), the output should change in a corresponding manner without affecting its inherent properties or information.

In the AlphaFold 2 paper, the authors designed the attention mechanism of the structure module, called Invariant Point Attention (IPA), around these symmetries. The output of IPA is invariant to global rotations and translations of the residue frames, and the updates it drives are expressed in each residue’s local frame, so the predicted structure as a whole is equivariant: if you were to rotate or translate the 3D coordinates of a protein’s residues, the predicted coordinates would rotate or translate in the same way, while the structural information captured remains the same.

This design choice helps the model to be more robust and efficient in learning and predicting protein structures, as it doesn’t have to relearn patterns for every possible orientation of a protein in space.
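
The difference between an invariant and an equivariant quantity can be checked numerically. The sketch below is an illustration only and is not taken from AlphaFold: the coordinates are random stand-ins for residue positions, and the helper names are hypothetical. Pairwise distances do not change under a rotation plus translation (invariance), while the centroid moves exactly as the points do (equivariance).

```python
import numpy as np

def pairwise_distances(coords):
    """Matrix of Euclidean distances between all pairs of points."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def rotation_about_z(theta):
    """3x3 rotation matrix for an angle theta around the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))        # assumed positions of 5 residues
R = rotation_about_z(0.7)
t = np.array([1.0, -2.0, 0.5])
moved = coords @ R.T + t                # rotate, then translate every point

# Invariant feature: pairwise distances are the same before and after.
distances_match = np.allclose(pairwise_distances(coords), pairwise_distances(moved))
# Equivariant feature: the centroid is transformed exactly like the points.
centroid_match = np.allclose(moved.mean(axis=0), R @ coords.mean(axis=0) + t)
print(distances_match, centroid_match)  # True True
```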

The technical details of implementing such an equivariant attention mechanism involve crafting the attention equations to respect these symmetries, and it’s a topic in group theory and deep learning. The idea of introducing equivariance into neural network architectures isn’t entirely new and has seen applications in other fields like computer vision, but its application in the AlphaFold system showcases its utility in the domain of structural biology.


Three nucleotides make a codon, and each codon encodes an amino acid (or a stop signal). The codon table maps nucleotide triplets to amino acids. The same codons code for the same amino acids in nearly all organisms.

A sequence of nucleotides encodes a protein through a process called protein synthesis, which involves two main stages: transcription and translation.

What are amino acids ?

Amino acids are organic compounds that serve as the building blocks of proteins. They contain an amino group (-NH2) and a carboxyl group (-COOH) attached to a central carbon atom, along with a specific side chain (R-group). The side chain varies among different amino acids, giving them unique properties.

There are 20 standard amino acids that are commonly found in proteins. Each amino acid has a unique structure and properties, determined by its specific side chain. They are glycine, alanine, valine, leucine, isoleucine, serine, threonine, cysteine, methionine, aspartic acid, glutamic acid, lysine, arginine, histidine, phenylalanine, tyrosine, tryptophan, asparagine, glutamine, and proline.

Proteins are assembled from amino acids through a process called translation. The genetic information stored in DNA is first transcribed into messenger RNA (mRNA). The mRNA is then read by ribosomes, which assemble amino acids in a specific sequence according to the instructions provided by the mRNA. This sequence of amino acids forms a polypeptide chain, which then folds into a functional protein with a specific structure and function. The sequence of amino acids in a protein is determined by the sequence of nucleotides in the corresponding mRNA molecule.
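
Transcription and translation can be sketched as a simple lookup. This is a toy illustration: the table holds only five of the 64 codons of the standard genetic code, the DNA string is made up, and the function names are hypothetical.

```python
# Partial codon table (standard genetic code); the full table has 64 entries.
CODON_TABLE = {
    "AUG": "Met",   # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "UAA": "STOP",
}

def transcribe(dna_coding_strand):
    """DNA coding strand -> mRNA: thymine (T) is replaced by uracil (U)."""
    return dna_coding_strand.replace("T", "U")

def translate(mrna):
    """Read the mRNA three nucleotides at a time until a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        peptide.append(amino_acid)
    return peptide

mrna = transcribe("ATGTTTGGCGCTTAA")     # made-up example sequence
print(mrna)                              # AUGUUUGGCGCUUAA
print(translate(mrna))                   # ['Met', 'Phe', 'Gly', 'Ala']
```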

https://www.forbes.com/sites/robtoews/2023/07/16/the-next-frontier-for-large-language-models-is-biology/

https://www.nature.com/articles/s41592-023-01924-w A team of researchers led by Peter Kim at Stanford University has performed guided protein evolution using protein language models that were trained on millions of natural protein sequences.

https://aibusiness.com/nlp/meta-lays-off-team-behind-its-protein-folding-model

3D structure of a protein

Transformer revisited

Q: what are similarities and differences between autocorrelation used on signals in signal processing and attention used on sequences of word vectors in transformers in deep learning ?

There are some similarities between the attention mechanism and autocorrelation in signal processing, in the sense that both involve computing a similarity measure between a signal and a shifted version of itself.

In autocorrelation, the similarity measure is typically computed as the dot product between the signal and a shifted version of itself. The resulting autocorrelation function can reveal patterns in the signal that are repeated over time, such as periodicity or other types of regularity.

Similarly, in the attention mechanism, the similarity measure between the query and the keys is computed as the dot product, and the resulting attention scores can reveal patterns in the input that are relevant to the query. These patterns can be used to weight the corresponding values when computing the final attention output.

However, there are also some important differences between the two concepts. In signal processing, autocorrelation is used primarily for analyzing signals, while the attention mechanism is used primarily for modeling relationships between elements in a set or sequence. Additionally, the attention mechanism can be more flexible and powerful than autocorrelation, as it can take into account multiple queries and keys simultaneously, and can be applied to a wide range of tasks in natural language processing, computer vision, and other domains.
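
To make the comparison concrete, here is a minimal sketch of autocorrelation as a series of dot products between a signal and shifted copies of itself. The sine wave with period 8 and the function name are assumptions chosen for illustration. Attention differs in that it compares every position with every other position after learned projections, rather than comparing the raw signal with fixed shifts.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """r[k] is the dot product of the signal with itself shifted by k samples."""
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(max_lag + 1)])

n = np.arange(64)
signal = np.sin(2 * np.pi * n / 8)          # toy signal with period 8
r = autocorrelation(signal, max_lag=16)

best_lag = int(np.argmax(r[1:])) + 1        # strongest lag other than 0
print(best_lag)                             # 8
```

The peak at lag 8 recovers the period of the signal, which is the kind of regularity autocorrelation is used to reveal.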

Q: How are the Q, K, V matrices derived from the input  X ? can you confirm that these are functions of  X ?

A: The transformer architecture uses matrices Q, K, and V that are all functions of the input X, derived by applying learned linear projections. The queries Q are a projection of X that expresses what each position is looking for, while the keys K and values V are projections of X used for similarity matching and information retrieval, respectively. By deriving Q, K, and V from X, the transformer can relate elements of the input to one another through self-attention.

In standard self-attention, Q is derived by applying a learned projection directly to X, so Q = XWQ. The description “Q is a projection of a learned latent array with index dimension N ≪ M” comes from the Perceiver architecture, where the queries are computed from a learned latent array rather than from X, and only K and V are projections of the input.

K and V are likewise derived by applying learned projections to the input X: K = XWK and V = XWV.

Q: How are the Q and K matrices incrementally adjusted in the forward and backward passes of back propagation, during training

A: During training with backpropagation, what gets adjusted is not Q and K themselves but the weight matrices WQ and WK that produce them:

  • Before training, WQ and WK are initialized randomly. In each forward pass, Q = XWQ and K = XWK are computed from the current input X and the current weights.
  • In the backward pass, the gradients of the loss function are propagated back through the network to WQ and WK.

The gradients are computed with the chain rule. The loss L depends on the attention weights A, which in turn depend on Q and K, which depend on WQ and WK.

So

  • dL/dWQ = dL/dA * dA/dQ * dQ/dWQ and
  • dL/dWK = dL/dA * dA/dK * dK/dWK

These gradients are used to update WQ and WK with optimization techniques like stochastic gradient descent or Adam.

So in the forward pass Q and K are computed, and in the backward pass the gradients with respect to WQ and WK are computed and the weights are updated to reduce the loss. This incremental adjustment via backpropagation is what allows Q and K to become useful representations.
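
The chain rule above can be traced by hand on a deliberately simplified example. The sketch below makes several assumptions that are not in the text: the softmax is omitted, the loss is a fixed weighted sum of the scaled scores, and all sizes and names are hypothetical. Under those assumptions it shows that the gradient flows through Q and K to the weight matrices, and that the update is applied to the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4                   # assumed toy sizes
X = rng.normal(size=(T, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
G = rng.normal(size=(T, T))                 # fixed coefficients defining a toy loss

def loss(W_Q, W_K):
    """Toy loss: weighted sum of the scaled attention scores (no softmax)."""
    Q, K = X @ W_Q, X @ W_K                 # forward pass
    return np.sum(G * (Q @ K.T) / np.sqrt(d_k))

# Backward pass by hand, following the chain rule.
Q, K = X @ W_Q, X @ W_K
dL_dQ = G @ K / np.sqrt(d_k)                # gradient with respect to Q
dL_dK = G.T @ Q / np.sqrt(d_k)              # gradient with respect to K
dL_dW_Q = X.T @ dL_dQ                       # gradient reaches the weights
dL_dW_K = X.T @ dL_dK

# One gradient descent step changes the weights; Q and K are recomputed from them.
lr = 1e-3
W_Q_new = W_Q - lr * dL_dW_Q
W_K_new = W_K - lr * dL_dW_K
print(loss(W_Q, W_K), loss(W_Q_new, W_K_new))
```

In practice an automatic differentiation framework computes these gradients, including the softmax and every other layer.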

Q: in a transformer with Q, K, V matrices where these are projections of the input X, can you confirm that the associated weight matrices for each transform remain fixed for any input X, that is they are a function of the corpus and training and not an individual input X ?

A: Yes. The weight matrices used for these projections do not depend on any individual input X. They are learned during training on the corpus and held fixed at inference time (unless the model is fine-tuned); only the resulting Q, K, and V change from one input to the next.

Q: in a transformer, Q, K, V matrices are derived from X as Q = XWQ, K = XWK, and V = XWV . what are the exact dimensions of X, WQ, WK, WV here ?

A: In the context of the Transformer architecture (as introduced in the “Attention Is All You Need” paper by Vaswani et al.), the matrices Q, K, and V are derived from the input X using weight matrices WQ, WK, and WV, respectively. The exact dimensions of X, WQ, WK, and WV are determined by the problem setting and the design choices of the transformer model.

However, for clarity, let’s break down the dimensions based on the original Transformer model:

  1. X: The dimension of X is [T, d_model], where T is the sequence length and d_model is the dimension of the embeddings (e.g., 512 in the original Transformer model).
  2. WQ, WK, WV: These weight matrices are used to project X into the corresponding Q, K, and V spaces. In the original Transformer, the attention heads operate in a reduced-dimensional space of size d_k. So, the dimensions of these matrices are:
    • WQ: [d_model, d_k]
    • WK: [d_model, d_k]
    • WV: [d_model, d_v]

In the case of multi-head attention, there are h heads, and d_model = h × d_k. In the original Transformer, d_k = d_v = 64. Hence, for 8 heads (h = 8), d_model = 512.

So, given the above:

  • Q will have dimensions [T, d_k].
  • K will have dimensions [T, d_k].
  • V will have dimensions [T, d_v].

However, keep in mind that in the multi-head attention setting, these projections happen independently for each head, resulting in multiple Q, K, and V matrices – one set for each head.

To decide on the exact dimensions in a specific scenario, one would need to know the exact design choices or configurations of the transformer in question.
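
A quick shape check makes these dimensions concrete for a single head. The values d_model = 512 and d_k = d_v = 64 come from the discussion above; the sequence length T = 10 and the random contents are assumptions for illustration.

```python
import numpy as np

T, d_model, d_k, d_v = 10, 512, 64, 64      # T = 10 is an arbitrary example length
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))           # one row per token
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(X.shape, W_Q.shape, Q.shape)          # (10, 512) (512, 64) (10, 64)
print(K.shape, V.shape)                     # (10, 64) (10, 64)
```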

Q: so there is a difference between d_model and d_k. how is d_k chosen in that paper ?

A: Yes, there is a distinction between d_model and d_k in the Transformer architecture. In the original “Attention Is All You Need” paper by Vaswani et al., the multi-head attention mechanism is introduced. Each “head” of this multi-head attention operates in a reduced-dimensional space defined by d_k for queries and keys, and d_v for values.

In the original Transformer model:

d_model = 512
d_k = d_v = 64

The model uses h = 8 heads, which means:

d_model = h × d_k

So, 512 = 8 × 64.

The choice of d_k = 64 in the paper is a design decision, and it’s likely the result of experimentation to find a balance between computational efficiency and model performance. By having a smaller d_k, the dot products in the attention mechanism are computed in a reduced-dimensional space, which is more computationally efficient. By using multiple heads, the model can still capture different types of attention patterns in parallel.

The paper does not go into explicit detail about how they arrived at d_k = 64 beyond this design rationale, and the choice could be the result of empirical testing and optimization.

Q: how are the multiple different transformer units strung together ? there are multiple Q, K, V matrices in a single neural network as described in the paper, right ?

A: Absolutely. The original “Attention Is All You Need” paper introduced the Transformer model, which consists of multiple layers of transformer units, and each of these units employs the multi-head attention mechanism. I’ll walk you through how these units are strung together and how multiple Q, K, V matrices come into play.

Multi-Head Attention:
    Within a single transformer unit (or layer), there's the concept of multi-head attention. This means the attention mechanism is applied multiple times (with different weight matrices) in parallel. For the original Transformer, this number is h = 8.
    For each head, separate Q, K, and V matrices are computed, resulting in h sets of Q, K, V matrices. These are then used to compute h different attention outputs.
    These h attention outputs are concatenated and linearly transformed to produce a single output for that transformer layer.

Stacked Transformer Layers:
    The Transformer model consists of several stacked layers of the transformer units. In the original paper, they used 6 layers for both the encoder and the decoder. Each of these layers has its own set of weight matrices for computing Q, K, and V, and hence, its own multi-head attention mechanism.
    The output from one layer is used as the input to the next layer, allowing for increasingly abstract representations as you move up the layers.

Encoder-Decoder Structure:
    The original Transformer model has an encoder-decoder structure. Each of these consists of multiple transformer layers.
    The encoder takes in the source sequence and produces a representation. This representation is then used by the decoder (alongside the target sequence) to produce the final output.
    In the decoder, there are actually two attention mechanisms in each layer: one that attends to the decoder's own previous outputs (self-attention, like in the encoder) and another that attends to the encoder's output (cross-attention).

Residual Connections & Feed-Forward Networks:
    Besides the attention mechanisms, each transformer layer also contains a position-wise feed-forward network and residual connections. The output from the multi-head attention is passed through this feed-forward network before being sent to the next layer.
    Residual connections help in training deep networks by bypassing layers with the identity function. This is a crucial aspect of the Transformer's architecture.

To visualize, imagine the encoder as a vertical stack of blocks (transformer layers) where each block has 8 separate “channels” or “pipes” (multi-head attention) running through it. The same visualization applies to the decoder, but with an added complexity of having two attention mechanisms in each layer.

Yes, due to the multi-head attention and multiple layers, there are indeed multiple Q, K, and V matrices in the Transformer model as described in the paper.
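
The multi-head step described above (separate projections per head, then concatenation and a final linear transform) can be sketched as follows. This is a simplified illustration with random stand-in weights: it omits residual connections, layer normalization, masking, and the feed-forward network, and the names `multi_head_attention` and `W_O` are assumptions.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, shifted for numerical stability."""
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[1]))
        outputs.append(A @ V)                       # T x d_v for this head
    concatenated = np.concatenate(outputs, axis=1)  # T x (h * d_v)
    return concatenated @ W_O                       # T x d_model

rng = np.random.default_rng(0)
T, d_model, h = 10, 512, 8                          # T is an arbitrary example length
d_k = d_model // h                                  # 64, as in the original paper

X = rng.normal(size=(T, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))
    for _ in range(h)
]
W_O = rng.normal(size=(h * d_k, d_model)) * 0.05    # output projection

out = multi_head_attention(X, heads, W_O)
print(len(heads), out.shape)                        # 8 (10, 512)
```

Because the output has the same shape as the input, layers of this kind can be stacked, with the output of one layer feeding the next.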

Q: can you count exactly how many Q,K, V matrices in total are in the above description ? think step by step

A: Certainly! Let’s break it down step by step for both the encoder and the decoder:

Encoder:

Multi-Head Attention within a Single Layer:
    Each transformer layer in the encoder has a multi-head attention mechanism with h = 8 heads.
    Therefore, for each layer, there are:
        8 Q matrices
        8 K matrices
        8 V matrices

Stacked Transformer Layers:
    The encoder consists of 6 stacked transformer layers.
    For all layers in the encoder:
        Q: 6 × 8 = 48
        K: 6 × 8 = 48
        V: 6 × 8 = 48

Decoder:

Self-Attention within a Single Layer:
    Similar to the encoder, each transformer layer in the decoder has a self-attention mechanism with h = 8 heads.
    For each layer:
        8 Q matrices for self-attention
        8 K matrices for self-attention
        8 V matrices for self-attention

Cross-Attention to Encoder Output within a Single Layer:
    Each transformer layer in the decoder also has a cross-attention mechanism that attends to the encoder's output. This also has h = 8 heads.
    For each layer:
        8 Q matrices for cross-attention (from the decoder)
        8 K matrices for cross-attention (from the encoder)
        8 V matrices for cross-attention (from the encoder)

Stacked Transformer Layers in the Decoder:
    The decoder also consists of 6 stacked transformer layers.
    For all layers in the decoder (considering both self-attention and cross-attention):
        Q: 6 × 8 + 6 × 8 = 96 (48 from self-attention and 48 from cross-attention)
        K: 6 × 8 + 6 × 8 = 96
        V: 6 × 8 + 6 × 8 = 96

Total Count:

Adding up both the encoder and the decoder:

Total Q matrices: 48 (from encoder) + 96 (from decoder) = 144
Total K matrices: 48 + 96 = 144
Total V matrices: 48 + 96 = 144

So, in total, across the entire Transformer model (encoder and decoder combined) as described in the original paper, there are:

144 Q matrices
144 K matrices
144 V matrices
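
The same count can be reproduced with a few lines of arithmetic. The variable names are illustrative; the numbers (8 heads, 6 encoder layers, 6 decoder layers, two attention blocks per decoder layer) are the ones used in the answer above.

```python
heads = 8
encoder_layers = 6
decoder_layers = 6
attention_blocks_per_decoder_layer = 2      # self-attention and cross-attention

encoder_sets = encoder_layers * heads
decoder_sets = decoder_layers * heads * attention_blocks_per_decoder_layer
total_sets = encoder_sets + decoder_sets    # one Q, one K, one V per set

print(encoder_sets, decoder_sets, total_sets)   # 48 96 144
```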

Q: why do they use XWQ instead of the more typical linear transform Wq.X ?

A: The notation “XW^Q” versus “W^QX” is primarily a matter of convention. Both describe the same linear transformation, but the order of the operands is swapped and the matrices are transposed accordingly, since (XW)ᵀ = WᵀXᵀ. The choice of one over the other can be influenced by multiple factors, such as:

Data Representation: In some frameworks or notations, data instances are represented as rows in a matrix (often called "batch-first" convention). In others, data instances are columns (often called "sequence-first" or "time-first" convention). The choice of convention can dictate the order of the matrix multiplication to ensure the dimensions align correctly.

Historical and Cultural Preferences: Some fields or communities have preferred conventions based on historical precedent, educational materials, or influential publications.

Computational Efficiency: Depending on the data storage, memory access patterns, and the specific operations used, one form might be more efficient than the other. This is more of a concern when optimizing at a low level or for specific hardware.

In the “Attention Is All You Need” paper and many deep learning publications, the “batch-first” convention is used, which means input data X is represented with batches as rows. In this format, XW^Q makes sense because each row (data instance) of X is being linearly transformed by W^Q.

On the other hand, if you’re coming from a background where column vectors represent data instances (e.g., traditional signal processing or control systems), the notation W^QX might feel more natural.

It’s essential to be flexible in understanding both notations and to always check the context in which they’re used.
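
The equivalence of the two conventions can be verified numerically. In this sketch the shapes (10 tokens, 512-dimensional embeddings, 64-dimensional queries) and the random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))          # row convention: one token per row
W_Q = rng.normal(size=(512, 64))

Q_rows = X @ W_Q                        # XW: result is 10 x 64
Q_cols = W_Q.T @ X.T                    # WX form on transposed data: 64 x 10

print(Q_rows.shape, Q_cols.shape)       # (10, 64) (64, 10)
print(np.allclose(Q_rows, Q_cols.T))    # True
```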

Q: what do you mean by input data X is represented with batches as rows. is each row an embedding vector representing a word ?

A: Yes, I apologize for not being clear. You’re correct. In the context of the Transformer architecture and many deep learning models, when we talk about data X, it often refers to a matrix where:

Each row corresponds to an embedding vector of a token (e.g., a word or subword).
The number of rows represents the sequence length (i.e., the number of tokens in a given sequence).

For instance, let’s say you’re processing a sentence with 10 words, and you’re using an embedding dimension of 512. Your input X for that sentence would be a 10×512 matrix. Here, each of the 10 rows is a 512-dimensional embedding vector corresponding to each of the 10 words.

Now, when we introduce batching into the mix:

A batch consists of multiple such sequences (sentences) processed together for computational efficiency.
If you have a batch of N sentences, and each sentence (sequence) has been padded or truncated to have a length of T tokens, your batched input data would be an N×T×512 tensor.

However, when dealing with linear transformations in the context of the Transformer’s attention mechanism, we’re often focused on one sequence at a time (or a batch of sequences but processed in parallel), so we can think of X as a T×512 matrix, with T being the sequence length and 512 being the embedding dimension.

The notation “XW^Q” then makes sense in this context, as each row (token embedding) in X is transformed by the matrix W^Q to produce the query matrix Q.
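
The batched case can be sketched with NumPy’s broadcasting matrix multiplication, which applies the same weight matrix to every sequence in the batch. The batch size N = 4, sequence length T = 10, and query dimension 64 are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d_model, d_k = 4, 10, 512, 64         # assumed example sizes
X_batch = rng.normal(size=(N, T, d_model))  # N padded sequences of T tokens
W_Q = rng.normal(size=(d_model, d_k))       # one weight matrix shared by all

Q_batch = X_batch @ W_Q                     # W_Q is broadcast over the batch axis
print(Q_batch.shape)                        # (4, 10, 64)

# Each sequence gets the same result as if it were processed on its own.
same = np.allclose(Q_batch[2], X_batch[2] @ W_Q)
print(same)                                 # True
```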

Q: can you give some reference papers for applications of group theory in deep learning

A: Absolutely! Here’s a list of notable papers that touch upon the intersection of group theory and deep learning:

  1. Equivariant Neural Networks:
    • “Group Equivariant Convolutional Networks” by Taco S. Cohen and Max Welling. This paper is foundational for introducing group equivariance into CNNs.
    • “Steerable CNNs” by Taco S. Cohen et al. This work dives into orientation and scale-equivariant CNNs using steerable filters.
  2. Invariant Neural Networks:
    • “Deep Sets” by Manzil Zaheer et al. This paper isn’t about group theory per se, but it deals with the idea of permutation invariance in neural networks, which relates to certain group symmetries.
  3. Transformers and Group Theory:
    • While the application of group theory to transformers isn’t as well-established as with CNNs, the AlphaFold 2 paper (as we discussed) can serve as a reference for equivariant attention mechanisms.
  4. General Understanding and Overview:
    • “A General Theory of Equivariant CNNs on Homogeneous Spaces” by Taco S. Cohen et al. provides a broader understanding of the topic.
  5. Applications in Molecular Biology:
    • “Equivariant Networks for Molecular Property Prediction” by Bowen Cui, Ivan Zhang, and Zachary Teed explores the applications of group equivariance in predicting molecular properties.
  6. For More Advanced Topics and Generalizations:
    • “Gauge Equivariant Convolutional Networks and the Icosahedral CNN” by Taco S. Cohen et al. explores more advanced structures and symmetries.

While this list provides a good starting point, it’s essential to note that the intersection of group theory and deep learning is an active research area, with new contributions regularly. Keeping an eye on major AI conferences and journals will reveal the latest advancements in this domain.

LLM evolution – Anthropic, AI21, Cohere, GPT-4

https://github.com/Mooler0410/LLMsPracticalGuide

Source paper – Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

In the evolutionary tree figure from that paper, the pink branch is encoder-only, the green branch is encoder-decoder, and the blue branch is decoder-only.

The decoder-only structure is consistent with the generative aspect of the blue branch, but it does not explain the emergent properties at the top of the blue tree.

LLM leaderboard – https://chat.lmsys.org/?leaderboard

Stanford HELM (holistic evaluation of LMs) – https://crfm.stanford.edu/helm/latest/?models=1

Constitutional AI paper from Anthropic – https://arxiv.org/abs/2212.08073

More on emergent properties in links below.

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1

https://openai.com/research/solving-math-word-problems : Autoregressive models, which generate each solution token by token, have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable, as can be seen in the examples provided. We address this problem by training verifiers to evaluate the correctness of model-generated solutions. Verifiers are given many possible solutions, all written by the model itself, and they are trained to decide which ones, if any, are correct.

Language Models are Few-Shot Learners – https://openai.com/research/language-models-are-few-shot-learners

LLM inferencing tools/techniques were discussed here.

LLM Inferencing is hard – tools and techniques

Large Language Models take up a lot of GPU memory, with the larger ones exceeding the memory of a single GPU. Space is taken up by the model weights as well as by in-memory, query-specific tensor calculations. Model parallelism, which stores an LLM across multiple GPUs, is both expensive and hard. This makes it important to look at techniques to fit an LLM in a single GPU.

Let’s say the foundation models are available such that no further training is needed and one (just) wants to run inference against them. Inferencing is not a small challenge, and a number of techniques have been explored. Here’s a link – https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ which discusses

  • student-teacher knowledge distillation training, leading to DistilBERT
  • quantization, quantization-aware training, post-training quantization
  • pruning
  • architectural optimization, efficient transformers
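
As one concrete instance of the quantization item above, here is a minimal sketch of post-training, symmetric, per-tensor int8 quantization. It is an illustration under simplifying assumptions, not the method of any particular library: the weight matrix is random, there is a single scale for the whole tensor, and the function names are hypothetical.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization to int8; returns (q, scale)."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_error = float(np.max(np.abs(w - w_hat)))
print(q.nbytes / w.nbytes)                  # 0.25
print(max_error <= scale / 2 + 1e-5)        # rounding error is at most half a step
```

The int8 copy uses a quarter of the memory of the float32 original, at the cost of a rounding error bounded by half the quantization step.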

A GoPenAI blog post (a Medium publication, not OpenAI) on speeding up LLMs and scaling them to 100k context windows – https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c

High-throughput Generative Inference of Large Language Models with a Single GPU https://arxiv.org/pdf/2303.06865.pdf, discusses 3 strategies with a focus on a single GPU.

  • model compression
  • collaborative inference
  • offloading to utilize memory from CPU and disk

They then show 3 contributions

  • definition of the optimization search space for offloading, including weights, activations, KV cache, and an algorithm to get an optimal offloading strategy within the search space
  • quantization of the parameters to 4 bits with small loss of accuracy
  • running an OPT-175B model on a single T4 GPU with 16GB memory (!)
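
A back-of-the-envelope calculation shows why offloading is needed even after 4-bit quantization. This sketch counts only the weights (no activations or KV cache) and uses decimal gigabytes; the helper name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 175e9                              # OPT-175B
fp16_gb = weight_memory_gb(params, 16)
int4_gb = weight_memory_gb(params, 4)
gpu_gb = 16                                 # T4 memory, from the text above

print(fp16_gb, int4_gb)                     # 350.0 87.5
print(int4_gb > gpu_gb)                     # True
```

Even at 4 bits per parameter the weights alone are several times larger than the GPU memory, so most of them have to live in CPU memory or on disk.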

PEFT – Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning – https://arxiv.org/pdf/2303.15647.pdf says “expanding the context size leads to a quadratic increase in inference costs”.

There are three main classes of PEFT methods:

  • Addition-based (within additive methods, the paper distinguishes two large groups: adapter-like methods and soft prompts),
  • Selection-based, and
  • Reparametrization-based.
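
To give a feel for the reparametrization-based class, here is a minimal sketch of a low-rank update in the style of LoRA, a method not named in the text above and used here only as an example. The frozen weight W is left untouched and only two small matrices are trained. The sizes, the rank r = 8, and the random values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8                # r is an assumed low rank
W = rng.normal(size=(d_in, d_out))          # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01       # trainable
B = np.zeros((r, d_out))                    # trainable, zero-initialised

def forward(x):
    """Frozen path plus low-rank update: x W + (x A) B."""
    return x @ W + (x @ A) @ B

x = rng.normal(size=(3, d_in))
full_params = W.size
lora_params = A.size + B.size

print(full_params, lora_params)             # 262144 8192
print(np.allclose(forward(x), x @ W))       # True
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and fine-tuning touches about 3% as many parameters as updating W directly.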

General strategies for inference concurrency, courtesy of ChatGPT:

To process multiple concurrent inference requests without interference between them, a model can use techniques such as parallelization and batching.

Parallelization involves splitting the workload across multiple processing units, such as CPUs or GPUs, so that multiple requests can be processed simultaneously without interfering with each other. This can be achieved using frameworks such as TensorFlow or PyTorch, which provide support for parallel processing.

Batching involves grouping multiple requests together and processing them as a single batch. This can increase the efficiency of the model by reducing the overhead associated with processing each request individually. Batching can be particularly effective for models that are optimized for throughput rather than latency.

Another technique that can be used is dynamic scheduling, which involves assigning resources to requests based on their priority and the availability of resources at a given time. This can help ensure that high-priority requests are processed quickly without interfering with lower-priority requests.
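
The batching idea above can be illustrated with a toy model. In this sketch the "model" is a single random linear layer and the requests are random vectors, both assumptions made purely for illustration; a real LLM server would also need padding and per-request state.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))                        # stand-in for a model: one linear layer
requests = [rng.normal(size=16) for _ in range(8)]  # eight independent requests

one_by_one = [x @ W for x in requests]              # one matrix-vector product per request
batch = np.stack(requests)                          # shape (8, 16)
batched = batch @ W                                 # a single matrix-matrix product

print(batched.shape)                                # (8, 4)
print(np.allclose(np.stack(one_by_one), batched))   # True
```

Each row of the batched result matches the corresponding individual result, so requests do not interfere with one another while sharing one larger, more efficient operation.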

“Efficiently Scaling Transformer Inference” is a paper from Google (Nov ’22) discussing the partitioning of weights and activations across multiple heads and multiple chips.

Feature Vectors, Embeddings, Vector Databases, Feature Stores

An ML model consists of a set of weights (or a set of numerical values) that transform inputs to outputs (along with a nonlinear transform such as a sigmoid function). The weights are often organized as vectors or matrices. Consider neural networks, decision trees and support vector machines as types of ML models for this discussion.

The weights representing features of the data (input or intermediate data) are also called feature vectors or vectors. They are also called embeddings, that is embeddings of vectors in a vector space. We discussed such vectors in https://securemachinery.com/2019/05/24/transformer-gpt-2/.

The term “embedding” comes from the idea that the vectors “embed” the original data into a lower-dimensional space. The embedding process involves a combination of statistical and computational techniques, such as factorization and neural networks, that learn to map the input data into the vector space in a way that preserves the relevant properties of the original data.

The use of vectors to represent words in machine learning research started in 2013 with the publication of the paper “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov et al. This paper introduced the word2vec algorithm, which generates dense vector representations of words based on their distributional properties in a large corpus of text. The size of the vector or embedding in a word embedding model is a hyperparameter that needs to be determined before training the model. It is typically chosen based on the size of the vocabulary and the complexity of the task at hand. In practice, the vector size is often set to be between 100 and 300 dimensions, but this can vary depending on the specific application and the available computational resources. The optimal vector size can be determined through experimentation and tuning of hyperparameters.

One difference between embeddings and feature vectors is that embeddings are typically learned automatically from the data, while feature vectors are typically chosen based on domain knowledge or feature engineering. However these two terms are often used interchangeably. Here is a video going over how the embeddings are obtained from words in a sentence with a bag of words approach- https://www.youtube.com/watch?v=viZrOnJclY0 .

Pinecone, Milvus, Facebook AI Similarity Search (FAISS), Google Vertex Matching engine are examples of Vector databases.

The challenge in implementing a vector database is that traditional databases are not optimized for handling high-dimensional vector data, which is often used in machine learning and data science applications.

Vector data is typically represented as arrays of numbers, where each number represents a feature or attribute of the data. For example, an image might be represented as a high-dimensional vector where each dimension represents the color value of a specific pixel. In contrast to traditional databases, where each record consists of a set of fields or columns, vector databases need to store and index large volumes of high-dimensional data in a way that supports efficient similarity search.

In traditional databases, queries are typically based on simple comparisons of scalar values, such as equality or range queries. However, in vector databases, similarity search is the primary operation, which requires specialized algorithms and data structures to efficiently compute the similarity between vectors. These algorithms are designed to handle high-dimensional data and minimize the amount of computation needed to compare vectors, which can be computationally expensive.

There are several specialized algorithms that are commonly used in vector databases to support efficient similarity search. Here are some examples:

  1. Euclidean Distance: This is a distance metric that measures the straight-line distance between two points in Euclidean space. It is commonly used in vector databases to compute the distance or similarity between vectors.
  2. Cosine Similarity: This is a similarity metric that measures the cosine of the angle between two vectors. It is commonly used in text-based applications to measure the similarity between documents or word embeddings.
  3. Locality-Sensitive Hashing (LSH): This is a technique used to hash high-dimensional vectors into lower-dimensional buckets based on their similarity. It is commonly used in vector databases to speed up similarity search by reducing the number of comparisons needed to find similar vectors.
  4. Product Quantization: This is a technique used to divide high-dimensional vectors into smaller subvectors and quantize them separately. It is commonly used in vector databases to reduce the dimensionality of the data and speed up similarity search.
  5. Inverted Indexing: This is a technique used to index the vectors based on the values of their individual dimensions. It is commonly used in text-based applications to speed up search queries by indexing the terms in the document.
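
The first two items in the list are simple enough to write out directly. The following is a minimal sketch in plain Python, with made-up two-dimensional vectors and a hypothetical `nearest` helper that does a brute-force scan. Techniques such as LSH and product quantization exist to avoid exactly this full scan when there are very many vectors.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors):
    """Brute-force search: index of the stored vector closest to the query."""
    return min(range(len(vectors)),
               key=lambda i: euclidean_distance(query, vectors[i]))

stored = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]  # toy "database"
query = [0.9, 0.1]
print(nearest(query, stored))
```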

Pinecone provides several indexing and search algorithms, including approximate nearest neighbor search, that are selected automatically based on the properties of the data and the search requirements. However, you can also specify a specific algorithm or tuning parameters when creating an index or performing a query by passing in the appropriate arguments. For example, you can use the method parameter when creating an index to specify the indexing method, or the distance parameter when performing a query to specify the distance metric to use.

While OpenSearch is not specifically designed as a vector database like Pinecone, it provides vector search capabilities through its k-NN plugin. The plugin supports exact k-nearest neighbor (k-NN) search, which finds the k vectors closest to a query vector in a high-dimensional space, as well as approximate nearest neighbor search using engines such as NMSLIB, Faiss and Lucene (built on algorithms such as HNSW). To use vector search in OpenSearch, you first index your vector data in a field of type knn_vector with a fixed dimension. You can then perform a nearest neighbor search by specifying the query vector and the number of nearest neighbors to return. OpenSearch also provides support for vector scoring, which allows you to rank, boost or filter search results based on their similarity to a query vector.

What kind of vectorization schemes are useful for log processing ?

When processing log data, the goal is typically to extract useful information from the log entries and transform them into a format that can be easily analyzed and searched. Vectorization is a common technique used for this purpose, and there are several vectorization schemes that are applicable to log processing. Here are some examples:

  1. Bag-of-words: This is a vectorization scheme that represents a document as a bag of words, where each word is represented by a dimension in the vector and the value of the dimension is the frequency of the word in the document. Bag-of-words can be used to represent log entries as a vector of words, which can be used for tasks such as text classification and anomaly detection.
  2. TF-IDF: This is a vectorization scheme that represents a document as a weighted combination of its term frequency and inverse document frequency. TF-IDF can be used to represent log entries as a vector of weighted words, which can be used for tasks such as information retrieval and text mining.
  3. Word embeddings: This is a vectorization scheme that represents words as dense vectors in a high-dimensional space, where the distance between vectors reflects the semantic similarity between the words. Word embeddings can be used to represent log entries as a vector of word embeddings, which can be used for tasks such as text classification and entity recognition.
  4. Sequence embeddings: This is a vectorization scheme that represents a sequence of words as a dense vector in a high-dimensional space, where the distance between vectors reflects the similarity between the sequences. Sequence embeddings can be used to represent log entries as a vector of sequence embeddings, which can be used for tasks such as sequence classification and anomaly detection.
  5. One-hot encoding: This is a vectorization scheme that represents categorical data as binary vectors, where each dimension corresponds to a possible category and the value of the dimension is 1 if the data belongs to that category and 0 otherwise. One-hot encoding can be used to represent log entries as a vector of categorical features, which can be used for tasks such as classification and clustering.

By using a suitable vectorization scheme, log data can be transformed into a format that can be easily analyzed and searched, enabling tasks such as anomaly detection, root cause analysis, and performance optimization.
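
As a rough illustration of the first two schemes, here is a small sketch that turns three made-up log lines into bag-of-words and TF-IDF vectors using only the standard library. The log lines and function names are assumptions for the example, and the IDF formula is the plain log(N / df) form; real libraries often use smoothed variants.

```python
import math
from collections import Counter

logs = [
    "error disk full on node1",
    "info job finished on node2",
    "error disk failure on node1",
]

tokenized = [line.split() for line in logs]
vocab = sorted({word for tokens in tokenized for word in tokens})

def bag_of_words(tokens):
    """One dimension per vocabulary word; the value is the word's count."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def idf(word):
    """Inverse document frequency: rarer words get a larger weight."""
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / df)

def tf_idf(tokens):
    """Term frequency weighted by inverse document frequency."""
    return [count * idf(word)
            for count, word in zip(bag_of_words(tokens), vocab)]

bow_vectors = [bag_of_words(t) for t in tokenized]
tfidf_vectors = [tf_idf(t) for t in tokenized]
print(vocab)
print(bow_vectors[0])
```

Words that appear in every entry (here "on") get a TF-IDF weight of zero, while words unique to one entry (such as "full") get the largest weight, which is what makes TF-IDF useful for spotting unusual log lines.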

Vector database versus Feature store – what’s the difference ?

Both vector databases and feature stores are used to manage and serve high-dimensional data, such as embeddings, vectors, and other numerical representations, but there are some key differences between the two.

A vector database is a database optimized for storing and querying high-dimensional vector data. It provides efficient indexing and search algorithms, such as approximate nearest neighbor search, that allow for fast and scalable similarity search. Vector databases are commonly used in machine learning applications, such as recommendation systems and natural language processing, where the goal is to find similar items or entities based on their vector representations.

A feature store, on the other hand, is a centralized repository for machine learning features that provides a way to store, manage, and share feature data across different applications and teams. It is designed to help data scientists and machine learning engineers build, test, and deploy machine learning models more efficiently by providing a unified interface for accessing and managing features.

While both vector databases and feature stores can store and serve high-dimensional data, the main difference is their focus and use case. Vector databases are designed for efficient similarity search, while feature stores are designed for feature management and sharing across different applications and teams. In practice, they can complement each other in many machine learning workflows, with the vector database providing the efficient similarity search capabilities and the feature store providing a centralized and standardized way to manage and share feature data.

Comparison of Milvus Pinecone Vespa Weaviate Vald GSI Qdrant – https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696

Anyscale – Using an embeddings database to train an LLM using Ray – https://www.anyscale.com/blog/llm-open-source-search-engine-langchain-ray

OpenAI embeddings example – https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

HuggingFace sentence embeddings article – https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a

AWS – https://medium.com/@shankar.arunp/augmenting-large-language-models-with-verified-information-sources-leveraging-aws-sagemaker-and-f6be17fb10a8

EC2 P5 UltraClusters

Each EC2 P5 instance has

  • eight NVIDIA H100 GPUs capable of 16 petaFLOPs of mixed-precision performance
  • 640 GB of high-bandwidth memory, 80GB in each GPU
  • 3,200 Gbps networking connectivity (8x more than the previous generation)

The increased performance of P5 instances accelerates the time-to-train machine learning (ML) models by up to 6x (reducing training time from days to hours), and the additional GPU memory helps customers train larger, more complex models.

P5 instances are expected to lower the cost to train ML models by up to 40% over the previous generation, providing customers greater efficiency over less flexible cloud offerings or expensive on-premises systems.

https://nvidianews.nvidia.com/news/aws-and-nvidia-collaborate-on-next-generation-infrastructure-for-training-large-machine-learning-models-and-building-generative-ai-applications

Nvidia H100 GPU overview and data sheet – https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

Diagram of P4d UltraClusters

P4d consists of 8 A100 GPUs, with 40GB GPU Memory each

P4de consists of 8 A100 80GB GPUs, with 80GB GPU memory each

Nvidia blog on HGX baseboard supporting 8 A100 GPUs – https://developer.nvidia.com/blog/introducing-hgx-a100-most-powerful-accelerated-server-platform-for-ai-hpc/

A100 80GB data sheet – https://www.nvidia.com/en-us/data-center/a100/

MIG support in A100 – https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/ and MIG user guide – https://docs.nvidia.com/datacenter/tesla/mig-user-guide

MIG support in AWS EC2 instance type P4d and in AWS EKS – https://developer.nvidia.com/blog/amazon-elastic-kubernetes-services-now-offers-native-support-for-nvidia-a100-multi-instance-gpus/

GCP A2 adds 16 A100 GPUs to a node – https://cloud.google.com/blog/products/compute/announcing-google-cloud-a2-vm-family-based-on-nvidia-a100-gpu

https://cloud.google.com/blog/products/containers-kubernetes/gke-now-supports-multi-instance-gpus

Running more pods/gpu on EKS with MIG – https://medium.com/itnext/run-more-pods-per-gpu-with-nvidia-multi-instance-gpu-d4f7fb07c9b5

Nvidia Embraces The CPU World With “Grace” Arm Server Chip

EC2 Trainium UltraClusters

Each EC2 Trn1 instance has

  • up to 16 AWS Trainium accelerators purpose built to accelerate DL training and deliver up to 3.4 petaflops of FP16/BF16 compute power. Each accelerator includes two second-generation NeuronCores
  • 512 GB of shared accelerator memory (HBM) with 9.8 TB/s of total memory bandwidth
  • 1600 Gbps of Elastic Fabric Adapter (EFAv2)

An EC2 Trn1 UltraCluster consists of densely packed, co-located racks of Trn1 compute instances interconnected by non-blocking petabyte-scale networking. According to AWS, it is its largest UltraCluster to date, offering 6 exaflops of compute power on demand with up to 30,000 Trainium chips.

https://aws.amazon.com/blogs/machine-learning/scaling-large-language-model-llm-training-with-amazon-ec2-trn1-ultraclusters/

Weights vs Activations

Why do Activations need more bits (16) than weights (8) ? source – https://stackoverflow.com/questions/72397839/why-do-activations-need-more-bits-16bit-than-weights-8bit-in-tensor-flows-n

Answer:

Activations are actual signals propagating through the network. They have nothing to do with activation function, this is just a name collision. They are higher accuracy because they are not part of the model, so they do not affect storage, download size, or memory usage, as if you are not training your model you never store activations beyond the current one.

For example, for an MLP (multilayer perceptron) we have something along the lines of

a_1 = relu(W_1 x + b_1)
a_2 = relu(W_2 a_1 + b_2)
...
a_n = W_n a_(n-1) + b_n

where each W and b will be 8-bit parameters, and the activations are a_1, …, a_n. The thing is, you only need the previous and current layer: to calculate a_t you just need a_(t-1), not the earlier ones. Consequently, storing them at higher accuracy during computation is just a good tradeoff.
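
A minimal sketch of this inference-time forward pass, in plain Python with made-up toy weights (the `forward` and `affine` helpers are illustrative names). It keeps a single activation vector alive at a time, overwriting it at each layer, which is why activation precision does not add to the stored model size.

```python
def relu(v):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in v]

def affine(W, x, b):
    """Compute W x + b for a weight matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(layers, x):
    """layers is a list of (W, b) pairs; only the current activation is kept."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = affine(W, a, b)
        # The last layer is left linear, matching the final equation above.
        a = z if i == len(layers) - 1 else relu(z)
    return a

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # first layer (toy values)
    ([[2.0, 1.0]], [0.5]),                     # second layer (toy values)
]
output = forward(layers, [3.0, 1.0])
print(output)
```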

Datastore for Activations:

  • During training, activations are typically stored in the GPU’s memory for models trained on GPUs. This is because backpropagation requires these activations for gradient computation. Given that modern deep learning models can have millions to billions of parameters, storing all these activations can be memory-intensive.
  • During inference, you only need to perform a forward pass and don’t need to store all activations, except for the ones necessary for computing subsequent layers. Once an activation has been used to compute the next layer, it can be discarded if not needed anymore.

Hugging Face – AI models and datasets hub

Hugging Face supports around 100,000 pre-trained language models that can be used for various NLP tasks. The Hugging Face transformers library, which is a popular choice for NLP tasks such as text classification and machine translation, currently supports over 100 pre-trained language models. These models include popular models such as BERT, GPT-2, and RoBERTa. In addition Hugging Face provides tools and libraries that allow users to fine-tune and customize these models for specific tasks or datasets.

The datasets can be loaded using the python datasets package (pip install datasets). An overview is here.

A Hugging Face Course – https://github.com/huggingface/course

Hugging Face on AWS blog – https://aws.amazon.com/blogs/machine-learning/aws-and-hugging-face-collaborate-to-simplify-and-accelerate-adoption-of-natural-language-processing-models/

CEO Clement Delangue calls it the “GitHub of machine learning.” It was its emphasis on an open, collaborative approach that made investors confident in the company’s $2 billion valuation, he said. “That’s what is really important to us, makes us successful and makes us different from others in the space.”

DistilBERT is a smaller, faster, and cheaper version of the BERT language model, developed by Hugging Face by training a ‘student model’ to reproduce the behaviour of a ‘teacher model’ through the loss function (knowledge distillation). It bucks the trend towards larger models and instead focuses on training a more efficient model. It has been “distilled” to reduce its size and computational requirements, making it faster to train and more efficient to run. Despite being about 40% smaller than BERT, DistilBERT retains about 97% of BERT’s language-understanding performance while running about 60% faster, according to its authors. The triple loss function combines a distillation loss, a training (masked language modeling) loss and a cosine-distance loss.

Examples of models available on the Hugging Face platform, several of which are generative, include:

  1. GPT-2: GPT-2 (Generative Pre-trained Transformer 2) is a large-scale language model developed by OpenAI that can be used for tasks such as text generation.
  2. BERT: BERT (Bidirectional Encoder Representations from Transformers) is an encoder language model developed by Google that can be used for tasks such as text classification and question answering. It is not designed for free-form text generation.
  3. RoBERTa: RoBERTa (Robustly Optimized BERT Pretraining Approach) is a language model developed by Facebook that is based on the BERT model and can be used for tasks such as text classification.
  4. T5: T5 (Text-To-Text Transfer Transformer) is a language model developed by Google that can be used for tasks such as language translation and text summarization.
  5. DistilBERT, described above. Like BERT, it is an encoder that is typically fine-tuned on a specific task, such as text classification, using a dataset that is relevant to the task. For generating text from a prompt, a distilled generative model such as DistilGPT2 (used in the example below) is the usual choice: you provide a prompt or seed text and let the model predict the next word or sequence of words.

Docs on text generation – https://huggingface.co/transformers/v3.1.0/main_classes/model.html?highlight=generate

Here’s an example of using transformers to generate some text.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

# Encode the prompt as a PyTorch tensor of token ids
input_context_prompt = "Men on the moon "
input_ids = tokenizer.encode(input_context_prompt, return_tensors='pt')

# Generate 10 candidate continuations by sampling
outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.9, num_return_sequences=10, do_sample=True, pad_token_id=tokenizer.eos_token_id)

# Decode and print each generated sequence
for i, output in enumerate(outputs):
    print('Generated {}: {}'.format(i, tokenizer.decode(output, skip_special_tokens=True)))

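Note the temperature parameter during model.generate(). As the temperature approaches zero, the generation process almost always chooses the most likely next word. A higher temperature allows for less likely words to be included in the generation process. (For strictly greedy decoding in transformers, set do_sample=False rather than passing a temperature of zero.)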
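
To make the effect of temperature concrete, here is a small standalone sketch (not the transformers implementation) that divides some made-up token scores by the temperature before applying softmax. The logits and the function name are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature must be positive."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.1)   # low temperature
base = softmax_with_temperature(logits, 1.0)    # unchanged distribution
flat = softmax_with_temperature(logits, 10.0)   # high temperature
print(sharp, base, flat)
```

At low temperature almost all of the probability mass sits on the top-scoring token; at high temperature the three probabilities move towards each other, so sampling picks unlikely tokens more often.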

Distributed Training – Parameter server, Data and Model parallelism

Distributed training aims to reduce the time to train a model in machine learning by splitting the training workload across multiple nodes. It has gained in importance as data sizes, model sizes and complexity of training have grown. Training consists of iteratively minimizing an objective function by running the data through a model and determining a) the error (forward path) and b) the gradients with which to adjust the model parameters, and the updated model parameters computed from those gradients (reverse path). The reverse path always requires synchronization between the nodes, and in some cases the forward path also requires such communication.

There are three approaches to distributed training – data parallelism, model parallelism and data-model parallelism. Data parallelism is the more common approach and is preferred if the model fits in GPU memory (which is increasingly hard for large models).

In data parallelism, we partition the data onto different GPUs and run the same model on these data partitions. The same model is present in all GPU nodes and no communication between nodes is needed on the forward path. The calculated gradients are sent to a parameter server, which averages them, and the updated parameters are retrieved by all the nodes so that every node holds the same incrementally updated model.
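
A toy simulation of this scheme, under simplifying assumptions: a one-parameter model y = w * x, made-up data, two equal-sized shards standing in for two GPUs, and the "parameter server" reduced to a plain average. Each worker computes a gradient on its own shard, the gradients are averaged, and every worker applies the same update.

```python
def gradient(w, data):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy data, true w = 2
shards = [data[:2], data[2:]]  # one equal-sized shard per worker

w = 0.0    # every worker starts from the same parameter value
lr = 0.05  # learning rate
for step in range(50):
    # Each worker computes a gradient on its own shard (in parallel in practice).
    local_grads = [gradient(w, shard) for shard in shards]
    # The averaging step (parameter server or allreduce).
    avg_grad = sum(local_grads) / len(local_grads)
    # Every worker applies the same update, so the replicas stay identical.
    w -= lr * avg_grad

print(w)
```

With equal-sized shards, the average of the per-shard gradients equals the gradient over the full dataset, so the result matches single-node training on all the data.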

In model parallelism, we partition the model itself into parts and run these on different GPUs. This applies to large models such as large language models (LLMs) that do not fit in a single GPU.

A paper on parameter servers is Scaling Distributed Machine Learning with the Parameter Server.

To communicate the intermediate results between nodes the MPI primitives are leveraged, including AllReduce.

The amount of training data for BERT is ~600GB. BERT-Tiny model is 17MB, BERT-Base model is ~400MB. During training a 16GB memory GPU sees an OOM error.

Some links to resources –

https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/

https://github.com/horovod/horovod/blob/master/docs/concepts.rst (Horovod, an open source distributed training framework based on allreduce rather than a parameter server).

https://medium.com/pytorch/how-lyft-uses-pytorch-to-power-machine-learning-for-their-self-driving-cars-80642bc2d0ae

https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

https://aws.amazon.com/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/

https://openai.com/blog/scaling-kubernetes-to-2500-nodes/

https://towardsdatascience.com/distributed-deep-learning-training-with-horovod-on-kubernetes-6b28ac1d6b5d

https://mccormickml.com/2019/11/05/GLUE/ Origin of General Language Understanding Evaluation.

https://github.com/google-research/bert

https://towardsdatascience.com/model-parallelism-in-one-line-of-code-352b7de5645a

https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/

Horovod core principles are based on the MPI concepts size, rank, local rank, allreduce, allgather, and broadcast. These are best explained by example. Say we launched a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:

  • Size would be the number of processes, in this case, 16.
  • Rank would be the unique process ID from 0 to 15 (size – 1).
  • Local rank would be the unique process ID within the server from 0 to 3.
  • Allreduce is an operation that aggregates data among multiple processes and distributes results back to them. Allreduce is used to average dense tensors.
  • Allgather is an operation that gathers data from all processes in a group then sends data back to every process. Allgather is used to collect values of sparse tensors.
  • Broadcast is an operation that broadcasts data from one process, identified by root rank, onto every other process.

The MPI Tutorial has illustrations of Allreduce, Allgather and Broadcast.
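
The concepts in this list can be mimicked in a single Python process. This is only a simulation with hypothetical helper functions, not Horovod or MPI code: each list position stands for one process, and the local rank formula assumes ranks are assigned consecutively within each server.

```python
servers, gpus_per_server = 4, 4
size = servers * gpus_per_server       # 16 processes in total
ranks = list(range(size))              # unique ids 0..15
# Assumption: ranks are assigned server by server, so local rank is rank mod 4.
local_ranks = [rank % gpus_per_server for rank in ranks]

def allreduce_average(values):
    """Every process ends up with the average of all processes' values."""
    avg = sum(values) / len(values)
    return [avg for _ in values]

def allgather(values):
    """Every process ends up with the full list of all processes' values."""
    return [list(values) for _ in values]

def broadcast(values, root_rank):
    """Every process ends up with the value held by the root process."""
    return [values[root_rank] for _ in values]

per_process = [float(rank) for rank in ranks]  # pretend each process holds its rank
print(allreduce_average(per_process)[0])
print(broadcast(per_process, root_rank=2)[0])
```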

Horovod switched from using MPI to using NCCL (NVIDIA Collective Communications Library) for distributing initial weights and biases, and intermediate weights and biases after each training step.

NCCL is a library that provides primitives for communication between multiple GPUs both within a node and across different nodes.

Horovod continues to use MPI for other functions that do not involve inter-GPU communication, such as informing processes on different nodes of their id (aka rank), master vs non-master status for coordination between processes and for sharing the total number of nodes.

NVidia NCCL uses NVLink which is the hardware interconnect that connects multiple GPUs.

NVLink is a high-speed, point-to-point interconnect technology developed by NVIDIA that is designed to enable high-bandwidth communication between processors, GPUs, and other components in a system.

NVLink 1.0, which was introduced in 2016 with the Pascal P100, provides 40 GB/s of bidirectional bandwidth per link (20 GB/s in each direction), for a total of 160 GB/s across the four links of a P100.

NVLink 2.0, which was introduced in 2017 with the Volta V100, provides 50 GB/s of bidirectional bandwidth per link, for a total of 300 GB/s across the six links of a V100. This represents a significant increase in bandwidth compared to NVLink 1.0, and allows for even faster data transfer rates between devices.

NVLink 3.0, which was introduced in 2020 with the Ampere A100, keeps 50 GB/s of bidirectional bandwidth per link but doubles the number of links to twelve, for a total of 600 GB/s per GPU.

NVidia Volta GPU vs Google TPU

A Graphics Processing Unit (GPU) allows multiple hardware processors to act in parallel on a single array of data, allowing a divide and conquer approach to large computational tasks such as video frame rendering, image recognition, and various types of mathematical analysis including convolutional neural networks (CNNs). The GPU is typically placed on a larger chip which includes CPU(s) to direct data to the GPUs. This trend is making supercomputing tasks much cheaper than before.

Tesla_v100 is a System on Chip (SoC) which contains the Volta GPU which contains TensorCores, designed specifically for accelerating deep learning, by accelerating the matrix operation D = A*B+C, each input being a 4×4 matrix.  More on Volta at https://devblogs.nvidia.com/parallelforall/inside-volta/ . It is helpful to read the architecture of the previous Pascal P100 chip which contains the GP100 GPU, described here – http://wccftech.com/nvidia-pascal-specs/ .  Background on why NVidia builds chips this way (SIMD < SIMT < SMT) is here – http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html .

Volta GV100 GPU = 6 GraphicsProcessingClusters x  7 TextureProcessingCluster/GraphicsProcessingCluster x 2 StreamingMultiprocessor/TextureProcessingCluster x (64 FP32Units +64 INT32Units + 32 FP64Units +8 TensorCoreUnits +4 TextureUnits)

The FP32 cores are referred to as CUDA cores, which means 84×64 = 5376 CUDA cores per Volta GPU. The Tesla V100, which is the first product (SoC) to use the Volta GPU, uses only 80 of the 84 SMs, or 80×64 = 5120 cores. The frequency of the chip is 1.455 GHz. The Fused-Multiply-Add (FMA) instruction does a multiplication and addition in a single instruction (a*b+c), resulting in 2 FP operations per instruction, giving 1.455×2×5120 = 14.9 TFLOPS from the CUDA cores alone. The TensorCores do a 3D multiply-and-add with 7×4×4 + 4×4 = 128 FP ops/cycle (7 operations, 4 multiplies and 3 adds, for each of the 16 elements of A*B, plus 16 adds for C), for a total of 1.455×80×8×128 ≈ 120 TFLOPS for deep learning apps.
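
The arithmetic in this paragraph can be checked with a few lines of Python. The variable names are just labels for the figures quoted above; nothing here is measured.

```python
clock_ghz = 1.455          # chip frequency in GHz
sms = 80                   # SMs enabled on the Tesla V100 (84 on the full GV100)
fp32_cores_per_sm = 64
tensor_cores_per_sm = 8

cuda_cores = sms * fp32_cores_per_sm              # 5120
# One FMA per core per cycle counts as 2 floating-point operations.
fp32_tflops = clock_ghz * 2 * cuda_cores / 1000

# 4x4 matrix multiply-add: 7 ops for each of 16 product elements, plus 16 adds for C.
ops_per_tensor_core = 7 * 4 * 4 + 4 * 4           # 128 FP ops per cycle
tensor_tflops = clock_ghz * sms * tensor_cores_per_sm * ops_per_tensor_core / 1000

print(round(fp32_tflops, 1), round(tensor_tflops, 1))
```

The tensor core figure comes out at about 119 TFLOPS, which is rounded to 120 in the text.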

[Figure: 3D matrix multiplication]

The Volta GPU uses a 12nm manufacturing process, down from 16nm for Pascal. For comparison, the Jetson TX1 claims 1 TFLOPS and the TX2 twice that (or the same performance with half the power of the TX1). Volta will be available on Azure, AWS and platforms such as Facebook. Several applications at Amazon will use it, as will the Microsoft Cognitive Toolkit.

For comparison, the Google TPU runs at 700 MHz and is manufactured with a 28nm process. Instead of FP operations, it uses quantization to integers and a systolic array approach to minimize the watts per matrix multiplication, and optimizes for neural network calculations instead of more general GPU operations. The TPU uses a design based on an array of 256×256 multiply-accumulate (MAC) units, resulting in 92 tera integer ops/second.
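
The 92 tera-ops figure follows from the array size and clock rate, assuming each multiply-accumulate unit is counted as two integer operations per cycle (one multiply and one add). A quick check:

```python
mac_units = 256 * 256   # 65,536 multiply-accumulate units
ops_per_mac = 2         # assumption: one multiply plus one add per cycle
clock_ghz = 0.7         # 700 MHz

tera_ops = mac_units * ops_per_mac * clock_ghz / 1000
print(round(tera_ops, 2))
```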

Given that NVidia is targeting additional use cases such as computer vision and graphics rendering along with neural network use cases, the TPU’s specialized integer-only approach would not make sense for its GPUs.

Miscellaneous conference notes:

Nvidia DGX-1. “Personal Supercomputer” for $69000 was announced. This contains eight Tesla_v100 accelerators connected over NVLink.

Tesla. FHHL, Full Height, Half Length. Inferencing. Volta is ideal for inferencing, not just training. Also for data centers. Power and cooling use 40% of the datacenter.

As AI data floods the data centers, Volta can replace 500 CPUs with 33 GPUs.
Nvidia GPU cloud. Download the container of your choice. First hybrid deep learning cloud network. Nvidia.com/cloud . Private beta extended to GTC attendees.

Containerization with GPU support. Host has the right NVidia driver. Docker from GPU cloud adapts to the host version. Single docker. nvidia-docker tool to initialize the drivers.

Moore’s law comes to an end. Need AI at the edge, far from the data center. Need it to be compact and cheap.

Jetson board had a Tegra SoC chip which has 6 CPUs and a Pascal GPU.

AWS Device Shadows vs GE Digital Twins. Different focus. Availability + connectivity vs operational efficiency. Manufacturing perspective vs operational perspective. Locomotive may be simulated when disconnected.

DeepInstinct analysed malware data using convolutional neural networks on GPUs, to better detect malware and its variations.

Omni.ai – deep learning for time series data to detect anomalous conditions on sensors on the field such as pressure in a gas pipeline.

GANS applications to various problems – will be refined in next few years.

GeForce 960 video card. Older but popular card for gamers, used the Maxwell GPU, which is older than Pascal GPU.

Cooperative Groups in Cuda9. More on Cuda9.

Neural Network learning

Notes from Autonomous Driving Udacity course.

Weights

Embedded Neural Nets

A key problem for embedded neural networks is reduction of size and power consumption.

The hardware on which the neural net runs can be a dedicated chip, an FPGA, a GPU or a CPU. Each of these consumes about 10x the power of the previous choice. But in terms of upfront cost, the dedicated chip costs the most and the CPU the least. An NVidia whitepaper compares GPU with CPU on speed and power consumption. (It discusses key neural networks like AlexNet. AlexNet was a breakthrough in 2012, showing a neural network to be superior to other image recognition approaches by a wide margin.)

Reducing the size of the neural network also reduces its power consumption. For NN size reduction, pruning of the weak connections in the net was proposed in “Learning both Weights and Connections for Efficient Neural Networks” by Song Han and team at NVidia and Stanford. This achieved a roughly 10x reduction in network size without loss of accuracy. Further work in “Deep Compression” achieved a 35x reduction.
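
A minimal sketch of the pruning step, assuming "weak connections" means weights with the smallest absolute value. The function name, the toy weight matrix and the 50% sparsity target are illustrative choices; the paper's approach also retrains the network after pruning, which this sketch leaves out.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [list(row) for row in weights]
    threshold = flat[k - 1]        # largest magnitude that still gets pruned
    # Note: ties at the threshold are all pruned, so slightly more than k may go.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

weights = [
    [0.9, -0.05, 0.3, 0.01],
    [-0.7, 0.02, -0.4, 0.6],
]
pruned = prune_by_magnitude(weights, 0.5)
print(pruned)
```

The zeroed entries can then be stored in a sparse format, which is where the size reduction comes from.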

Today I attended a talk on SqueezeNet by Forrest Iandola. His team at Berkeley modified (squeezed) the original architecture, then applied the Deep Compression technique above to achieve a 461x size reduction over the original, to 0.5 MB. This makes it feasible for mobile applications. The paper also references V. Badrinarayanan’s work on SegNet – a different NN architecture, discussed in a talk earlier this year.

The Nervana acquisition by Intel earlier this year was for a low-power, GPU-like SoC with very high memory bandwidth.