Category: transformers

AlphaFold for protein structure prediction with deep learning – how does it work

AlphaFold is a deep learning model developed by DeepMind that predicts protein structure. It uses a two-step process: first, it builds a representation of the protein’s amino acid sequence; then, it refines this representation to predict the 3D structure of the protein. The model is trained on a large database of known protein structures and uses an attention-based neural network architecture (the Evoformer and structure module of AlphaFold 2, which replaced the convolutional networks of the original AlphaFold) to make these predictions. Attention mechanisms let it incorporate information from multiple parts of the protein sequence during the prediction process, combining advanced machine learning techniques with protein structure data to make accurate predictions about protein folding.

Attention mechanisms are a key component of AlphaFold and play a crucial role in capturing dependencies between different parts of a protein sequence. To understand attention mechanisms, let’s break it down step by step.

  1. Embedding the Protein Sequence:
    AlphaFold starts by embedding the amino acid sequence of a protein into a numerical representation. Each amino acid is represented as a vector, and these vectors are combined to form the input sequence matrix, X ∈ ℝ^(L×D), where L is the length of the sequence and D is the dimensionality of each amino acid vector.
  2. Creating Query, Key, and Value Matrices:
    AlphaFold then generates three matrices – Query (Q), Key (K), and Value (V) – by linearly transforming the input sequence matrix X. This transformation is performed using learnable weight matrices WQ, WK, and WV. The resulting matrices are Q = XWQ, K = XWK, and V = XWV, each having dimensions of L×D.
  3. Calculating Attention Weights:
    The attention mechanism computes the similarity between each query vector and each key vector by taking their dot product. This similarity is divided by √(D), and a softmax is applied across each row to obtain attention weights. The attention weights determine how much each key contributes to the final output. Let’s denote the attention weights matrix as A ∈ ℝ^(L×L), where each element A_ij represents the attention weight between the i-th query and the j-th key. The attention weights are calculated as A_ij = softmax_j((Q_i ⋅ K_j) / √(D)), where Q_i is the i-th row of the Query matrix, K_j is the j-th row of the Key matrix, and the softmax is taken over j so that each row of A sums to 1.
  4. Weighted Sum of Values:
    The final step is to compute the weighted sum of the Value matrix using the attention weights, i.e., the matrix product of the attention weights A and the Value matrix V. The resulting matrix C, representing the context or attended representation, is given by C = AV. The context matrix C has dimensions L×D, and each of its rows is a weighted sum of the Value vectors based on the attention weights.

Attention mechanisms in AlphaFold allow the model to capture the relationships and dependencies between different parts of the protein sequence. By assigning attention weights to relevant amino acids, the model can focus on important regions during the prediction process, enabling accurate protein structure predictions.

The dimensions of the matrices involved are as follows:

  • Input Sequence Matrix (X): X ∈ ℝ^(L×D), where L is the length of the protein sequence and D is the dimensionality of each amino acid vector.
  • Query Matrix (Q): Q ∈ ℝ^(L×D), same as the dimensions of X.
  • Key Matrix (K): K ∈ ℝ^(L×D), same as the dimensions of X.
  • Value Matrix (V): V ∈ ℝ^(L×D), same as the dimensions of X.
  • Attention Weights Matrix (A): A ∈ ℝ^(L×L), where each element A_ij represents the attention weight between the i-th query and j-th key.
  • Context Matrix (C): C ∈ ℝ^(L×D), same as the dimensions of X.

So the matrices Q, K, V, X, and C have dimensions L×D, where L represents the length of the protein sequence and D represents the dimensionality of the amino acid vectors. The attention weights matrix A has dimensions L×L, capturing the attention weights between each query and key pair.
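To make these shapes concrete, here is a minimal PyTorch sketch of single-head scaled dot-product attention. The weight matrices WQ, WK, and WV are random stand-ins for learned parameters, and L and D are chosen arbitrarily; AlphaFold’s actual attention modules (e.g., in the Evoformer) are considerably more elaborate.

import torch
import torch.nn.functional as F

L, D = 128, 64                     # sequence length and embedding dimension (arbitrary)
X = torch.randn(L, D)              # embedded protein sequence, X ∈ ℝ^(L×D)

# Learnable projections (random stand-ins here)
WQ, WK, WV = torch.randn(D, D), torch.randn(D, D), torch.randn(D, D)
Q, K, V = X @ WQ, X @ WK, X @ WV   # each L×D

# Attention weights: row-wise softmax of scaled dot products
A = F.softmax(Q @ K.T / D ** 0.5, dim=-1)   # L×L, each row sums to 1

# Context: weighted sum of the Value vectors
C = A @ V                          # L×D

print(A.shape, C.shape)            # torch.Size([128, 128]) torch.Size([128, 64])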

OpenFold is a PyTorch-based reproduction of AlphaFold; a comparison is given at https://wandb.ai/telidavies/ml-news/reports/OpenFold-A-PyTorch-Reproduction-Of-DeepMind-s-AlphaFold–VmlldzoyMjE3MjI5

What are amino acids? Amino acids are organic compounds that serve as the building blocks of proteins. They contain an amino group (-NH2) and a carboxyl group (-COOH) attached to a central carbon atom, along with a specific side chain (R-group). The side chain varies among different amino acids, giving them unique properties.

There are 20 standard amino acids that are commonly found in proteins. Each amino acid has a unique structure and properties, determined by its specific side chain. Some examples include glycine, alanine, valine, leucine, isoleucine, serine, threonine, cysteine, methionine, aspartic acid, glutamic acid, lysine, arginine, histidine, phenylalanine, tyrosine, tryptophan, asparagine, glutamine, and proline.

Amino acids encode proteins through a process called translation. The genetic information stored in DNA is transcribed into messenger RNA (mRNA). The mRNA is then read by ribosomes, which assemble amino acids in a specific sequence according to the instructions provided by the mRNA. This sequence of amino acids forms a polypeptide chain, which then folds into a functional protein with a specific structure and function. The sequence of amino acids in a protein is determined by the sequence of nucleotides in the corresponding mRNA molecule.

Structure of AlphaFold from the Nature paper, and as described here.

https://www.forbes.com/sites/robtoews/2023/07/16/the-next-frontier-for-large-language-models-is-biology/

https://www.nature.com/articles/s41592-023-01924-w A team of researchers led by Peter Kim at Stanford University has performed guided protein evolution using protein language models that were trained on millions of natural protein sequences.

https://aibusiness.com/nlp/meta-lays-off-team-behind-its-protein-folding-model

https://techcrunch.com/2024/06/25/evolutionaryscale-backed-by-amazon-and-nvidia-raises-142m-for-protein-generating-ai/

https://github.com/evolutionaryscale/esm

https://github.com/aws-samples/drug-discovery-workflows

https://github.com/lucidrains/alphafold3-pytorch

https://github.com/google-deepmind/alphafold

https://fold.it/about_foldit

https://build.nvidia.com/explore/biology

LLM evolution – Anthropic, AI21, Cohere, GPT-4

https://github.com/Mooler0410/LLMsPracticalGuide

Source paper – Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Pink branch is encoder only. Green branch is encoder-decoder. Blue branch is decoder-only.

This is consistent with the Generative aspect of the blue branch. But it does not explain the emergent properties at the top of the blue tree.

LLM leaderboard – https://chat.lmsys.org/?leaderboard

Stanford HELM (Holistic Evaluation of Language Models) – https://crfm.stanford.edu/helm/latest/?models=1

Constitutional AI paper from Anthropic – https://arxiv.org/abs/2212.08073

More on emergent properties in links below.

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1

https://openai.com/research/solving-math-word-problems : Autoregressive models, which generate each solution token by token, have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable, as can be seen in the examples provided. We address this problem by training verifiers to evaluate the correctness of model-generated solutions. Verifiers are given many possible solutions, all written by the model itself, and they are trained to decide which ones, if any, are correct.

Language Models are Few-Shot Learners – https://openai.com/research/language-models-are-few-shot-learners

LLM inferencing tools/techniques were discussed here.

LLM Inferencing is hard – tools and techniques

Large Language Models take up a lot of GPU memory, with the larger ones exceeding GPU memory sizes. Space is taken up by the model weights as well as by in-memory, query-specific tensor calculations (activations and the KV cache). Model parallelism to store an LLM across multiple GPUs is both expensive and hard. This makes it important to look at techniques to fit an LLM on a single GPU.

Let’s say the foundation models are available such that no further training is needed and one (just) wants to run inference against them. Inference is not a small challenge, and a number of techniques have been explored. Here’s a link – https://lilianweng.github.io/posts/2023-01-10-inference-optimization/ – which discusses

  • student-teacher knowledge distillation training, leading to DistilBert
  • quantization – quantization-aware training and post-training quantization (a small post-training quantization sketch follows after this list)
  • pruning
  • architectural optimization, efficient transformers
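As a hedged illustration of one of these techniques (not taken from the post above), PyTorch’s post-training dynamic quantization can shrink the linear layers of a Hugging Face model to int8 in a couple of lines; the model name below is just an example.

import torch
from transformers import AutoModel

# Example model; any PyTorch transformer with nn.Linear layers works similarly
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Post-training dynamic quantization: weights stored as int8, activations quantized on the fly
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# quantized_model is a drop-in replacement for CPU inference with a much smaller memory footprint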

A post on speeding up LLMs and using 100k context windows – https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c

High-throughput Generative Inference of Large Language Models with a Single GPU (the FlexGen paper) – https://arxiv.org/pdf/2303.06865.pdf – discusses three strategies with a focus on a single GPU:

  • model compression
  • collaborative inference
  • offloading to utilize memory from CPU and disk

They then describe three contributions:

  • definition of the optimization search space for offloading, including weights, activations, KV cache, and an algorithm to get an optimal offloading strategy within the search space
  • quantization of the parameters to 4 bits with small loss of accuracy
  • running the OPT-175B model on a single T4 GPU with 16 GB of memory (!)
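FlexGen ships its own runtime, but as a loosely related sketch of the offloading idea, the Hugging Face accelerate integration can spill weights to CPU RAM and disk when a model does not fit on the GPU. The model name and folder below are placeholders, and accelerate must be installed.

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" places layers on the GPU first, then CPU RAM, then the offload folder on disk
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",        # placeholder; FlexGen targets much larger OPT models
    device_map="auto",
    offload_folder="offload",   # spill-over location for weights that fit nowhere else
    torch_dtype=torch.float16,
)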

PEFT – Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning – https://arxiv.org/pdf/2303.15647.pdf – notes that “expanding the context size leads to a quadratic increase in inference costs”.

There are three main classes of PEFT methods:

  • Addition-based (within additive methods, two large groups are distinguished: adapter-like methods and soft prompts),
  • Selection-based, and
  • Reparametrization-based (e.g., LoRA; a sketch follows below).
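As a small sketch of a reparametrization-based method, the Hugging Face peft library can wrap a frozen base model with LoRA adapters. The base model and hyperparameters below are illustrative only, not a recommendation from the survey.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained('distilgpt2')

# LoRA: learn low-rank updates to the attention projections while the base weights stay frozen
config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2-style fused attention projection
    fan_in_fan_out=True,        # GPT-2 uses transformers' Conv1D layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters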

General strategies for inference concurrency, courtesy of ChatGPT:

To process multiple concurrent inference requests without interference between them, a model can use techniques such as parallelization and batching.

Parallelization involves splitting the workload across multiple processing units, such as CPUs or GPUs, so that multiple requests can be processed simultaneously without interfering with each other. This can be achieved using frameworks such as TensorFlow or PyTorch, which provide support for parallel processing.

Batching involves grouping multiple requests together and processing them as a single batch. This can increase the efficiency of the model by reducing the overhead associated with processing each request individually. Batching can be particularly effective for models that are optimized for throughput rather than latency.
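A minimal sketch of batched generation with the transformers library is shown below; the prompts are arbitrary, and the left-padding and pad-token settings are needed because decoder-only models like distilgpt2 have no pad token by default.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
tokenizer.padding_side = 'left'            # decoder-only models should be left-padded for generation
tokenizer.pad_token = tokenizer.eos_token  # reuse the end-of-sequence token as the pad token
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

prompts = ["The protein folds into", "Attention mechanisms allow"]
batch = tokenizer(prompts, return_tensors='pt', padding=True)

# One forward pass serves both requests
outputs = model.generate(**batch, max_new_tokens=20)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))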

Another technique that can be used is dynamic scheduling, which involves assigning resources to requests based on their priority and the availability of resources at a given time. This can help ensure that high-priority requests are processed quickly without interfering with lower-priority requests.

Efficiently Scaling Transformer Inference – a paper from Google discussing partitioning of weights and activations across multiple heads and multiple chips (Nov ’22).

Hugging Face – AI models and datasets hub

The Hugging Face Hub hosts on the order of 100,000 pre-trained models that can be used for various NLP tasks. The Hugging Face transformers library, a popular choice for NLP tasks such as text classification and machine translation, currently supports over 100 model architectures, including popular models such as BERT, GPT-2, and RoBERTa. In addition, Hugging Face provides tools and libraries that allow users to fine-tune and customize these models for specific tasks or datasets.

The datasets can be loaded using the python datasets package (pip install datasets). An overview is here.
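For example (the dataset name here is just an illustration):

from datasets import load_dataset

dataset = load_dataset("imdb")     # downloads and caches the dataset from the Hugging Face Hub
print(dataset)                     # DatasetDict with its splits (train/test/unsupervised for imdb)
print(dataset["train"][0])         # a single example, e.g. {'text': ..., 'label': ...}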

A Hugging Face Course – https://github.com/huggingface/course

Hugging Face on AWS blog – https://aws.amazon.com/blogs/machine-learning/aws-and-hugging-face-collaborate-to-simplify-and-accelerate-adoption-of-natural-language-processing-models/

CEO Clement Delangue calls it the “GitHub of machine learning.” It is this emphasis on an open, collaborative approach that made investors confident in the company’s $2 billion valuation, he said. “That’s what is really important to us, makes us successful and makes us different from others in the space.”

DistilBERT is a smaller, faster, and cheaper version of the BERT language model developed by Hugging Face by controlling the loss function while training a ‘student model’ from a ‘teacher model’. It bucks the trend towards larger models and instead focuses on training a more efficient one. It has been “distilled” to reduce its size and computational requirements, making it faster to train and more efficient to run. Despite being smaller than BERT, DistilBERT achieves comparable performance on many NLP tasks. The triple loss function combines a distillation loss, the usual masked-language-modeling training loss, and a cosine-distance loss that aligns the student’s and teacher’s hidden states.
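A schematic of that triple loss is sketched below. This is not Hugging Face’s exact training code; the temperature value, the equal weighting of the three terms, and the flattened hidden states are simplifying assumptions.

import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels, student_hidden, teacher_hidden, T=2.0):
    # logits: (N, vocab_size); labels: (N,); hidden states flattened to (N, hidden_dim)
    # 1. Distillation loss: match the teacher's softened output distribution
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    loss_distill = F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T

    # 2. Training loss: ordinary masked-language-modeling cross-entropy against the true tokens
    loss_mlm = F.cross_entropy(student_logits, labels)

    # 3. Cosine loss: align the directions of student and teacher hidden states
    target = torch.ones(student_hidden.size(0))
    loss_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)

    return loss_distill + loss_mlm + loss_cos  # equal weights assumed for simplicity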

Examples of generative models available on the Hugging Face platform include:

  1. GPT-2: GPT-2 (Generative Pre-trained Transformer 2) is a large-scale language model developed by OpenAI that can be used for tasks such as language translation and text generation.
  2. BERT: BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google that can be used for tasks such as text classification and question answering.
  3. RoBERTa: RoBERTa (Robustly Optimized BERT Approach) is a language model developed by Facebook that is based on the BERT model and can be used for tasks such as text classification and question answering.
  4. T5: T5 (Text-To-Text Transfer Transformer) is a language model developed by Google that can be used for tasks such as language translation and text summarization.
  5. DistilBERT, described above. Note that DistilBERT itself is an encoder-style model; for text generation, a decoder-style counterpart such as distilgpt2 (used in the example below) is the more natural choice. You would typically fine-tune such a model on a task-relevant dataset, then generate text by providing a prompt or seed text and letting the model predict the next word or sequence of words.

Docs on text generation – https://huggingface.co/transformers/v3.1.0/main_classes/model.html?highlight=generate

Here’s an example of using transformers to generate some text.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

# Encode the prompt
input_context_prompt = "Men on the moon "
input_ids = tokenizer.encode(input_context_prompt, return_tensors='pt')  # encode input context

# Generate text
outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.9, num_return_sequences=10, do_sample=True)  

# Sample candidate outputs and print
for i in range(10): #  10 output sequences were generated
    print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))

Note the temperature parameter during model.generate(). A temperature close to zero makes generation effectively greedy, choosing the most likely next word, while a higher temperature allows less likely words to be included in the generation process.
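For comparison, reusing the model, tokenizer, and input_ids from the snippet above, greedy decoding and a higher-temperature run look like this:

# Greedy decoding: deterministic, always picks the most likely next token
greedy = model.generate(input_ids=input_ids, max_length=40, do_sample=False)
print(tokenizer.decode(greedy[0], skip_special_tokens=True))

# Higher temperature flattens the next-token distribution, admitting less likely words
creative = model.generate(input_ids=input_ids, max_length=40, do_sample=True, temperature=1.5)
print(tokenizer.decode(creative[0], skip_special_tokens=True))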