Tag: transformers

LLM evolution – Anthropic, AI21, Cohere, GPT-4

https://github.com/Mooler0410/LLMsPracticalGuide

Source paper – Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Pink branch is encoder only. Green branch is encoder-decoder. Blue branch is decoder-only.

This is consistent with the Generative aspect of the blue branch. But it does not explain the emergent properties at the top of the blue tree.

LLM leaderboard – https://chat.lmsys.org/?leaderboard

Stanford HELM (holistic evaluation of LMs) – https://crfm.stanford.edu/helm/latest/?models=1

Constitutional AI paper from Anthropic – https://arxiv.org/abs/2212.08073

More on emergent properties in links below.

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1

https://openai.com/research/solving-math-word-problems : Autoregressive models, which generate each solution token by token, have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable, as can be seen in the examples provided. We address this problem by training verifiers to evaluate the correctness of model-generated solutions. Verifiers are given many possible solutions, all written by the model itself, and they are trained to decide which ones, if any, are correct.

Language Models are Few-Shot Learners – https://openai.com/research/language-models-are-few-shot-learners

LLM inferencing tools/techniques were discussed here.

Feature Vectors, Embeddings, Vector Databases, Feature Stores

An ML model consists of a set of weights (or a set of numerical values) that transform inputs to outputs (along with a nonlinear transform such as a sigmoid function). The weights are often organized as vectors or matrices. Consider neural networks, decision trees and support vector machines as types of ML models for this discussion.
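
As a minimal sketch of this idea, here is a single layer of weights plus a sigmoid nonlinearity in numpy (all values are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy "model": a weight matrix, a bias vector and a sigmoid nonlinearity
W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.2]])   # 2 outputs x 3 inputs
b = np.array([0.05, -0.1])

x = np.array([1.0, 2.0, 3.0])       # input feature vector
y = sigmoid(W @ x + b)              # the "forward" transform of input to output
print(y)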

The vectors representing features of the data (input or intermediate data) are called feature vectors. They are also called embeddings, that is, embeddings of the data into a vector space. We discussed such vectors in https://securemachinery.com/2019/05/24/transformer-gpt-2/.

The term “embedding” comes from the idea that the vectors “embed” the original data into a lower-dimensional space. The embedding process involves a combination of statistical and computational techniques, such as factorization and neural networks, that learn to map the input data into the vector space in a way that preserves the relevant properties of the original data.

The use of vectors to represent words in machine learning research took off in 2013 with the publication of the paper “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov et al. This paper introduced the word2vec algorithm, which generates dense vector representations of words based on their distributional properties in a large corpus of text. The size of the vector or embedding in a word embedding model is a hyperparameter that needs to be determined before training the model. It is typically chosen based on the size of the vocabulary and the complexity of the task at hand. In practice, the vector size is often set to between 100 and 300 dimensions, but this can vary depending on the specific application and the available computational resources. The optimal vector size can be determined through experimentation and tuning of hyperparameters.
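
As a sketch of how the embedding size is chosen as a hyperparameter, here is a minimal word2vec example, assuming the gensim library (version 4.x API) and a tiny made-up corpus:

from gensim.models import Word2Vec

# A tiny made-up corpus; in practice this would be millions of sentences
sentences = [
    ["the", "server", "returned", "an", "error"],
    ["the", "request", "to", "the", "server", "succeeded"],
    ["an", "error", "was", "logged", "by", "the", "server"],
]

# vector_size is the embedding-dimension hyperparameter discussed above
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

print(model.wv["server"].shape)               # a 100-dimensional vector
print(model.wv.most_similar("error", topn=3)) # nearest words in the embedding space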

One difference between embeddings and feature vectors is that embeddings are typically learned automatically from the data, while feature vectors are typically chosen based on domain knowledge or feature engineering. However, these two terms are often used interchangeably. Here is a video going over how embeddings are obtained from the words in a sentence with a bag-of-words approach – https://www.youtube.com/watch?v=viZrOnJclY0 .

Pinecone, Milvus, Facebook AI Similarity Search (FAISS), and Google Vertex AI Matching Engine are examples of vector databases (FAISS is strictly a similarity-search library rather than a full database).

The challenge in implementing a vector database is that traditional databases are not optimized for handling high-dimensional vector data, which is often used in machine learning and data science applications.

Vector data is typically represented as arrays of numbers, where each number represents a feature or attribute of the data. For example, an image might be represented as a high-dimensional vector where each dimension represents the color value of a specific pixel. In contrast to traditional databases, where each record consists of a set of fields or columns, vector databases need to store and index large volumes of high-dimensional data in a way that supports efficient similarity search.

In traditional databases, queries are typically based on simple comparisons of scalar values, such as equality or range queries. However, in vector databases, similarity search is the primary operation, which requires specialized algorithms and data structures to efficiently compute the similarity between vectors. These algorithms are designed to handle high-dimensional data and minimize the amount of computation needed to compare vectors, which can be computationally expensive.

There are several specialized algorithms and distance metrics that are commonly used in vector databases to support efficient similarity search. Here are some examples (a short numpy sketch of the first two follows the list):

  1. Euclidean Distance: This is a distance metric that measures the straight-line distance between two points in Euclidean space. It is commonly used in vector databases to compute the distance or similarity between vectors.
  2. Cosine Similarity: This is a similarity metric that measures the cosine of the angle between two vectors. It is commonly used in text-based applications to measure the similarity between documents or word embeddings.
  3. Locality-Sensitive Hashing (LSH): This is a technique used to hash high-dimensional vectors into lower-dimensional buckets based on their similarity. It is commonly used in vector databases to speed up similarity search by reducing the number of comparisons needed to find similar vectors.
  4. Product Quantization: This is a technique used to divide high-dimensional vectors into smaller subvectors and quantize them separately. It is commonly used in vector databases to reduce the dimensionality of the data and speed up similarity search.
  5. Inverted Indexing: This is a technique used to index the vectors based on the values of their individual dimensions. It is commonly used in text-based applications to speed up search queries by indexing the terms in the document.
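
A minimal numpy sketch of the first two metrics (the vectors are made up for illustration):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

# Euclidean distance: straight-line distance between the two points
euclidean = np.linalg.norm(a - b)

# Cosine similarity: cosine of the angle between the two vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, cosine)

A vector database pairs metrics like these with index structures such as LSH or product quantization so that a query does not have to be compared against every stored vector.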

Pinecone provides several indexing and search algorithms, including approximate nearest neighbor search, that are selected automatically based on the properties of the data and the search requirements. You can also influence this behavior, for example by specifying the distance metric (such as cosine, Euclidean, or dot product) when creating an index, or by passing tuning parameters when creating an index or performing a query; the exact parameter names depend on the client version.

While OpenSearch is not specifically designed as a vector database like Pinecone, it provides vector search capabilities through its k-NN plugin. k-NN (k-nearest neighbors) search finds the K nearest neighbors of a query vector in a high-dimensional space. OpenSearch supports both exact and approximate nearest neighbor search, the latter using graph-based algorithms such as HNSW (via engines like nmslib and faiss). To use vector search in OpenSearch, you first index your vector data in a field of the knn_vector type with a declared dimension. You can then perform a nearest neighbor search by specifying the query vector and the number of nearest neighbors to return. OpenSearch also provides support for vector scoring (for example, script scoring with distance functions), which allows you to rank, boost, or filter search results based on their similarity to a query vector.
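
As a sketch of how this looks with the opensearch-py client (the host, index name, field name and vectors are made up for illustration, and exact settings can vary by OpenSearch version):

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Create an index with a knn_vector field (the k-NN plugin must be enabled)
client.indices.create(
    index="log-vectors",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 4},
                "message": {"type": "text"},
            }
        },
    },
)

# Index a document together with its vector
client.index(index="log-vectors", id="1",
             body={"embedding": [0.1, 0.2, 0.3, 0.4], "message": "disk full"})

# Retrieve the 3 nearest neighbors of a query vector
results = client.search(
    index="log-vectors",
    body={"size": 3,
          "query": {"knn": {"embedding": {"vector": [0.1, 0.2, 0.3, 0.5], "k": 3}}}},
)
print(results["hits"]["hits"])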

What kind of vectorization schemes are useful for log processing?

When processing log data, the goal is typically to extract useful information from the log entries and transform them into a format that can be easily analyzed and searched. Vectorization is a common technique used for this purpose, and there are several vectorization schemes that are applicable to log processing. Here are some examples:

  1. Bag-of-words: This is a vectorization scheme that represents a document as a bag of words, where each word is represented by a dimension in the vector and the value of the dimension is the frequency of the word in the document. Bag-of-words can be used to represent log entries as a vector of words, which can be used for tasks such as text classification and anomaly detection.
  2. TF-IDF: This is a vectorization scheme that represents a document as a weighted combination of its term frequency and inverse document frequency. TF-IDF can be used to represent log entries as a vector of weighted words, which can be used for tasks such as information retrieval and text mining.
  3. Word embeddings: This is a vectorization scheme that represents words as dense vectors in a high-dimensional space, where the distance between vectors reflects the semantic similarity between the words. Word embeddings can be used to represent log entries as a vector of word embeddings, which can be used for tasks such as text classification and entity recognition.
  4. Sequence embeddings: This is a vectorization scheme that represents a sequence of words as a dense vector in a high-dimensional space, where the distance between vectors reflects the similarity between the sequences. Sequence embeddings can be used to represent log entries as a vector of sequence embeddings, which can be used for tasks such as sequence classification and anomaly detection.
  5. One-hot encoding: This is a vectorization scheme that represents categorical data as binary vectors, where each dimension corresponds to a possible category and the value of the dimension is 1 if the data belongs to that category and 0 otherwise. One-hot encoding can be used to represent log entries as a vector of categorical features, which can be used for tasks such as classification and clustering.

By using a suitable vectorization scheme, log data can be transformed into a format that can be easily analyzed and searched, enabling tasks such as anomaly detection, root cause analysis, and performance optimization.
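
As a small illustration of the TF-IDF scheme applied to log lines, here is a sketch using scikit-learn (the log lines are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

log_lines = [
    "ERROR disk full on /dev/sda1",
    "INFO request completed in 120 ms",
    "ERROR disk failure detected on /dev/sdb1",
    "INFO user login successful",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(log_lines)   # sparse matrix, one row per log line

# Pairwise similarity between log lines; the two disk ERROR lines score highest
print(cosine_similarity(X).round(2))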

Vector database versus Feature store – what’s the difference?

Both vector databases and feature stores are used to manage and serve high-dimensional data, such as embeddings, vectors, and other numerical representations, but there are some key differences between the two.

A vector database is a database optimized for storing and querying high-dimensional vector data. It provides efficient indexing and search algorithms, such as approximate nearest neighbor search, that allow for fast and scalable similarity search. Vector databases are commonly used in machine learning applications, such as recommendation systems and natural language processing, where the goal is to find similar items or entities based on their vector representations.

A feature store, on the other hand, is a centralized repository for machine learning features that provides a way to store, manage, and share feature data across different applications and teams. It is designed to help data scientists and machine learning engineers build, test, and deploy machine learning models more efficiently by providing a unified interface for accessing and managing features.

While both vector databases and feature stores can store and serve high-dimensional data, the main difference is their focus and use case. Vector databases are designed for efficient similarity search, while feature stores are designed for feature management and sharing across different applications and teams. In practice, they can complement each other in many machine learning workflows, with the vector database providing the efficient similarity search capabilities and the feature store providing a centralized and standardized way to manage and share feature data.

Comparison of Milvus Pinecone Vespa Weaviate Vald GSI Qdrant – https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696

Anyscale – Using an embeddings database to train an LLM using Ray – https://www.anyscale.com/blog/llm-open-source-search-engine-langchain-ray

OpenAI embeddings example – https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

HuggingFace sentence embeddings article – https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a

AWS – https://medium.com/@shankar.arunp/augmenting-large-language-models-with-verified-information-sources-leveraging-aws-sagemaker-and-f6be17fb10a8

Hugging Face – AI models and datasets hub

Hugging Face hosts around 100,000 pre-trained models that can be used for various NLP tasks. The Hugging Face transformers library, which is a popular choice for NLP tasks such as text classification and machine translation, currently supports over 100 model architectures, including popular models such as BERT, GPT-2, and RoBERTa. In addition, Hugging Face provides tools and libraries that allow users to fine-tune and customize these models for specific tasks or datasets.

The datasets can be loaded using the Python datasets package (pip install datasets). An overview is here.
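
For example, a minimal sketch of loading a dataset from the hub (the dataset name is just an example):

from datasets import load_dataset

# Downloads and caches the IMDB movie-review dataset from the Hugging Face hub
dataset = load_dataset("imdb")
print(dataset)               # DatasetDict with train/test splits
print(dataset["train"][0])   # first training example (text + label)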

A Hugging Face Course – https://github.com/huggingface/course

Hugging Face on AWS blog – https://aws.amazon.com/blogs/machine-learning/aws-and-hugging-face-collaborate-to-simplify-and-accelerate-adoption-of-natural-language-processing-models/

CEO Clement Delangue calls it the “GitHub of machine learning.” He credits its emphasis on an open, collaborative approach with making investors confident in the company’s $2 billion valuation. “That’s what is really important to us, makes us successful and makes us different from others in the space.”

DistilBERT is a smaller, faster, and cheaper version of the BERT language model, developed by Hugging Face by controlling the loss function while training a ‘student model’ from a ‘teacher model’. It bucks the trend towards ever larger models and instead focuses on training a more efficient model. It has been “distilled” to reduce its size and computational requirements, making it faster to train and more efficient to run. Despite being smaller than BERT, DistilBERT achieves comparable performance on many NLP tasks (the original paper reports retaining about 97% of BERT’s language-understanding performance). The triple loss function combines a distillation loss, the standard masked-language-modeling (training) loss, and a cosine-distance loss between the student and teacher hidden states.
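
A conceptual PyTorch sketch of such a triple loss is below. This is a simplified illustration of the idea, not the actual DistilBERT training code; the logits, hidden states and labels are assumed to come from a masked-language-modeling batch, and the three loss terms are left unweighted here.

import torch
import torch.nn.functional as F

def distillation_triple_loss(student_logits, teacher_logits,
                             student_hidden, teacher_hidden, labels, T=2.0):
    # 1) Distillation loss: KL divergence between softened teacher and student distributions
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # 2) Standard masked-language-modeling (training) loss on the student
    loss_mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                               labels.view(-1), ignore_index=-100)

    # 3) Cosine-distance loss aligning student and teacher hidden states
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1))
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )
    return loss_distill + loss_mlm + loss_cos

# Toy shapes: batch of 2 sequences of length 4, vocabulary of 10, hidden size 8
B, L, V, H = 2, 4, 10, 8
print(distillation_triple_loss(torch.randn(B, L, V), torch.randn(B, L, V),
                               torch.randn(B, L, H), torch.randn(B, L, H),
                               torch.randint(0, V, (B, L))))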

Examples of widely used models available on the Hugging Face platform include:

  1. GPT-2: GPT-2 (Generative Pre-training Transformer 2) is a large-scale language model developed by OpenAI that can be used for tasks such as language translation and text generation.
  2. BERT: BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google that can be used for tasks such as text classification and question answering.
  3. RoBERTa: RoBERTa (Robustly Optimized BERT Approach) is a language model developed by Facebook that is based on the BERT model and can be used for tasks such as text classification and sentiment analysis.
  4. T5: T5 (Text-To-Text Transfer Transformer) is a language model developed by Google that can be used for tasks such as language translation and text summarization.
  5. DistilBERT, described above, along with DistilGPT2, a distilled version of GPT-2 (used in the code example below). To generate text, you would typically use a decoder model such as DistilGPT2, optionally fine-tuned on a dataset relevant to your task. Once the model is ready, you generate text by providing it with a prompt or seed text and letting it predict the next word or sequence of words.

Docs on text generation – https://huggingface.co/transformers/v3.1.0/main_classes/model.html?highlight=generate

Here’s an example of using transformers to generate some text.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer (AutoModelWithLMHead is the older, now-deprecated class name)
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

# Encode the prompt
input_context_prompt = "Men on the moon "
input_ids = tokenizer.encode(input_context_prompt, return_tensors='pt')  # encode input context

# Generate text; do_sample=True is required for the temperature setting to have an effect
outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.9,
                         num_return_sequences=10, do_sample=True)

# Decode and print the candidate outputs
for i in range(10):  # 10 output sequences were generated
    print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))

Note the temperature parameter in model.generate(). A temperature close to zero makes generation effectively greedy, choosing the most likely next word each time, while a higher temperature flattens the distribution so that less likely words are also sampled during the generation process.

Distributed Training – Parameter server, Data and Model parallelism

Distributed Training aims to reduce the time to train a model in machine learning by splitting the training workload across multiple nodes. It has gained in importance as data sizes, model sizes and the complexity of training have grown. Training consists of iteratively minimizing an objective function by running the data through the model and determining a) the error (forward pass) and the gradients with which to adjust the model parameters (backward pass), and b) the updated model parameters using the calculated gradients (optimizer step). The gradient averaging and parameter update always require synchronization between the nodes, and in some cases the forward pass also requires such communication.

There are three approaches to distributed training – data parallelism, model parallelism and data-model parallelism. Data parallelism is the most common approach and is preferred if the model fits in GPU memory (which is increasingly hard for large models).

In data parallelism, we partition the data onto different GPUs and run the same model on these data partitions. The same model is present on all GPU nodes and no communication between nodes is needed on the forward pass. The gradients calculated on each node are sent to a parameter server, which averages them, and the updated parameters are retrieved back by all the nodes, so that every node continues with the same incrementally updated model.
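
A toy numpy sketch of this parameter-server idea (purely illustrative: each "worker" computes a gradient on its own data shard, the "server" averages the gradients, and every worker then uses the same updated parameters):

import numpy as np

np.random.seed(0)
w = np.zeros(3)                                   # shared model parameters
true_w = np.array([1.0, -2.0, 0.5])
data_shards = [np.random.randn(100, 3) for _ in range(4)]   # one shard per "worker"
target_shards = [X @ true_w for X in data_shards]

def worker_gradient(w, X, y):
    # gradient of mean squared error for a linear model, computed on this worker's shard
    return 2 * X.T @ (X @ w - y) / len(y)

lr = 0.1
for step in range(50):
    grads = [worker_gradient(w, X, y) for X, y in zip(data_shards, target_shards)]
    avg_grad = np.mean(grads, axis=0)             # the "parameter server" averages gradients
    w = w - lr * avg_grad                         # all workers pull the same updated parameters

print(w)                                          # converges towards true_w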

In model parallelism, we partition the model itself into parts and run these on different GPUs. This applies to large models such as large language models (LLMs) that do not fit in a single GPU.
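
A minimal PyTorch sketch of model parallelism, assuming two GPUs (cuda:0 and cuda:1) are available; the layer sizes are made up for illustration:

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the model lives on GPU 0, second half on GPU 1
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))   # activations are copied between the GPUs
        return x

model = TwoGPUModel()
print(model(torch.randn(8, 1024)).shape)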

A paper on Parameter Servers is here, on Scaling Distributed Machine Learning with the Parameter Server.

To communicate the intermediate results between nodes, MPI-style collective primitives are leveraged, including AllReduce.

The amount of training data for BERT is roughly 16GB of text (about 3.3 billion words). The BERT-Tiny model is 17MB and the BERT-Base model is ~400MB. Even so, training BERT-Base on a 16GB-memory GPU can hit an OOM error, depending on the batch size and sequence length.

Some links to resources –

https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/

https://github.com/horovod/horovod/blob/master/docs/concepts.rst (Horovod, an open source distributed training framework that uses ring-allreduce rather than a parameter server).

https://medium.com/pytorch/how-lyft-uses-pytorch-to-power-machine-learning-for-their-self-driving-cars-80642bc2d0ae

https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

https://aws.amazon.com/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/

https://openai.com/blog/scaling-kubernetes-to-2500-nodes/

https://towardsdatascience.com/distributed-deep-learning-training-with-horovod-on-kubernetes-6b28ac1d6b5d

https://mccormickml.com/2019/11/05/GLUE/ Origin of General Language Understanding Evaluation.

https://github.com/google-research/bert

https://towardsdatascience.com/model-parallelism-in-one-line-of-code-352b7de5645a

https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/

Horovod core principles are based on the MPI concepts size, rank, local rank, allreduce, allgather, and broadcast. These are best explained by example. Say we launched a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:

  • Size would be the number of processes, in this case, 16.
  • Rank would be the unique process ID from 0 to 15 (size – 1).
  • Local rank would be the unique process ID within the server from 0 to 3.
  • Allreduce is an operation that aggregates data among multiple processes and distributes the results back to them. Allreduce is used to average dense tensors (there is a good illustration in the MPI Tutorial).
  • Allgather is an operation that gathers data from all processes in a group and then sends the combined data back to every process. Allgather is used to collect values of sparse tensors (also illustrated in the MPI Tutorial).
  • Broadcast is an operation that broadcasts data from one process, identified by root rank, onto every other process.
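
A minimal sketch of how these concepts appear in a Horovod training script, assuming horovod.torch is installed and a GPU is available per process (the model and optimizer are placeholders):

import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # local rank selects the GPU on this server
print("size:", hvd.size(), "rank:", hvd.rank(), "local rank:", hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all processes with allreduce
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state from rank 0 so all workers start identically
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)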

Horovod switched from using MPI to using NCCL (NVIDIA Collective Communications Library) for distributing the initial weights and biases, and for averaging gradients across workers after each training step.

NCCL is a library that provides primitives for communication between multiple GPUs both within a node and across different nodes.

Horovod continues to use MPI for functions that do not involve inter-GPU communication, such as informing processes on different nodes of their ID (aka rank) and their master vs non-master status for coordination between processes, and for sharing the total number of nodes.

NVIDIA NCCL uses NVLink, which is the hardware interconnect that connects multiple GPUs within a node.

NVLink is a high-speed, point-to-point interconnect technology developed by NVIDIA that is designed to enable high-bandwidth communication between processors, GPUs, and other components in a system.

NVLink 1.0, which was introduced in 2016 with the Pascal P100 GPU, provides a total bidirectional bandwidth of 160 GB/s per GPU (across 4 links). This means that data can be transferred between two devices at a rate of up to 80 GB/s in each direction.

NVLink 2.0, which was introduced in 2017 with the Volta V100, provides a total bidirectional bandwidth of 300 GB/s per GPU (across 6 links). This represents a significant increase in bandwidth compared to NVLink 1.0, and allows for even faster data transfer rates between devices.

NVLink 3.0, which was introduced in 2020 with the Ampere A100, provides a total bidirectional bandwidth of 600 GB/s per GPU (across 12 links), making it the fastest version of NVLink to date.

Multimodal neurons and typographic attacks

https://openai.com/blog/multimodal-neurons/

ML Training on images and text together leads to certain neurons holding information of both images and text – multimodal neurons.

When the type of the detected object can be changed by tricking the model into reading text in the image rather than recognizing the visual content (for example, an apple with a handwritten label saying “iPod” being classified as an iPod), that is called a typographic attack.

Intriguing concepts indicating that a fluid crossover from text to images and back is almost here.

There are a few potential security concerns to consider when working with language models:

  1. Data privacy: Language models often require large amounts of data to be trained, and this data may contain sensitive or personal information. It is important to ensure that this data is protected and that appropriate measures are in place to prevent it from being accessed by unauthorized parties.
  2. Model security: Language models can be vulnerable to attacks such as adversarial examples, in which an attacker intentionally manipulates the input to the model in order to cause it to make incorrect predictions. It is important to consider the security of the model and take steps to protect it against these types of attacks.
  3. Misuse: Language models have the potential to be misused, for example by generating fake or misleading content. It is important to consider the potential unintended consequences of using language models and to put safeguards in place to prevent their misuse.
  4. Bias: Language models can sometimes exhibit biases due to the data they are trained on. It is important to consider the potential biases in a model and take steps to mitigate them.
  5. Intellectual property: Language models may be protected by intellectual property laws, and it is important to respect these laws and obtain the appropriate licenses when using language models developed by others.

ML Transformer and GPT-2 Meetup

AI meetup, GPT-2 demo and discussion. Attention!

“The attention mechanism allows the model to create the context vector as a weighted sum of the hidden states of the encoder RNN at each previous timestamp.”

“Transformer is a type of model based entirely on attention, and does not require recurrent or convolutional layers”

The context vector is the output of the Encoder in an Encoder-Decoder network (EDN). EDNs struggle to retain all the information the decoder needs to decode accurately. Attention is a mechanism to solve this problem.

“Attention mechanisms let a model directly look at, and draw from, the state at any earlier point in the sentence. The attention layer can access all previous states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens.”

GPT: Generative Pre-Trained Transformer. Unlike BERT, it is generative and not geared to comprehension, translation or summarization tasks, but instead to writing or generative tasks. It uses unsupervised (self-supervised) learning to train a deep, decoder-only transformer network on next-token prediction, rather than an encoder-decoder seq2seq model. It does not use reinforcement learning (feedback from the environment) or supervised labels. It uses “masked” (causal) self-attention, so that during training on its dataset each position can only attend to earlier positions when predicting the next token.

The term “generative” is used to emphasize GPT’s ability to generate new, original text, rather than just processing or analyzing text that already exists. A generative model is a type of machine learning model that is trained to produce data, such as text, images, or music, that is similar to the data it was trained on. GPT is a generative model because it is trained on a large corpus of text data and can then generate new text that is similar to the text in its training data. This allows GPT to produce human-like text on a wide range of topics, which can be useful for a variety of applications, such as language translation, text summarization, and question answering.

A “transformer” is a type of neural network architecture that was introduced in 2017. It is a deep learning model that is used for natural language processing tasks, such as language translation and text summarization. A transformer consists of two main components: an encoder, which processes the input text, and a decoder, which generates the output text. The encoder and decoder are connected by a series of attention mechanisms, which allow the model to focus on different parts of the input text as it generates the output. This architecture allows the model to process input text in a parallel, rather than sequential, manner, which makes it more efficient and effective than previous models. The transformer architecture has been widely adopted in natural language processing and has been shown to be highly effective for many tasks.

In a transformer, the “attention” mechanism allows the model to focus on different parts of the input text at different times as it generates the output text. This is different from previous neural network models, which processed the input text sequentially, one word at a time. The attention mechanism in a transformer works by calculating a weight for each word in the input text. This weight represents the importance of that word in the context of the current output word that the model is generating. The model then uses these weights to decide which words in the input text to focus on as it generates the output. This allows the model to selectively focus on the most relevant words in the input text.
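
A minimal numpy sketch of this attention-weight computation (scaled dot-product attention over a made-up sequence of 4 token vectors; the projection matrices are random here, whereas in a real transformer they are learned):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
d = 8                                  # embedding dimension
X = np.random.randn(4, d)              # 4 token vectors (made up)

# Q, K and V are learned linear projections of X in a real transformer;
# random matrices are used here purely for illustration.
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)          # one weight per (output token, input token) pair
weights = softmax(scores, axis=-1)     # each row sums to 1
output = weights @ V                   # weighted combination of the value vectors

print(weights.round(2))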

The Transformer (https://en.m.wikipedia.org/wiki/Transformer_(machine_learning_model)) was initially released in June 2017 by the Google Brain team.

GPT was released June 2018 by OpenAI.

BERT was released Oct 2018 by Google.

GPT-2 was announced Feb 2019 by OpenAI, trained on 40GB of text.

GPT-3 was introduced May 2020 and was in beta testing in July 2020. It was trained on roughly an order of magnitude more data (about 570GB of filtered text).

BERT is a response to GPT and GPT-2 is in turn a response to BERT.

This attention concept looks akin to a Fourier or Laplace transform, which encodes the entire input signal in a lossless manner that allows sections or bands of it to be referred to later. Although implemented differently, it’s a way to keep track of and refer to global state.

AutoML and Transformer – http://ai.googleblog.com/2019/06/applying-automl-to-transformer.html

Transformer architecture

BERT and GPT are both based on the Transformer ideas. BERT is bidirectional and better at comprehending meaning from the whole sentence/phrase, whereas GPT is better at generating text.

https://jalammar.github.io/illustrated-transformer/

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

Bahdanau et al., 2014 introduced the concept of Attention – https://arxiv.org/abs/1409.0473

“The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.”

This description makes it more like a wavelet transform, which does auto-correlations of a signal at different levels of granularity to make sense of it.

Conceptual progression

  1. Input -> Encoder -> Decoder -> Output
  2. The Encoder maintains Hidden States to parse/grok the input. These are vectors. Once it has gone through the input, it passes the final Hidden State, called the Context, forward to the Decoder.
  3. This Context is the bottleneck in the operation of the Decoder.
  4. The Attention concept introduced by Bahdanau and others was to overcome the bottleneck in the Context.
  5. With Attention, the entire set of intermediate Hidden States is passed on to the Decoder, not just the final Context.
  6. The Decoder then performs a couple of additional steps: a) it assigns a score to each Hidden State, and b) it multiplies each Hidden State by its (normalized) score. The weighted sum of these scored vectors is then used by the Decoder to produce the Output.
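
A toy numpy sketch of steps 5 and 6, using dot-product scoring as a stand-in for the learned scoring function (all values are made up for illustration):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

np.random.seed(1)
hidden_states = np.random.randn(5, 16)    # encoder Hidden States, one per input token
decoder_state = np.random.randn(16)       # current Decoder hidden state

scores = hidden_states @ decoder_state    # a) score each Hidden State
weights = softmax(scores)                 # normalize the scores
context = weights @ hidden_states         # b) weighted sum -> context for this output step

print(weights.round(2), context.shape)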