Tag: transformers

Hessians and optimizers

January 18, 2026 · May 20, 2026

The Hessian matrices (the second derivatives of the loss function with respect to the weights) have a peculiar organization. For the past twenty years, researchers have noticed that the mass of these huge matrices is almost entirely concentrated in blocks along the diagonal. This is like a filing cabinet where all the important stuff sits in labeled drawers and the drawers barely talk to each other.

A new paper by Dong, Zhang, Yao, and Sun offers an explanation of why this happens, and their answer is that it’s about the number of classes in your classification problem, not the math of cross-entropy loss as previously believed. This carries practical consequences for how we train large language models.

Back in 2004, Ronan Collobert noticed something odd while analyzing neural network optimization. When he looked at the Hessian matrix (the landscape of curvature in your loss function), it had a block structure. The diagonal blocks (where a group of parameters interacts with itself) were huge, but the off-diagonal blocks (where different parameter groups interact) were tiny. He proposed an explanation: the cross-entropy loss function creates this through a term called $p (1 - p)$, which goes to zero as training progresses.

Here’s why it didn’t make sense, though nobody noticed: if cross-entropy loss was the culprit, why did the block structure show up before any training even happened? The network could be randomly initialized, weights completely random, and boom—already block-diagonal. You haven’t updated a single weight yet. Also, if you used a different loss function (like mean squared error), you still got block structure with multiple classes, so it wasn’t actually about cross-entropy at all.

The real story is more elegant: the block structure emerges directly from how many classes your problem has. With just two classes, forget it—no block structure. With a thousand classes, you get weak structure. With 32,000 classes like Llama 2’s vocabulary, you get basically perfect block-diagonal structure before you’ve even looked at your first data point.

How the Softmax Creates Block Structure (Even at Initialization)

Let’s walk through the math, because once you see it, it clicks.

Imagine a linear classifier. You have weights $V$ (one row $v_{c}$ per class), input data $x_{n}$, and you want to predict which of $C$ classes is correct. The softmax gives you: $p_{n, c} = \frac{\exp (v_{c}^{T} x_{n})}{\sum_{j = 1}^{C} \exp (v_{j}^{T} x_{n})}$

This is a probability distribution: all the $p_{n, c}$ values sum to one. When weights are randomly initialized with standard initialization (like He or Xavier), something magical happens: because no class has seen any data yet, each class gets approximately equal probability. For 100 classes, $p_{n, c} \approx 1 / 100$ for each class. For 1,000 classes, $p_{n, c} \approx 1 / 1000$. For 32,000 classes, $p_{n, c} \approx 1 / 32000$.

Now here’s where it gets interesting. The loss function measures how badly you predicted the correct class: $L = - \frac{1}{N} \sum_{n = 1}^{N} \log p_{n, y_{n}}$

To understand the shape of this loss—its curvature in different directions—you compute second derivatives. This is the Hessian.

When you compute the second derivative with respect to class $i$’s parameters, something different happens depending on whether you’re looking at the diagonal blocks (interactions of class $i$ with itself) or the off-diagonal blocks (interactions of class $i$ with class $j$):

Diagonal blocks (class $i$ with itself): $H_{i i} = \frac{1}{N} \sum_{n = 1}^{N} p_{n, i} (1 - p_{n, i}) x_{n} x_{n}^{T}$

Notice the coefficient: $p_{n, i} (1 - p_{n, i})$. This has two parts. The first, $p_{n, i}$, tells you how often we predict class $i$. The second, $(1 - p_{n, i})$, tells you how much “room” there is to change that prediction. When $p_{n, i} = 1 / C$, this product is approximately $1 / C$.

Off-diagonal blocks (class $i$ interacting with class $j$, where $i \neq j$): $H_{i j} = - \frac{1}{N} \sum_{n = 1}^{N} p_{n, i} p_{n, j} x_{n} x_{n}^{T}$

Now the coefficient is $p_{n, i} p_{n, j}$: both probabilities multiplied together. When both equal $1 / C$, this product is $1 / C^{2}$, which is much smaller than $1 / C$.

This is the key: the same data term $x_{n} x_{n}^{T}$ appears in both, but the coefficients differ. The diagonal gets a factor of $1 / C$ while the off-diagonals get $1 / C^{2}$. So when you measure the total size of each block using the Frobenius norm (the square root of the sum of all squared entries): $\frac{\| H_{i j} \|_{F}}{\| H_{i i} \|_{F}} \approx \frac{1 / C^{2}}{1 / C} = \frac{1}{C}$

With $C = 100$, the off-diagonal is 1% the size of the diagonal. With $C = 32{,}000$, it’s about 0.003% the size. The matrix is essentially block-diagonal.
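The ratio above is easy to check numerically. Here is a minimal sketch, assuming synthetic Gaussian inputs and a small random initialization (the function name `block_ratio` and the values of `d`, `N`, and `scale` are illustrative choices, not taken from the paper). It builds one diagonal block and one off-diagonal block from the formulas above and compares their Frobenius norms.

```python
import numpy as np

def block_ratio(C, d=16, N=256, scale=0.01, seed=0):
    """Return ||H_01||_F / ||H_00||_F for a randomly initialized linear softmax classifier."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, d))              # synthetic inputs x_n (assumption)
    V = scale * rng.standard_normal((C, d))      # small random weights -> near-uniform softmax
    logits = X @ V.T
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)            # P[n, c] = p_{n,c}
    w_diag = P[:, 0] * (1.0 - P[:, 0])           # coefficient p(1 - p) of the diagonal block
    w_off = -P[:, 0] * P[:, 1]                   # coefficient -p_i p_j of the off-diagonal block
    H_00 = (X * w_diag[:, None]).T @ X / N
    H_01 = (X * w_off[:, None]).T @ X / N
    # np.linalg.norm of a 2-D array defaults to the Frobenius norm
    return np.linalg.norm(H_01) / np.linalg.norm(H_00)

ratios = {C: block_ratio(C) for C in (10, 100, 1000)}
for C, r in ratios.items():
    print(f"C={C:5d}  ratio={r:.5f}  1/(C-1)={1 / (C - 1):.5f}")
```

With near-uniform probabilities the measured ratio should track $1 / (C - 1)$, which is the $1 / C$ estimate with the $(1 - p)$ factor kept.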

But why does softmax create this difference? The answer is in the softmax derivative itself. When you change class $i$’s parameters, you don’t just change $p_{n, i}$; you change all the probabilities, because they must sum to one. This coupling is where the $p (1 - p)$ term on the diagonal comes from (the “self-coupling” of class $i$), and also where the $- p_{i} p_{j}$ cross-term on the off-diagonals comes from (increasing one probability forces the others down).
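The coupling can be seen directly in the softmax Jacobian, which is $\mathrm{diag}(p) - p p^{T}$. A small sketch (the logits are arbitrary example values) that builds this matrix and checks it against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = d p_i / d z_j = p_i (1 - p_i) if i == j, else -p_i p_j."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([0.3, -1.2, 0.8, 0.0])  # arbitrary example logits
p = softmax(z)
J = softmax_jacobian(z)

# Central finite differences: column j holds d p / d z_j
eps = 1e-6
J_num = np.stack(
    [(softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps) for e in np.eye(len(z))],
    axis=1,
)

print(np.round(J, 4))
print("columns sum to zero:", np.allclose(J.sum(axis=0), 0.0))
```

Each column sums to zero because the probabilities must keep summing to one: whatever class $i$ gains, the other classes lose.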

Why This Was Misunderstood for So Long

Collobert saw that cross-entropy had this special $p (1 - p)$ term and thought “aha! That must be why.” But he missed two critical facts:

First, the $p (1 - p)$ term is not a special property of the diagonal on its own: the off-diagonal blocks carry a closely related softmax coefficient, $- p_{i} p_{j}$. What’s special is that the ratio between the two depends on $C$, not on properties of the loss function.

Second, and this is the killer blow: Collobert only tested small numbers of classes. He’d use cross-entropy with multi-class problems (where he saw block structure) and compare to mean squared error with binary classification (where he didn’t). He was comparing apples to oranges—he was varying both the loss and the number of classes simultaneously, so of course he couldn’t figure out which one mattered. It turns out the number of classes is all that matters.

This is why the new paper’s insight is so satisfying. It reveals that you can take the simplest possible loss (just raw squared error), with the simplest architecture (linear!), and still get block-diagonal structure if $C$ is large. You don’t need any special properties of cross-entropy. You don’t need training to happen. You just need lots of classes.

Before Training Even Starts

This deserves emphasis because it’s genuinely counterintuitive: the Hessian at random initialization is already block-diagonal.

What does this mean in practice? Imagine you initialize a neural network for ImageNet (1,000 classes) with random weights. Before you’ve seen a single data point, before you’ve computed a single gradient, before you’ve updated a single weight: if you were to compute the Hessian matrix of your loss function at those random weights, it would be block-diagonal.

How? Well, the loss function is completely defined. You have a loss that depends on your current weights. The loss is high (your random network predicts garbage), but it’s defined. The second derivative of that loss with respect to your weights is well-defined too. And because of the softmax’s behavior with uniform probabilities, that Hessian has block structure.

This is what the paper means by the “static force”: a force that exists due to architecture, not due to training. The architecture says “we have 1,000 classes” and the softmax probabilities immediately become $p \approx 1 / 1000$, and boom, block structure emerges.

Later, during training, another force emerges—the “dynamic force.” As the network learns, the probabilities become less uniform (maybe it’s very confident the image is a dog), and cross-layer interactions evolve. But by then, the block-diagonal foundation is already there.

What This Reveals About Optimizers

This discovery has concrete implications for how we optimize neural networks, especially large language models.

Modern LLMs like Llama 3 have vocabularies of roughly 128,000 tokens. That’s 128,000 “classes” in a sense: your model needs to predict which token comes next. According to the theory, this means the Hessian is extraordinarily block-diagonal. With $C = 128{,}000$, the off-diagonal blocks are on the order of 1/128,000 the size of the diagonal blocks.

This is why Adam works so well for LLM training. Adam is a clever optimizer whose per-parameter scaling acts like a diagonal approximation of the curvature: $\theta_{t + 1} = \theta_{t} - \alpha \frac{m_{t}}{\sqrt{v_{t}} + \epsilon}$

What’s $v_{t}$? It’s a running average of squared gradients, which serves as a cheap stand-in for $\mathrm{diag}(H)$: just the diagonal of the Hessian, ignoring everything else.
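To make the update concrete, here is a minimal NumPy sketch of one Adam step. It includes the bias-correction terms of standard Adam, which the simplified formula above leaves out; the hyperparameter defaults and the example numbers are illustrative assumptions.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v have the same shape as theta (one entry per parameter)."""
    m = beta1 * m + (1 - beta1) * grad       # running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0, 3.0])
grad = np.array([0.5, -4.0, 0.01])  # gradients of very different magnitudes
m = np.zeros_like(theta)
v = np.zeros_like(theta)
new_theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta - new_theta)  # each coordinate moves by about lr, regardless of gradient scale
```

The per-coordinate division is what “diagonal” means here: every parameter gets its own scale, and no parameter’s update depends on another’s curvature.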

For a fully dense Hessian, ignoring 99% of the matrix would be terrible. But when the Hessian is block-diagonal, ignoring off-diagonal blocks is almost free. You’re throwing away information that barely matters anyway. This is why Adam suddenly becomes effective for LLMs compared to standard gradient descent.

But here’s where it gets better: researchers recently realized you can go further. If the Hessian is block-diagonal, you don’t need a separate second-moment entry for every parameter; one per block is enough. This led to Adam-mini, which cuts optimizer memory roughly in half while maintaining comparable training quality. Instead of storing second moments for every single parameter, you compute one second moment per block.
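The “one second moment per block” idea can be sketched in a few lines. This is a simplified illustration, not the Adam-mini implementation: the function name `blockwise_adam_step` is hypothetical, and how parameters are partitioned into blocks is simply assumed to be given as a list of arrays.

```python
import numpy as np

def blockwise_adam_step(blocks, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-like step where each block keeps a single scalar second moment.

    blocks, grads, m: lists of arrays (one per block); v: list of floats (one per block).
    """
    new_blocks, new_m, new_v = [], [], []
    for W, g, m_b, v_b in zip(blocks, grads, m, v):
        m_b = beta1 * m_b + (1 - beta1) * g
        # One number per block: the mean squared gradient over the whole block
        v_b = beta2 * v_b + (1 - beta2) * float(np.mean(g ** 2))
        m_hat = m_b / (1 - beta1 ** t)
        v_hat = v_b / (1 - beta2 ** t)
        new_blocks.append(W - lr * m_hat / (np.sqrt(v_hat) + eps))
        new_m.append(m_b)
        new_v.append(v_b)
    return new_blocks, new_m, new_v

blocks = [np.ones((4, 3)), np.ones(5)]            # two toy parameter blocks
grads = [np.full((4, 3), 2.0), np.full(5, -0.5)]
m = [np.zeros_like(W) for W in blocks]
v = [0.0 for _ in blocks]
blocks_1, m_1, v_1 = blockwise_adam_step(blocks, grads, m, v, t=1)
print(len(v_1), "second-moment scalars for", sum(W.size for W in blocks), "parameters")
```

The memory saving comes from `v`: it holds one float per block instead of one per parameter, while the first moment `m` is still stored in full.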

Then there’s Muon, a newer optimizer that goes further. Muon can be read as applying a Newton-like update (using inverse-Hessian-like information) but only within each block, schematically: $W \leftarrow W - \alpha H_{i i}^{- 1} \nabla_{W} L$

This works remarkably well for training transformers because each weight matrix has approximately independent curvature. The block-diagonal structure means you’re not ignoring important cross-layer interactions: there basically aren’t any (they’re $O (1 / C)$ times smaller).

The Theoretical Foundation: Random Matrix Theory

The paper proves these results rigorously using techniques from random matrix theory, a branch of mathematics that studies the statistics of large random matrices. The key tool is the Marchenko-Pastur law, which describes how eigenvalues are distributed in sample covariance matrices.

Why is this relevant? Because each diagonal block of the Hessian can be written as: $H_{i i} = \frac{1}{N} \sum_{n = 1}^{N} w_{n, i} x_{n} x_{n}^{T}$

This looks like a weighted sum of outer products, very similar to a covariance matrix. The weights $w_{n, i}$ depend on the softmax probabilities. The theorem shows that as you have more and more classes, and as the dimension and sample size both grow large, the eigenvalues of these blocks follow a well-known distribution.

To prove this with dependent data (the weights depend on the inputs), the authors use a technique called Lindeberg interpolation. The idea is elegant: you smoothly morph between the actual problem (where weights depend on inputs) and an idealized version (where they don’t), showing that the difference vanishes as you go to infinity. This lets you apply classical random matrix results to a case where they shouldn’t technically apply.
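For intuition about the classical result being borrowed, here is a small sketch of the unweighted case only: the eigenvalues of a sample covariance matrix of i.i.d. standard Gaussian data concentrate on the Marchenko-Pastur interval $[(1 - \sqrt{\gamma})^{2}, (1 + \sqrt{\gamma})^{2}]$ with $\gamma = d / N$. The sizes `N` and `d` are arbitrary illustrative choices, and this does not reproduce the paper’s weighted setting.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 500                  # sample size and dimension (illustrative)
gamma = d / N                     # aspect ratio

X = rng.standard_normal((N, d))   # i.i.d. inputs with unit variance
S = X.T @ X / N                   # sample covariance: (1/N) * sum_n x_n x_n^T
eigs = np.linalg.eigvalsh(S)      # eigenvalues of a symmetric matrix, ascending

lower = (1 - np.sqrt(gamma)) ** 2  # Marchenko-Pastur lower edge
upper = (1 + np.sqrt(gamma)) ** 2  # Marchenko-Pastur upper edge
print(f"empirical range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"MP support:      [{lower:.3f}, {upper:.3f}]")
```

Even though the true covariance is the identity (all eigenvalues equal to one), the sample eigenvalues spread over a wide but predictable interval; that predictability is what the proof relies on.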

Impact on Understanding Neural Network Optimization

This work changes how we think about several things:

First, it unifies our understanding. Previously, block-diagonal structure seemed mysterious or specific to certain losses. Now we see it as a fundamental consequence of having many classes. This is universal—it applies whether you’re using cross-entropy, mean squared error, or any other loss. The class structure is what matters.

Second, it explains why certain optimizers work. Adam, Muon, and block-diagonal preconditioners aren’t magical; they’re just exploiting an underlying property of the loss landscape that’s been there all along, especially for large $C$. This gives confidence that these methods aren’t empirical accidents but are grounded in geometry.

Third, it suggests new optimizers. If you understand the Hessian’s structure, you can design optimizers that respect that structure. You might use different learning rates for different blocks, or apply block-wise preconditioning, or use Newton-like updates within blocks while keeping first-order updates for cross-block terms.

Fourth, it reveals scalability properties. As $C$ increases (larger vocabularies, more classes), the block structure gets stronger, and diagonal approximations get better. This suggests that bigger problems might actually be easier to optimize in some sense, because the Hessian becomes simpler.

Practical Takeaway for Builders

If you’re building or training large language models, here’s what matters:

Your optimizer choice isn’t arbitrary. Adam works well because your Hessian (determined by your vocabulary size, which is huge) is approximately block-diagonal. More sophisticated second-order methods like Muon work even better because they exploit this structure directly.

If you were training a 10-class image classifier, you might not see much benefit from these structure-aware optimizers—the Hessian isn’t that block-diagonal. But for an LLM with a 100,000-word vocabulary? The structure is so strong that any optimizer ignoring it is leaving performance on the table.

The deeper insight is that the problem size and the problem structure are linked. As you scale up (more classes, bigger vocabularies), you’re not just making the problem bigger—you’re changing its fundamental geometry in ways that make it easier to solve with the right algorithms.

And now, thanks to this paper, we know exactly what that geometry is: a collection of nearly independent blocks, connected only by infinitesimal threads, waiting to be exploited.

LLM evolution – Anthropic, AI21, Cohere, GPT-4

May 14, 2023 · June 21, 2023

https://github.com/Mooler0410/LLMsPracticalGuide

Source paper – Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Pink branch is encoder only. Green branch is encoder-decoder. Blue branch is decoder-only.

This is consistent with the Generative aspect of the blue branch. But it does not explain the emergent properties at the top of the blue tree.

LLM leaderboard – https://chat.lmsys.org/?leaderboard

Stanford HELM (holistic evaluation of LMs) – https://crfm.stanford.edu/helm/latest/?models=1

Constitutional AI paper from Anthropic – https://arxiv.org/abs/2212.08073

More on emergent properties in links below.

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1

https://openai.com/research/solving-math-word-problems : Autoregressive models, which generate each solution token by token, have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable, as can be seen in the examples provided. We address this problem by training verifiers to evaluate the correctness of model-generated solutions. Verifiers are given many possible solutions, all written by the model itself, and they are trained to decide which ones, if any, are correct.

Language Models are Few-Shot Learners – https://openai.com/research/language-models-are-few-shot-learners

LLM inferencing tools/techniques were discussed here.

Feature Vectors, Embeddings, Vector Databases, Feature Stores

April 8, 2023 · June 19, 2023

An ML model consists of a set of weights (or a set of numerical values) that transform inputs to outputs (along with a nonlinear transform such as a sigmoid function). The weights are often organized as vectors or matrices. Consider neural networks, decision trees and support vector machines as types of ML models for this discussion.

The weights representing features of the data (input or intermediate data) are also called feature vectors or vectors. They are also called embeddings, that is embeddings of vectors in a vector space. We discussed such vectors in https://securemachinery.com/2019/05/24/transformer-gpt-2/.

The term “embedding” comes from the idea that the vectors “embed” the original data into a lower-dimensional space. The embedding process involves a combination of statistical and computational techniques, such as factorization and neural networks, that learn to map the input data into the vector space in a way that preserves the relevant properties of the original data.

The use of dense vectors to represent words became widespread in machine learning research in 2013 with the publication of the paper “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov et al. This paper built on the word2vec algorithm, which generates dense vector representations of words based on their distributional properties in a large corpus of text. The size of the vector or embedding in a word embedding model is a hyperparameter that needs to be determined before training the model. It is typically chosen based on the size of the vocabulary and the complexity of the task at hand. In practice, the vector size is often set to be between 100 and 300 dimensions, but this can vary depending on the specific application and the available computational resources. The optimal vector size can be determined through experimentation and tuning of hyperparameters.

One difference between embeddings and feature vectors is that embeddings are typically learned automatically from the data, while feature vectors are typically chosen based on domain knowledge or feature engineering. However these two terms are often used interchangeably. Here is a video going over how the embeddings are obtained from words in a sentence with a bag of words approach- https://www.youtube.com/watch?v=viZrOnJclY0 .

Pinecone, Milvus, Facebook AI Similarity Search (FAISS, strictly a similarity-search library rather than a full database), and Google Vertex Matching Engine are examples of vector databases and vector search tools.

The challenge in implementing a vector database is that traditional databases are not optimized for handling high-dimensional vector data, which is often used in machine learning and data science applications.

Vector data is typically represented as arrays of numbers, where each number represents a feature or attribute of the data. For example, an image might be represented as a high-dimensional vector where each dimension represents the color value of a specific pixel. In contrast to traditional databases, where each record consists of a set of fields or columns, vector databases need to store and index large volumes of high-dimensional data in a way that supports efficient similarity search.

In traditional databases, queries are typically based on simple comparisons of scalar values, such as equality or range queries. However, in vector databases, similarity search is the primary operation, which requires specialized algorithms and data structures to efficiently compute the similarity between vectors. These algorithms are designed to handle high-dimensional data and minimize the amount of computation needed to compare vectors, which can be computationally expensive.

There are several specialized algorithms that are commonly used in vector databases to support efficient similarity search. Here are some examples:

Euclidean Distance: This is a distance metric that measures the straight-line distance between two points in Euclidean space. It is commonly used in vector databases to compute the distance or similarity between vectors.
Cosine Similarity: This is a similarity metric that measures the cosine of the angle between two vectors. It is commonly used in text-based applications to measure the similarity between documents or word embeddings.
Locality-Sensitive Hashing (LSH): This is a technique used to hash high-dimensional vectors into lower-dimensional buckets based on their similarity. It is commonly used in vector databases to speed up similarity search by reducing the number of comparisons needed to find similar vectors.
Product Quantization: This is a technique used to divide high-dimensional vectors into smaller subvectors and quantize them separately. It is commonly used in vector databases to reduce the dimensionality of the data and speed up similarity search.
Inverted Indexing: This is a technique used to index the vectors based on the values of their individual dimensions. It is commonly used in text-based applications to speed up search queries by indexing the terms in the document.
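The two metrics at the top of this list are simple enough to write out. Below is a minimal brute-force sketch with NumPy (function names and example vectors are illustrative). It compares the query against every stored vector, which is exactly the cost that techniques like LSH and product quantization are meant to avoid at scale.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_cosine(query, vectors, k=2):
    """Exact nearest neighbors by cosine similarity: O(number of vectors) per query."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed @ (query / np.linalg.norm(query))
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]

vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])
hits = top_k_cosine(query, vectors, k=2)
print(hits)  # (index, score) pairs, best match first
```

Note that cosine similarity ignores vector length: the query `[2.0, 0.1]` matches `[1.0, 0.0]` best even though their Euclidean distance is not small.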

Pinecone provides approximate nearest neighbor indexing and search that is managed automatically based on the properties of the data and the search requirements. You can still control some choices yourself: for example, the distance metric (such as cosine, euclidean, or dot product) is specified through the metric parameter when creating an index.

While OpenSearch is not specifically designed as a vector database like Pinecone, it provides vector search capabilities through its k-NN plugin. Exact k-nearest-neighbor (k-NN) search finds the k vectors closest to a query vector in a high-dimensional space. OpenSearch also supports approximate nearest neighbor search, using graph-based methods such as HNSW through engines like NMSLIB, Faiss, and Lucene. To use vector search in OpenSearch, you first index your vector data in a field of type knn_vector with a fixed dimension. You can then perform a nearest neighbor search by specifying the query vector and the number of nearest neighbors to return. OpenSearch also provides support for vector scoring, which allows you to rank, boost, or filter search results based on their similarity to a query vector.

What kind of vectorization schemes are useful for log processing?

When processing log data, the goal is typically to extract useful information from the log entries and transform them into a format that can be easily analyzed and searched. Vectorization is a common technique used for this purpose, and there are several vectorization schemes that are applicable to log processing. Here are some examples:

Bag-of-words: This is a vectorization scheme that represents a document as a bag of words, where each word is represented by a dimension in the vector and the value of the dimension is the frequency of the word in the document. Bag-of-words can be used to represent log entries as a vector of words, which can be used for tasks such as text classification and anomaly detection.
TF-IDF: This is a vectorization scheme that represents a document as a weighted combination of its term frequency and inverse document frequency. TF-IDF can be used to represent log entries as a vector of weighted words, which can be used for tasks such as information retrieval and text mining.
Word embeddings: This is a vectorization scheme that represents words as dense vectors in a high-dimensional space, where the distance between vectors reflects the semantic similarity between the words. Word embeddings can be used to represent log entries as a vector of word embeddings, which can be used for tasks such as text classification and entity recognition.
Sequence embeddings: This is a vectorization scheme that represents a sequence of words as a dense vector in a high-dimensional space, where the distance between vectors reflects the similarity between the sequences. Sequence embeddings can be used to represent log entries as a vector of sequence embeddings, which can be used for tasks such as sequence classification and anomaly detection.
One-hot encoding: This is a vectorization scheme that represents categorical data as binary vectors, where each dimension corresponds to a possible category and the value of the dimension is 1 if the data belongs to that category and 0 otherwise. One-hot encoding can be used to represent log entries as a vector of categorical features, which can be used for tasks such as classification and clustering.

By using a suitable vectorization scheme, log data can be transformed into a format that can be easily analyzed and searched, enabling tasks such as anomaly detection, root cause analysis, and performance optimization.
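As a small illustration of the first two schemes, here is a pure-Python sketch that turns log lines into bag-of-words counts and TF-IDF vectors. The log lines are made up, the tokenizer is deliberately crude, and the IDF formula used (`log(n / df) + 1`) is one common variant among several.

```python
import math
import re
from collections import Counter

def tokenize(line):
    """Lowercase the line and keep alphabetic tokens only (a crude tokenizer)."""
    return re.findall(r"[a-z]+", line.lower())

def tfidf_vectors(lines):
    """Return (vocab, vectors) where vectors[i][j] = count of vocab[j] in line i times its IDF."""
    docs = [Counter(tokenize(line)) for line in lines]  # bag-of-words counts per line
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct token once per line
    df = Counter(tok for doc in docs for tok in doc)
    idf = {tok: math.log(n / df[tok]) + 1.0 for tok in vocab}
    vectors = [[doc[tok] * idf[tok] for tok in vocab] for doc in docs]
    return vocab, vectors

logs = [  # hypothetical log lines
    "ERROR disk full on node A",
    "INFO user login ok",
    "ERROR disk full on node B",
]
vocab, vectors = tfidf_vectors(logs)
for line, vec in zip(logs, vectors):
    print(line, "->", [round(x, 2) for x in vec])
```

Tokens that appear in only one line (such as the node name) receive a higher weight than tokens shared across lines (such as `error`), which is what makes TF-IDF useful for spotting unusual entries.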

Vector database versus Feature store – what’s the difference?

Both vector databases and feature stores are used to manage and serve high-dimensional data, such as embeddings, vectors, and other numerical representations, but there are some key differences between the two.

A vector database is a database optimized for storing and querying high-dimensional vector data. It provides efficient indexing and search algorithms, such as approximate nearest neighbor search, that allow for fast and scalable similarity search. Vector databases are commonly used in machine learning applications, such as recommendation systems and natural language processing, where the goal is to find similar items or entities based on their vector representations.

A feature store, on the other hand, is a centralized repository for machine learning features that provides a way to store, manage, and share feature data across different applications and teams. It is designed to help data scientists and machine learning engineers build, test, and deploy machine learning models more efficiently by providing a unified interface for accessing and managing features.

While both vector databases and feature stores can store and serve high-dimensional data, the main difference is their focus and use case. Vector databases are designed for efficient similarity search, while feature stores are designed for feature management and sharing across different applications and teams. In practice, they can complement each other in many machine learning workflows, with the vector database providing the efficient similarity search capabilities and the feature store providing a centralized and standardized way to manage and share feature data.

Comparison of Milvus Pinecone Vespa Weaviate Vald GSI Qdrant – https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696

Anyscale – Using an embeddings database to train an LLM using Ray – https://www.anyscale.com/blog/llm-open-source-search-engine-langchain-ray

OpenAI embeddings example – https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

HuggingFace sentence embeddings article – https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a

AWS – https://medium.com/@shankar.arunp/augmenting-large-language-models-with-verified-information-sources-leveraging-aws-sagemaker-and-f6be17fb10a8

Hugging Face – AI models and datasets hub

May 17, 2022 · May 31, 2023

The Hugging Face Hub hosts around 100,000 pre-trained models that can be used for various NLP tasks. The Hugging Face transformers library, which is a popular choice for NLP tasks such as text classification and machine translation, currently supports over 100 model architectures. These include popular models such as BERT, GPT-2, and RoBERTa. In addition, Hugging Face provides tools and libraries that allow users to fine-tune and customize these models for specific tasks or datasets.

The datasets can be loaded using the python datasets package (pip install datasets). An overview is here.

A Hugging Face Course – https://github.com/huggingface/course

Hugging Face on AWS blog – https://aws.amazon.com/blogs/machine-learning/aws-and-hugging-face-collaborate-to-simplify-and-accelerate-adoption-of-natural-language-processing-models/

CEO Clement Delangue calls it the “GitHub of machine learning.” It was its emphasis on an open, collaborative approach that made investors confident in the company’s $2 billion valuation, he said. “That’s what is really important to us, makes us successful and makes us different from others in the space.”

DistilBERT is a smaller, faster, and cheaper version of the BERT language model, developed by Hugging Face by training a ‘student model’ to reproduce the behavior of a ‘teacher model’ through a specially designed loss function. It bucks the trend towards larger models and instead focuses on training a more efficient model. It has been “distilled” to reduce its size and computational requirements, making it faster to train and more efficient to run. Despite being about 40% smaller than BERT, DistilBERT retains about 97% of its language-understanding performance. The triple loss function combines a distillation loss, a masked language modeling (training) loss, and a cosine-distance loss.

Examples of models available on the Hugging Face platform include:

GPT-2: GPT-2 (Generative Pre-trained Transformer 2) is a large-scale decoder-only language model developed by OpenAI that can be used for tasks such as text generation.
BERT: BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only language model developed by Google that can be used for tasks such as text classification and question answering.
RoBERTa: RoBERTa (Robustly Optimized BERT Approach) is a language model developed by Facebook that is based on the BERT model and can be used for tasks such as text classification.
T5: T5 (Text-To-Text Transfer Transformer) is an encoder-decoder language model developed by Google that can be used for tasks such as language translation and text summarization.
DistilBERT, described above. Like BERT, it is an encoder-only model trained to fill in masked words, so it is typically fine-tuned for tasks such as classification rather than used to generate text from a prompt. For next-word text generation, a distilled decoder model such as distilgpt2 is the natural counterpart.

Docs on text generation – https://huggingface.co/transformers/v3.1.0/main_classes/model.html?highlight=generate

Here’s an example of using transformers to generate some text.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

# Encode the prompt
input_context_prompt = "Men on the moon "
input_ids = tokenizer.encode(input_context_prompt, return_tensors='pt')  # PyTorch tensor of token ids

# Generate text: sample 10 continuations of up to 40 tokens each
# (GPT-2 has no pad token, so the end-of-sequence token is used for padding)
outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.9, num_return_sequences=10, do_sample=True, pad_token_id=tokenizer.eos_token_id)

# Decode and print each sampled sequence
for i, output in enumerate(outputs):
    print('Generated {}: {}'.format(i, tokenizer.decode(output, skip_special_tokens=True)))

Note the temperature parameter passed to model.generate(). As the temperature approaches zero, sampling concentrates on the most likely next word (in transformers, fully greedy decoding is selected with do_sample=False rather than with a temperature of exactly zero). A higher temperature allows less likely words to be included in the generation process.
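The effect of temperature is easy to see on a toy distribution. In this sketch (the logits are made-up numbers, and the function is an illustration of the idea rather than the library’s internal code), the logits are divided by the temperature before the softmax:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate next words
for T in (0.1, 0.9, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
```

At a low temperature almost all the probability sits on the top-scoring word; at a high temperature the distribution flattens and unlikely words get sampled more often.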

Distributed Training – Parameter server, Data and Model parallelism

April 10, 2021 · June 1, 2023

Distributed training aims to reduce the time to train a model in machine learning by splitting the training workload across multiple nodes. It has gained in importance as data sizes, model sizes and complexity of training have grown. Training consists of iteratively minimizing an objective function by a) running the data through the model to determine the error (forward path) and b) computing the gradients and using them to update the model parameters (reverse path). The reverse path always requires synchronization between the nodes, and in some cases the forward path also requires such communication.

There are three approaches to distributed training – data parallelism, model parallelism and data-model parallelism. Data parallelism is the more common approach and is preferred if the model fits in GPU memory (which is increasingly hard for large models).

In data parallelism, we partition the data onto different GPUs and run the same model on these data partitions. The same model is present in all GPU nodes and no communication between nodes is needed on the forward path. The calculated gradients are sent to a parameter server, which averages them, and the updated parameters are retrieved by all the nodes so that every node holds the same incrementally updated model.
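As a rough sketch of this loop, the toy example below simulates two workers and a parameter server in a single process. The one-parameter model, the data, and the function names are assumptions made for illustration; real workers would run in parallel on separate GPUs.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def parameter_server_step(w, shards, lr):
    """One synchronous step: workers compute gradients, the server averages them."""
    grads = [local_gradient(w, shard) for shard in shards]  # parallel on real hardware
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad  # every worker then pulls this updated parameter

# Toy dataset generated from y = 3x, split evenly across two simulated workers
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(50):
    w = parameter_server_step(w, shards, lr=0.02)
print(round(w, 3))
```

Because the shards are the same size, averaging the per-worker gradients gives the same update as computing the gradient over the whole dataset on one machine.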

In model parallelism, we partition the model itself into parts and run these on different GPUs. This applies to large models such as large language models (LLMs) that do not fit in a single GPU.
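A minimal sketch of the idea, assuming a hypothetical two-layer network whose layers are placed on two different devices. The Device class is a stand-in written for this example, not a real GPU API.

```python
class Device:
    """Stand-in for one GPU that holds only its own slice of the model."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = weights  # one row of weights per output unit

    def forward(self, inputs):
        # Matrix-vector product followed by ReLU, computed locally on this device
        return [max(0.0, sum(w * x for w, x in zip(row, inputs)))
                for row in self.weights]

# Hypothetical placement: layer 1 lives on gpu0, layer 2 lives on gpu1
gpu0 = Device("gpu0", [[1.0, -1.0], [0.5, 0.5]])
gpu1 = Device("gpu1", [[2.0, 1.0]])

x = [3.0, 1.0]
hidden = gpu0.forward(x)       # activations computed on gpu0 ...
output = gpu1.forward(hidden)  # ... are sent to gpu1, which holds the next layer
print(hidden, output)
```

Neither device ever holds the full set of weights, but the activations have to cross from one device to the other, which is why the forward path needs communication in this setup.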

A paper on parameter servers: Scaling Distributed Machine Learning with the Parameter Server.

MPI primitives, including AllReduce, are used to communicate intermediate results between nodes.

The amount of training data for BERT is ~600GB. The BERT-Tiny model is 17MB and the BERT-Base model is ~400MB. During training, a GPU with 16GB of memory can hit an out-of-memory (OOM) error.

Some links to resources –

https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/

https://github.com/horovod/horovod/blob/master/docs/concepts.rst (Horovod, an open source distributed training framework built on allreduce rather than a parameter server).

https://medium.com/pytorch/how-lyft-uses-pytorch-to-power-machine-learning-for-their-self-driving-cars-80642bc2d0ae

https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html

https://aws.amazon.com/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/

https://openai.com/blog/scaling-kubernetes-to-2500-nodes/

https://towardsdatascience.com/distributed-deep-learning-training-with-horovod-on-kubernetes-6b28ac1d6b5d

https://mccormickml.com/2019/11/05/GLUE/ Origin of General Language Understanding Evaluation.

https://github.com/google-research/bert

https://towardsdatascience.com/model-parallelism-in-one-line-of-code-352b7de5645a

https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/

Horovod core principles are based on the MPI concepts size, rank, local rank, allreduce, allgather, and broadcast. These are best explained by example. Say we launched a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:

Size would be the number of processes, in this case, 16.
Rank would be the unique process ID from 0 to 15 (size – 1).
Local rank would be the unique process ID within the server from 0 to 3.
Allreduce is an operation that aggregates data among multiple processes and distributes results back to them. Allreduce is used to average dense tensors.

Allgather is an operation that gathers data from all processes in a group then sends data back to every process. Allgather is used to collect values of sparse tensors.

Broadcast is an operation that broadcasts data from one process, identified by root rank, onto every other process.
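The concepts above can be simulated in one process to see what each of them returns. This is only a sketch: the helper functions are not the Horovod or MPI API, and the local rank formula assumes ranks are assigned to servers in contiguous blocks.

```python
def allreduce_average(values):
    """Every process ends up with the average of all processes' values."""
    avg = sum(values) / len(values)
    return [avg for _ in values]

def allgather(values):
    """Every process ends up with the full list of all processes' values."""
    return [list(values) for _ in values]

def broadcast(values, root_rank):
    """Every process ends up with the root process's value."""
    return [values[root_rank] for _ in values]

# 4 servers with 4 GPUs each, one process per GPU
servers, gpus_per_server = 4, 4
size = servers * gpus_per_server
ranks = list(range(size))
# Assumption: ranks 0-3 sit on the first server, 4-7 on the second, and so on
local_ranks = [rank % gpus_per_server for rank in ranks]

# Index i of each result list is what the process with rank i holds afterwards
per_process = [float(rank) for rank in ranks]
print(size, ranks[-1], local_ranks[:5])
print(allreduce_average(per_process)[0])
print(allgather(per_process)[3][:4])
print(broadcast(per_process, root_rank=0)[7])
```

In a real job each process only sees its own value and the collective moves data over the network; here the list index plays the role of the rank.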

Horovod switched from using MPI to using NCCL (NVIDIA Collective Communications Library) for distributing initial weights and biases, and intermediate weights and biases after each training step.

NCCL is a library that provides primitives for communication between multiple GPUs both within a node and across different nodes.

Horovod continues to use MPI for other functions that do not involve inter-GPU communication, such as informing processes on different nodes of their id (aka rank), master vs non-master status for coordination between processes and for sharing the total number of nodes.

NVIDIA NCCL uses NVLink, the hardware interconnect that connects multiple GPUs, where it is available.

NVLink is a high-speed, point-to-point interconnect technology developed by NVIDIA that is designed to enable high-bandwidth communication between processors, GPUs, and other components in a system.

NVLink 1.0, which was introduced in 2016, provides a maximum bidirectional bandwidth of 40 GB/s per link. With the four links on a Tesla P100 GPU, that adds up to 160 GB/s in total, or up to 80 GB/s in each direction.

NVLink 2.0, which was introduced in 2017, provides a maximum bidirectional bandwidth of 50 GB/s per link. With the six links on a Tesla V100 GPU, that adds up to 300 GB/s in total, a significant increase compared to NVLink 1.0.

NVLink 3.0, which was introduced in 2020 with the A100 GPU, keeps the per-link bandwidth at 50 GB/s but doubles the number of links to twelve, for a total bidirectional bandwidth of 600 GB/s.

Multimodal neurons typographic attacks

April 4, 2021 · June 10, 2023

https://openai.com/blog/multimodal-neurons/

ML Training on images and text together leads to certain neurons holding information of both images and text – multimodal neurons.

A typographic attack changes the type of object the model detects by tricking it into recognizing a textual description placed in the image instead of the visual content.

Intriguing concepts indicating that a fluid crossover from text to images and back is almost here.

There are a few potential security concerns to consider when working with language models:

Data privacy: Language models often require large amounts of data to be trained, and this data may contain sensitive or personal information. It is important to ensure that this data is protected and that appropriate measures are in place to prevent it from being accessed by unauthorized parties.
Model security: Language models can be vulnerable to attacks such as adversarial examples, in which an attacker intentionally manipulates the input to the model in order to cause it to make incorrect predictions. It is important to consider the security of the model and take steps to protect it against these types of attacks.
Misuse: Language models have the potential to be misused, for example by generating fake or misleading content. It is important to consider the potential unintended consequences of using language models and to put safeguards in place to prevent their misuse.
Bias: Language models can sometimes exhibit biases due to the data they are trained on. It is important to consider the potential biases in a model and take steps to mitigate them.
Intellectual property: Language models may be protected by intellectual property laws, and it is important to respect these laws and obtain the appropriate licenses when using language models developed by others.

ML Transformer and GPT-2 Meetup

May 24, 2019 · December 5, 2022

AI meetup, GPT-2 demo and discussion. Attention!

“The attention mechanism allows the model to create the context vector as a weighted sum of the hidden states of the encoder RNN at each previous timestamp.”

“Transformer is a type of model based entirely on attention, and does not require recurrent or convolutional layers”

Context vector is the output of the Encoder in an Encoder-Decoder network (EDN). EDNs struggle to retain all the required information for the decoder to accurately decode. Attention is a mechanism to solve this problem.

“Attention mechanisms let a model directly look at, and draw from, the state at any earlier point in the sentence. The attention layer can access all previous states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens.”

GPT: Generative Pre-Trained Transformer. Unlike BERT, it is generative and not geared to comprehension, translation or summarization tasks, but instead to writing or generative tasks. It uses unsupervised learning to pre-train a deep neural network, which is a decoder-only Transformer rather than an encoder-decoder seq2seq model. Pre-training does not use reinforcement learning (feedback from environment) or labeled data. It uses “masked self-attention”, in which each position can look only at earlier positions, to predict the next token during training on its dataset.
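To illustrate what the masking in “masked self-attention” does, here is a small sketch that turns raw attention scores into weights while hiding future positions. The score values and the function name are made up for illustration; a real model computes the scores from learned query and key vectors.

```python
import math

def causal_attention_weights(scores):
    """Row i holds position i's raw scores for every position; mask out the future."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # position i may only look at positions 0..i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw attention scores for a 3-token sequence
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [0.0, 1.0, 2.0],
]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

The large score of 5.0 in the second row has no effect because it belongs to a future token, so the model cannot peek at the word it is being trained to predict.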

The term “generative” is used to emphasize GPT’s ability to generate new, original text, rather than just processing or analyzing text that already exists. A generative model is a type of machine learning model that is trained to produce data, such as text, images, or music, that is similar to the data it was trained on. GPT is a generative model because it is trained on a large corpus of text data and can then generate new text that is similar to the text in its training data. This allows GPT to produce human-like text on a wide range of topics, which can be useful for a variety of applications, such as language translation, text summarization, and question answering.

A “transformer” is a type of neural network architecture that was introduced in 2017. It is a deep learning model that is used for natural language processing tasks, such as language translation and text summarization. A transformer consists of two main components: an encoder, which processes the input text, and a decoder, which generates the output text. The encoder and decoder are connected by a series of attention mechanisms, which allow the model to focus on different parts of the input text as it generates the output. This architecture allows the model to process input text in a parallel, rather than sequential, manner, which makes it more efficient and effective than previous models. The transformer architecture has been widely adopted in natural language processing and has been shown to be highly effective for many tasks.

In a transformer, the “attention” mechanism allows the model to focus on different parts of the input text at different times as it generates the output text. This is different from previous neural network models, which processed the input text sequentially, one word at a time. The attention mechanism in a transformer works by calculating a weight for each word in the input text. This weight represents the importance of that word in the context of the current output word that the model is generating. The model then uses these weights to decide which words in the input text to focus on as it generates the output. This allows the model to selectively focus on the most relevant words in the input text.

The Transformer (https://en.m.wikipedia.org/wiki/Transformer_(machine_learning_model)) was initially released in June 2017 by the Google Brain team.

GPT was released June 2018 by OpenAI.

BERT was released Oct 2018 by Google.

GPT-2 was announced Feb 2019 by OpenAI, trained on 40GB of text.

GPT-3 was introduced in May 2020 and was in beta testing in July 2020. It was trained on 10x the data, or 400GB.

BERT is a response to GPT and GPT-2 is in turn a response to BERT.

This attention concept looks akin to a Fourier or Laplace transform, which encodes the entire input signal in a lossless manner in a way that allows sections or bands of it to be referred to later. Although implemented differently, it’s a way to keep track of and refer to global state.

AutoML and Transformer – http://ai.googleblog.com/2019/06/applying-automl-to-transformer.html

BERT and GPT are both based on the Transformer ideas. BERT is bidirectional and better at comprehending meaning from the whole sentence/phrase, whereas GPT is better at generating text.

https://jalammar.github.io/illustrated-transformer/

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

Bahdanau et al. (2014) introduced the concept of Attention: https://arxiv.org/abs/1409.0473

“The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.”

This description makes it more like a wavelet transform, which does auto-correlations of a signal at different levels of granularity to make sense of it.

Conceptual progression

Input -> Encoder -> Decoder -> Output
Encoder maintains Hidden States to parse/grok the input. These are vectors. Once it goes through the input, it passes the final Hidden State, called the Context forward to the Decoder.
This Context is the bottleneck in the operation of the Decoder.
The Attention concept introduced by Bahdanau and others was to overcome the bottleneck in the Context.
With Attention the entire set of intermediate Hidden states is passed on to the Decoder, not just the final Context.
The Decoder does a couple of additional steps compared to before: a) it assigns a score to each Hidden state and normalizes the scores so they sum to one, b) it multiplies each Hidden state by its score. The scored vectors are then summed into a single context vector, which the Decoder uses to produce the Output.
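The last two steps of this progression can be sketched in a few lines. As a simplifying assumption, the score here is a plain dot product between the decoder state and each encoder hidden state; Bahdanau's paper scores with a small learned network instead. The vectors are made-up values.

```python
import math

def attention_context(decoder_state, encoder_states):
    """Score each encoder hidden state, normalize, and return the weighted sum."""
    # a) score: dot product between the decoder state and each hidden state
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax, so the weights sum to one
    # b) multiply each hidden state by its weight and sum into one context vector
    dim = len(encoder_states[0])
    context = [sum(w * hs[k] for w, hs in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Hypothetical encoder hidden states for a 3-word input, and one decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 4.0]

weights, context = attention_context(decoder_state, encoder_states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The first hidden state barely contributes because it does not match the decoder state, which is how the Decoder gets to pick the relevant parts of the input at each step instead of relying on one fixed Context.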
