Tag: llm

Understanding Reasoning in Thinking Language Models via Steering Vectors – a summary and analysis

November 9, 2025January 4, 2026 · Leave a comment ·

The paper “Understanding Reasoning in Thinking Language Models via Steering Vectors” studies how to control specific reasoning behaviors in DeepSeek-R1-Distill models. It shows that behaviors like backtracking, stating uncertainty, testing examples, and adding extra knowledge can be tied to almost linear directions in the residual stream, and that adding or subtracting these directions at certain layers changes how often the behaviors appear, in a fairly causal way, across several R1-distill sizes and backbones. This is interesting as it allows a non-reasoning model to gain reasoning behaviors by the application of steering vectors.

The authors first build a dataset of 500 reasoning tasks across 10 categories such as math logic, spatial reasoning, causal reasoning, and probabilistic reasoning, generated with Claude 3.5 Sonnet, and then collect long reasoning chains from DeepSeek-R1-Distill models with greedy decoding and up to 1000 tokens per answer. They use GPT-4o to label spans in each chain with six tags: initializing, deduction, adding-knowledge, example-testing, uncertainty-estimation, and backtracking, sometimes splitting a sentence into several labeled pieces.

For each behavior and each transformer layer, they gather residual stream activations on tokens in spans with that behavior (plus the token just before the span), define a positive set of prompts that contain the behavior and a negative set as the full dataset, and compute a Difference-of-Means vector between these sets. They then rescale each candidate vector to have the same norm as the mean activation at that layer so that steering strength is comparable across behaviors and layers.

To find where the model actually uses these features, they apply an attribution patching style analysis: they approximate the effect on next-token KL divergence of adding the candidate vector at behavior tokens, and then average this effect across all examples. Plotting this per layer, they see clear peaks in middle layers for each behavior, while early layers have high overlap with token embeddings and so mostly encode lexical identity rather than reasoning behavior; they drop those early layers and select, for each behavior and model, the mid-layer with the largest causal score as the steering site.arxiv

At inference time, they steer by adding or subtracting the chosen vector at that layer at the relevant positions. On 50 new tasks, pushing the vector (positive steering) reliably increases the share of sentences labeled with the target behavior, while pulling it (negative steering) reduces that share, for backtracking, uncertainty, example-testing, and adding-knowledge. They show qualitative traces where more backtracking leads the model to abandon lines of thought and try new ones, while less backtracking makes it stay committed even when wrong. Cosine similarity studies suggest that most behavior vectors are fairly distinct, though uncertainty and backtracking directions are moderately aligned, matching the idea that the model often expresses doubt before switching approaches.

The main message is that these “thinking” behaviors in DeepSeek-R1-Distill are not just vague global traits but can be localized as steerable directions in activation space, giving a light-weight way to dial up or down aspects of reasoning style without retraining. The work is limited by noisy automatic labels, by focusing only on DeepSeek-R1 distills, and by leaving open whether similar mechanisms appear in other reasoning-tuned or RL-only models such as QwQ, but it offers a practical recipe and evidence that fine-grained reasoning control via activation steering is feasible.

Analysis: Reasoning steering is practical today at small-to-medium scale and in research settings, but is not yet a plug‑and‑play industry tool for production systems. It works reliably on specific models and behaviors (like DeepSeek-R1 distills and backtracking/uncertainty), but there are sharp engineering, robustness, and evaluation challenges that slow broader adoption.

In terms of practicality, the core operations—logging residual activations, computing Difference‑of‑Means vectors, and adding them at inference—are straightforward once you have: model weights, hooks into the residual stream, and a labeled dataset of behaviors. The R1 steering paper shows that with 500 tasks and automatic GPT‑4o labeling they can extract behavior vectors and steer backtracking/uncertainty/example‑testing with clear effect sizes across three architectures (Qwen‑1.5B, Qwen‑14B, Llama‑8B). Representation engineering and activation addition papers similarly report that simple linear vectors often suffice to change style, safety, or reasoning behaviors without fine‑tuning. In practice, this makes steering viable for labs that already run open models with custom inference stacks and can accept some extra forward/backward passes during analysis.

However, several reasons explain why this is not widely adopted in mainstream LLM products. First, infrastructure: most commercial inference stacks are optimized for pure forward passes and do not expose an easy or efficient way to hook and modify residuals token‑by‑token, especially at scale and on MoE/quantized kernels. Activation engineering is often described as “interesting but not yet scalable” for production inference. Second, generalization and brittleness: the R1 paper itself limits claims to selected reasoning behaviors on DeepSeek‑R1-Distill; it explicitly notes uncertainty in how results transfer to other models like QwQ or different RL‑trained reasoners. Many activation‑steering results show strong effects on curated benchmarks but less is known about behavior under distribution shift, long contexts, tool use, or adversarial prompts. Third, safety and product risk: steering can create subtle, hard‑to‑predict couplings (e.g., making the model more “uncertain” may alter refusal, verbosity, or calibration), and product teams usually prefer coarse, better‑understood levers (fine‑tuning, system prompts, decoding) that integrate cleanly with existing evals. Finally, developer ergonomics: designing, validating, and monitoring steering vectors requires interpretability and infra expertise; it is not yet a one‑line config option in popular serving stacks. lesswrong, representation engineering

The R1 steering paper and related work use several techniques to verify that steering is real and not just noise or prompt‑artifact. First, they use attribution patching as a causal test: for each candidate vector and layer, they estimate how much adding that vector changes the next‑token logit distribution via a KL‑based metric; layers with larger effects are taken as causally relevant, and they avoid early layers that correlate strongly with embeddings. Second, they run held‑out evaluations: after picking final behavior vectors and layers, they apply positive and negative steering on 50 unseen tasks and quantify changes in the fraction of sentences labeled as backtracking, uncertainty, example‑testing, or adding‑knowledge by an external annotator model (GPT‑4o). Positive steering increases the targeted behavior fraction; negative steering reduces it, with consistent trends across all three DeepSeek‑R1-Distill models. Third, they check representation geometry: cosine similarity matrices show that different behaviors mostly correspond to distinct directions (with moderate correlation between uncertainty and backtracking), supporting the claim that these are separate mechanisms rather than one generic “thinking” axis. Related representation‑engineering work also compares steered vs. unsteered models on downstream metrics (truthfulness, refusal rate, task score) and often measures KL divergence to ensure capability loss is limited. neelnanda, attribution patching

Several ablations or robustness checks appear across this line of work. In the R1 paper, they vary the layer index and use attribution scores as an “ablation over layers,” effectively showing that removing or modifying the vector at different layers changes the KL impact and identifying mid‑layers as critical for reasoning behaviors. They also evaluate the same behaviors on multiple DeepSeek‑R1-Distill sizes and backbones, which functions as an architectural ablation: similar steering effects across Qwen and Llama distills suggest the phenomenon is not a quirk of one model. In broader steering literature, ablations include: comparing simple Difference‑of‑Means vs. more advanced constructions like Contrastive Activation Addition; toggling steering only at certain positions (e.g., only at reasoning tokens vs. all tokens); mean‑centering vs. not; and scaling the vector magnitude to check monotonicity and detect regimes where steering starts to damage core accuracy. Some papers also run safety‑style ablations: does a refusal vector change helpfulness on benign queries, or does a truthfulness vector degrade performance on standard tasks.

Reasoning steering is already practically usable for research, bespoke agents, and specialized deployments where you control the stack and can afford custom hooks and extra evals. It is not yet widely deployed because the infra and methodology are still bespoke, robustness and transfer are incompletely understood, and product teams need standardized evaluations and safety guarantees that are only starting to emerge from this kind of work.

https://www.lesswrong.com/posts/3ghj8EuKzwD3MQR5G/an-introduction-to-representation-engineering-an-activation

https://www.neelnanda.io/mechanistic-interpretability/attribution-patching

https://aclanthology.org/2024.findings-emnlp.479.pdf

Chronological list of known learned representations (increasing date)

August 10, 2025January 4, 2026 · Leave a comment ·

Chronological list of known learned representations that were explicitly identified, named, and evidenced in a paper/post with reproducible analysis.

The representation basis answers “what algebra the model chooses to live in. The circuit answers “how the transformer computes in that algebra.”

First reported (approx)	Representation (what it is)	Where it shows up	Canonical reference	Importance & generality (researcher comment)
1996	Sparse / wavelet-like (Gabor-like) receptive-field bases	Unsupervised vision models learning efficient codes for natural images	Olshausen & Field, Nature 1996 (Courses at Washington University)	This is one of the earliest clean demonstrations that optimizing a simple objective (sparsity/efficient coding) yields structured bases resembling classical signal representations. It is highly general for natural-image statistics and still conceptually underlies why “edge-like” first-layer features are so universal.
2013 (Jan)	Linear semantic substructure in word-vector spaces (directions encode relations; analogies ≈ parallelograms)	Word embeddings from neural objectives	Mikolov et al. 2013 (word2vec) (arXiv) and Pennington et al. 2014 (GloVe explicitly discusses the analogy geometry) (Stanford NLP)	This made “distributed representations” operational: relations become approximately linear operators/directions. Generality is high across corpora and embedding methods, though the reliability of specific analogies varies and is not guaranteed by training.
2013–2014 (Nov → ECCV)	Early CNN layers learn oriented edge / color-opponency filters (Gabor-like)	Supervised convnets on natural images	Zeiler & Fergus visualization work (arXiv)	Important because it empirically tied deep vision features to classical linear-systems intuition: even with end-to-end supervision, the network “chooses” a near-optimal front-end basis for images. Very general across CNN families trained on natural images.
2014 (Oct)	Differentiable addressing representations (content- and location-based “attention” over external memory)	Memory-augmented networks	Graves et al., Neural Turing Machines (arXiv)	This is a representation of state and retrieval rather than of sensory input: key/value-like addressing emerges as a learnable interface between computation and storage. Generality is moderate: powerful, but most mainstream models replaced explicit external memory with transformer attention over context.
2015 (Nov)	Convolutional algorithmic state representations (Neural GPU learns internal states that generalize addition/multiplication to long lengths)	Algorithm learning on sequences	Kaiser & Sutskever, Neural GPUs Learn Algorithms (arXiv)	This is a landmark for “nets can learn algorithmic latent states,” not just pattern matching. Generality is medium: it works well for certain algorithmic tasks with the right inductive bias, but is not a universal recipe for systematic generalization.
2017 (Oct)	Capsule pose-vector representations (entity presence + instantiation parameters; routing groups parts into wholes)	Vision architectures emphasizing part–whole structure	Sabour et al., Dynamic Routing Between Capsules (arXiv)	Conceptually important: it proposes a factorized internal code (pose/part structure) rather than “bags of features.” Generality is debated in mainstream practice, but the representational idea is crisp and has influenced later equivariant and compositional approaches.
2018 (Mar)	Grid-like spatial codes (grid/border/band-cell-like units)	RNNs trained for path integration / navigation	Cueva & Wei 2018 (arXiv)	Very important scientifically: it shows a strong convergence between trained artificial networks and biological coding hypotheses. Generality is high within navigation/path-integration objectives; less directly portable to arbitrary domains.
2018 (Aug)	Explicit arithmetic representations via specialized units (linear codes + gated primitive ops)	Neural arithmetic modules	Trask et al., NALU (arXiv)	This line is important because it cleanly separates “representation of quantity” from “operators on quantities,” targeting extrapolation. Generality is medium: works best when the task truly factors into arithmetic primitives and the architecture is used appropriately.
2020 (Jun)	Fourier-feature positional encodings / spectral reparameterizations (map inputs through sinusoidal features to defeat spectral bias)	Implicit neural representations; MLPs for signals/scenes	Tancik et al., Fourier Features… (NeurIPS Papers)	Important as a unifying explanation for why plain MLPs underfit high frequencies and how a spectral basis fixes it. Generality is high for continuous regression/INR tasks; it is partly “designed,” but it formalizes the representational need very clearly.
2022 (Sep)	Induction-head representations (“copy-from-previous-match” algorithm; pointer-like behavior)	Transformers doing in-context learning / pattern completion	Olsson et al., In-context Learning and Induction Heads (arXiv)	This is one of the most important circuit-level representational discoveries in transformers: it identifies a reusable mechanism that looks like learned algorithmic pointer-chasing. Generality is high across autoregressive transformers and many ICL-like behaviors.
2022 (Sep)	Superposition of features (many sparse features packed into fewer dimensions; polysemanticity as a geometric tradeoff)	ReLU nets and plausibly large models	Elhage et al., Toy Models of Superposition (arXiv)	Foundational for interpretability: it reframes “neurons are messy” as “the representation is compressed and distributed by necessity.” Generality is extremely high—this is an architectural/optimization-level phenomenon, not a task-specific trick.
2023 (Jan)	Discrete Fourier Transform (DFT) / trig-identity representation for modular addition	Small transformers that grok modular arithmetic	Nanda et al., Progress measures for grokking via mechanistic interpretability (arXiv) (plus walkthrough (Neel Nanda))	The model represents elements in a Fourier basis where modular addition becomes phase addition/rotation. Importance is high as a proof-of-mechanism (nets rediscover classic algebraic representations). Generality is moderate: strongest for tasks with group structure (cyclic groups, convolutions, periodicity).
2023 (Mar–Sep)	Linear “world-state” representations in sequence models (latent state corresponds to board state; controllable by vector arithmetic)	Othello-GPT-style models	Nanda’s exposition (Neel Nanda) and the associated paper on emergent linear representations (arXiv)	Important because it shows a model trained only to predict tokens can learn an explicit internal state (a “world model”) that is linearly recoverable and causally editable. Generality is promising but not universal; it likely emerges when the task forces consistent latent state tracking.
2023 (Oct)	Feature dictionaries / “monosemantic” features via sparse autoencoders (dictionary learning on activations)	Mechanistic interpretability for transformers	Anthropic’s “Towards Monosemanticity” line (Anthropic)	This is less “the model’s native representation” and more “a recovered basis that better matches it,” but it’s crucial: it suggests models are organized around a large set of sparse features even when neurons are polysemantic mixtures. Generality is likely high, and it directly shapes practical interpretability workflows.
2024 (Feb, community analysis)	Chess/Othello-like linear world representations (extensions/replications)	Board-game GPTs; “world model” probing and interventions	Example community writeup (LessWrong)	This is a continuation/expansion of the 2023 world-representation finding. Importance depends on replication rigor, but it is part of the emerging picture that “latent-state tracking” is a common representational strategy in sequence models under the right data/task constraints.

Update: Some more interesting representations

1) Finite-state / automaton-like representations (regular languages)

Transformers trained on formal languages can end up simulating automata, and recent work explicitly extracts finite state machines from trained transformers to characterize what they learned. This is close to “boolean/bitmap logic” in that the latent state is discrete and transitions are rule-like. https://arxiv.org/pdf/2410.06045

2) Stack-like representations for parentheses / Dyck-style tasks

Balanced bracket classification tasks are widely used in mech-interp pedagogy because they pressure the model toward a latent “depth” or stack surrogate. In practice, small transformers often learn a distributed state that tracks nesting structure, sometimes in a way that can be probed linearly. https://arena-chapter1-transformer-interp.streamlit.app/%5B1.5.1%5D_Balanced_Bracket_Classifier

3) “World-state bitmaps” (board-state as a linear code)

In Othello-GPT-style settings, the residual stream contains a linearly recoverable encoding of the board. This is arguably a learned bitmap-like representation (one direction per square / feature), embedded in a continuous space. https://www.neelnanda.io/mechanistic-interpretability/othello

4) Group-operation representations beyond modular addition

A closely related line studies how small nets learn group composition more broadly (a “universality” testbed). This generalizes the “DFT for cyclic groups” story into a broader family of algebraic representations and circuits. https://openreview.net/pdf?id=jCOrkuUpss

5) Boolean satisfiability style reasoning (logical structure)

There is mechanistic-interpretability work on transformer-based models trained to solve 2-SAT, which is a canonical boolean-logic problem. This is a direct example of boolean structure expressed in transformer activations and circuits. https://arxiv.org/html/2407.13594v1

6) Induction / copy (pointer-style algorithm)

Not boolean algebra per se, but it is a very simple learned algorithmic representation: a head learns to represent and retrieve repeated patterns (“copy what followed last time”). This often coexists with more symbolic-feeling representations in toy tasks. https://arxiv.org/abs/2312.03002

Anthropic: Activations to Interpretable features with Monosemanticity

June 29, 2025November 16, 2025 · Leave a comment ·

The Anthropic papers “Towards monosemanticity” and “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” demonstrate how sparse autoencoders can extract interpretable features from large language models, converting polysemantic neuron activations into monosemantic representations that directly map to identifiable concepts and behaviors. In this writeup I try to and explain the core concepts in this research.

A sparse autoencoder is a neural network designed to learn a compact, interpretablerepresentation of input data by enforcing sparsity on its hidden layer activations. A sparse autoencoder is “sparse” because it applies a constraint during training so that, for any given input, only a small subset of the hidden (latent) units is active (nonzero). This is achieved by adding a sparsity penalty to the loss function, commonly L1 regularization or a KL-divergence term, which discourages most activations from deviating much from zero. This ensures the encoded representation is sparse—meaning only a few features are used to reconstruct the input—resulting in greater interpretability and the extraction of meaningful features. It is an “autoencoder” because the full model is trained end-to-end to reconstruct its own input. The encoder maps the input data to a latent code, and the decoder maps it back to the reconstruction. The central training objective is to minimize reconstruction error, making the network learn to reproduce its input as closely as possible. The difference from other autoencoder types (e.g., vanilla, denoising, variational) is specifically the addition of the sparsity constraint on the hidden code.

An activation is the output value of a neuron or unit in a neural network layer after applying an activation function to a weighted sum of inputs. Mathematically, for a neuron receiving inputs x1,x2,…,xnx1,x2,…,xn with weights w1,w2,…,wnw1,w2,…,wn, the activation is a=f(w1x1+w2x2+⋯+wnxn+b)a=f(w1x1+w2x2+⋯+wnxn+b), where ff is the activation function (such as ReLU, sigmoid, or tanh) and bb is a bias term.

The idea is to view activations as superpositions of underlying features and to use a neural network to reverse the mapping from the activations to the features. This is peering into the workings of an LLM with another neural network to see what the activations mean.

So in the monosemanticity quest, the activations are seen as a superposition of underlying features. A sparse autoencoder decomposes model activations into interpretable features by expressing each activation vector as a sparse linear combination of learned feature directions. Given an activation vector xjxj, the decomposition is:xj≈b+∑ifi(xj)dixj≈b+i∑fi(xj)di where fi(xj)fi(xj) is the activation (magnitude) of feature ii, didi is a unit vector representing the direction of feature ii in activation space, and bb is a bias term. The feature activations are computed by the encoder as fi(x)=ReLU(We(x−bd)+be)ifi(x)=ReLU(We(x−bd)+be)i, where WeWe is the encoder weight matrix and bdbd, bebe are pre-encoder and encoder biases. The feature directions are the columns of the decoder weight matrix WdWd. This formulation is dictionary learning: each activation is reconstructed from a sparse set of learned basis vectors scaled by their respective feature activations.

Acts is short for activations in the above figure of a sparse auto encoder functioning from Anthropic. .

Does the SAE look at all the activations or only certain layers ?

Sparse autoencoders are typically trained on activations from specific layers rather than all layers simultaneously. In practice, a separate SAE is trained for each layer or location in the model where one wishes to analyze or intervene on activations. In Anthropic’s “Scaling Monosemanticity” paper specifically, the SAE was trained only on activations from the residual stream at the middle layer (halfway through Claude 3 Sonnet). This choice was made for several reasons: the residual stream is smaller than the MLP layer, making training and inference computationally cheaper; focusing on the residual stream mitigates “cross-layer superposition,” which refers to neurons whose activations depend on combinations of information across multiple layers; and the middle layer likely contains more interesting and abstract features compared to early layers (which capture basic patterns) or final layers (which may be too task-specific).

Motivation and Definitions

Large language models (LLMs) typically exhibit polysemantic neurons, which activate in response to numerous, often unrelated, concepts, impeding interpretability and safe control.
Monosemanticity refers to representations where each learned feature corresponds to a single, easily identifiable concept, thus improving transparency in model operations.
Sparse autoencoders (SAEs) are employed to learn dictionary-like decompositions of hidden activations, aiming for each basis vector (feature) to align with a distinct semantic unit rather than mixed signals.

Methods and Techniques

The approach uses SAEs to project model activations into higher-dimensional, sparse spaces where individual features become interpretable.
Dictionary learning is central: activations from a given layer are encoded by the SAE so that each dictionary element ideally corresponds to a unique concept or pattern.
Anthropic scales this method from small, shallow models to large networks by training SAEs on billions of activations from state-of-the-art LLMs (e.g., Claude 3 Sonnet).
Modifying feature coefficients within the SAE’s learned space causes proportional, causal shifts in the model’s reconstructed activation, allowing direct steering of outputs at runtime.
Feature steering leverages these interpretable directions to alter specific model behaviors (e.g., changing model goals, tone, biases, or inducing controlled errors) by adjusting activation values during inference.

Results and Empirical Findings

The method yields dictionaries where a substantial portion of features (by human evaluation, approximately 70%) are monosemantic—associated with singular, nameable concepts such as DNA motifs or language script.
Quantitative validation includes human raters agreeing with feature names, decoder-row alignment (cosine similarity > 0.86 between encoder and decoder vectors), and strong compositionality in steering outcomes.
Scaling up the size of the SAE dictionary increases the proportion of monosemantic features and the precision of behavioral interventions.
Interventions using these features show robust control over model outputs, evidenced by targeted behavioral scores and ability to suppress or augment specific behaviors with tunable steering coefficients.

Conceptual Advances

The work empirically supports the superposition hypothesis: raw neurons entangle multiple meanings, but sparse dictionary learning untangles these into separately addressable features.
The method demonstrates that high-dimensional, sparsely coded representations can be extracted at scale without significant algorithmic changes, opening new paths for mechanistic interpretability and control tools in LLMs.
These advances suggest dictionary learning could, in future, replace large fine-tuning campaigns for behavioral adjustments, increase safety monitoring, and allow new forms of user-customized steering.

Activation Steering and Implications

Steering methods operate by selecting, amplifying, or suppressing identified sparse features using signed, tunable coefficients (λλ), with each adjustment reflected directly and causally in output behavior.
The process is mathematically tractable because the SAE remains linear; interventions can be analyzed for causal effects and compositional interactions, which is not feasible in the dense activation spaces of standard LLMs.
This enables multifaceted interventions and targeted control: steering vectors can increase or decrease model propensities for specific behaviors, factuality, style, or compliance in a transparent manner.

Summary Table: Key Terms

Term	Definition
Polysemantic neuron	Neural unit that activates for multiple, unrelated concepts
Monosemantic feature	Basis vector representing a single interpretable concept
Sparse autoencoder	Neural model learning an overcomplete, interpretable dictionary
Dictionary learning	Decomposition of activations into a set of sparse, meaningful vectors
Activation	Output value of a neuron or unit in a neural network layer after applying an activation function to a weighted sum of inputs
Activation steering	Modifying activations using interpretable features to control outputs

This research establishes scalable techniques for extracting and manipulating interpretable features in large LLMs, enabling precise behavioral steering and laying groundwork for safer, more controllable AI deployments.

The sparse autoencoder (SAE) in Anthropic’s “Scaling Monosemanticity” paper was trained at three different scales on activations from Claude 3 Sonnet: approximately 1 million (1,048,576), 4 million (4,194,304), and 34 million (33,554,432) features. For the largest run, the 34M-feature SAE, the number of active (nonzero) features for any given token was typically fewer than 300, showing high sparsity.

The paper emphasizes that many extracted features are relevant to AI safety, such as features for security vulnerabilities, code backdoors, bias (overt and subtle), deception (including power-seeking and treacherous turns), sycophancy, and the generation of dangerous or criminal content. However, the authors note that the detection of such features is preliminary and should not be over-interpreted: knowing about harmful behaviors is distinct from enacting them. The presence of potentially dangerous features suggests the model could represent these concepts internally, warranting deeper investigation. The interpretability gained through the SAE allows for the identification and possible intervention on such features but does not automatically ensure safe model behavior without further work and robust evaluation.

The authors compare their feature-extraction approach to previous interpretability and model-steering methods:

Unlike neuron-centric methods, which often yield tangled, polysemantic activations, SAEs learn overcomplete, sparse dictionaries that approximate monosemantic features.
Their approach leverages scaling laws to optimize both the number of features and training steps, showing that larger SAEs provide more granular, precise, and interpretable decompositions than smaller or denser models.
The SAE-based approach allows for explicit, steerable interventions by clamping or zeroing specific features, something not possible with conventional dense neuron manipulation.
The paper positions this technique as extensible, mechanistically transparent, and a foundation for scalable model interpretability—offering capabilities not matched by most prior strategies.

These results highlight that scalable, sparse autoencoders produce directly actionable, interpretable features offering new tools for AI safety and more precise model control compared to traditional neuron or layerwise interpretability approaches.

An argument on the urgency of interpretability: https://www.darioamodei.com/post/the-urgency-of-interpretability

Neel Nanda’s replication of results has a notebook for going deeper. https://www.alignmentforum.org/posts/fKuugaxt2XLTkASkk/open-source-replication-and-commentary-on-anthropic-s

vLLM project – overview, comparisons, PagedAttention mechanism

September 29, 2024February 10, 2025 · Leave a comment ·

The vLLM project is an open-source venture designed to enhance the efficiency and scalability of serving Large Language Models (LLMs). Developed by researchers at UC Berkeley, vLLM aims to improve the performance of LLM inference by optimizing memory management and execution. It offers a system that reduces latency and increases throughput for LLMs, making it a valuable tool for deploying these models more effectively in various applications. It supports multiple LLM model types, multiple hardware architectures, and multiple optimization techniques. It is described in this paper, on Efficient LLM serving with PagedAttention.

vLLM achieves its improvements through

dynamic batching,
efficient memory usage, and
parallel execution strategies.

These features allow it to handle multiple requests simultaneously without sacrificing speed or accuracy.

By making LLMs more accessible and efficient, vLLM helps lower the barriers to using advanced AI models, facilitating broader adoption and innovation in the field of natural language processing. For more detailed information or to contribute to the project, you can explore its repository on platforms like GitHub.

vLLM, NVIDIA Triton Inference Server, and NVIDIA NeMo (formerly known as NVIDIA NIM) are all designed to improve the deployment and performance of machine learning models, but they have different focuses and functionalities. Here’s a comparison of each:

vLLM

Purpose: Optimizes the serving of Large Language Models (LLMs) with a focus on improving inference efficiency, particularly regarding memory management and execution.
Features: Offers dynamic batching, efficient memory usage, and parallel execution strategies specifically for LLMs, enhancing latency and throughput.
Use Cases: Best suited for applications requiring fast, efficient LLM inference, such as AI-driven conversational agents.
How it reduces memory waste and improves utilization with PagedAttention – https://blog.runpod.io/introduction-to-vllm-and-how-to-run-vllm-on-runpod-serverless/

NVIDIA Triton Inference Server

Purpose: A scalable and flexible platform for serving different types of machine learning models across a variety of frameworks and hardware architectures.
Features: Supports multiple model frameworks (e.g., TensorFlow, PyTorch, ONNX), dynamic batching, model versioning, and provides both HTTP/REST and gRPC endpoints for inference requests. It is designed to maximize GPU utilization and streamline inference workflows.
Use Cases: Ideal for deploying diverse AI models in production environments, allowing for efficient inference at scale across CPUs and GPUs.

NVIDIA NeMo

Purpose: A toolkit for building, training, and fine-tuning state-of-the-art conversational AI models, including those for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS).
Features: Provides pre-trained models, model architectures, and training scripts that can be customized and extended for specific tasks. NeMo is designed to facilitate the development of AI models with high accuracy and efficiency.
Use Cases: Suitable for developers and researchers focused on building and customizing conversational AI applications, offering extensive support for research and development in speech and language domains.

Comparison summary

Optimization Focus: vLLM is specialized for LLM inference optimization, NVIDIA Triton is a general-purpose inference server supporting various models and frameworks, and NVIDIA NeMo is focused on developing and customizing conversational AI models.
Hardware and Framework Support: Triton supports a wide range of frameworks and hardware, optimizing inference across diverse environments. NeMo, while capable of leveraging NVIDIA’s hardware optimizations, is more focused on the model training and customization aspect, particularly for conversational AI.
Target Audience: vLLM targets developers needing efficient LLM deployment; Triton appeals to teams deploying a variety of models in scalable production settings; NeMo is aimed at researchers and developers building state-of-the-art conversational systems.

Details of vLLM PagedAttention.

What Are Keys and Values in PagedAttention?

In the context of transformer-based Large Language Models (LLMs), keys (K) and values (V) are components of the attention mechanism used during inference.

Keys (K): Represent encoded representations of previous tokens, used to determine how much attention each token should pay to previous tokens.
Values (V): Contain the actual information used to generate the next token, weighted based on attention scores.

PagedAttention manages these key-value (KV) caches efficiently to store past token embeddings so the model doesn’t have to recompute them in every step, drastically speeding up inference.

Concrete Example: Key-Value Pairs in Action

Let’s take a simple example where an LLM is generating text based on a prompt.

Example Prompt:

User: "The capital of France is"

Tokenized Version (Using Byte-Pair Encoding or SentencePiece):

["The", "capital", "of", "France", "is"]

Each token gets embedded into a high-dimensional space (e.g., 4096 dimensions for LLaMA-2-70B). Let’s assume we use 4096-dimension embeddings for simplicity.

Step-by-Step Key-Value Storage

The model encodes each token and stores:
- Key (K): A vector that helps determine how relevant this token is in future attention computations.
- Value (V): The actual contextual representation of the token.

Token	Key (K) (Simplified)	Value (V) (Simplified)
“The”	`[0.1, 0.2, -0.3, ...]`	`[0.5, 0.4, -0.1, ...]`
“capital”	`[0.2, 0.3, 0.1, ...]`	`[0.6, 0.2, -0.3, ...]`
“of”	`[-0.1, 0.2, 0.7, ...]`	`[0.2, 0.1, 0.9, ...]`
“France”	`[0.5, -0.2, 0.1, ...]`	`[0.7, 0.3, -0.2, ...]`
“is”	`[0.3, 0.1, 0.4, ...]`	`[0.8, 0.2, -0.5, ...]`

When generating the next token (“Paris”), the model:
- Computes attention scores between “Paris” and all previous tokens using dot product of queries (Q) and keys (K).
- Uses the weighted sum of values (V) to form the new representation.
Instead of recomputing attention from scratch, PagedAttention retrieves precomputed (K, V) values from memory pages for fast lookup.

How PagedAttention Optimizes Key-Value Caching

Without PagedAttention: Each request would store KV pairs in one long, contiguous memory buffer. If a request finishes early, the allocated space is wasted.
With PagedAttention: KV pairs are stored in small pages (e.g., chunks of 16 tokens), allowing efficient reuse and minimizing fragmentation.

LLM evolution – Anthropic , AI21, Cohere, GPT-4

May 14, 2023June 21, 2023 · Leave a comment ·

https://github.com/Mooler0410/LLMsPracticalGuide

Source paper – Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Pink branch is encoder only. Green branch is encoder-decoder. Blue branch is decoder-only.

This is consistent with the Generative aspect of the blue branch. But it does not explain the emergent properties at the top of the blue tree.

LLM leaderboard – https://chat.lmsys.org/?leaderboard

Stanford HELM (holistic evaluation of LMs) – https://crfm.stanford.edu/helm/latest/?models=1

Constitutional AI paper from Anthropic – https://arxiv.org/abs/2212.08073

More on emergent properties in links below.

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1

https://openai.com/research/solving-math-word-problems : Autoregressive models, which generate each solution token by token, have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable, as can be seen in the examples provided. We address this problem by training verifiers to evaluate the correctness of model-generated solutions. Verifiers are given many possible solutions, all written by the model itself, and they are trained to decide which ones, if any, are correct.

Language Models are Few-Shot Learners – https://openai.com/research/language-models-are-few-shot-learners

LLM inferencing tools/techniques were discussed here.

Secure Machinery

On the evolution of security and intelligent machinery