Tag: mechanistic interpretability

Neurosymbolic reasoning

Neurosymbolic reasoning combines two different kinds of computation. A neural network is a learned function, usually trained from data, that maps inputs into vectors and uses those vectors to predict useful outputs. It is good at pattern recognition, fuzzy matching, language, perception, and guessing promising next steps. A symbolic system is a system that manipulates explicit objects such as rules, formulas, programs, graphs, constraints, proofs, or search states. It is good at exactness: a proof step is valid or invalid; a SAT assignment satisfies a formula or it does not; a type checker accepts a program or rejects it.

A GNN, or graph neural network, is a neural network designed for data represented as graphs. A graph has nodes and edges. In SAT, for example, variables and clauses can be represented as nodes, and edges connect variables to the clauses in which they appear. This makes GNNs relevant because many formal problems are not naturally sequences of words; they are structured objects. Code has ASTs, control-flow graphs, and data-flow graphs. SAT formulas have variable-clause graphs. Theorems have dependency graphs of definitions and lemmas. A GNN can learn which parts of such a structure look important.
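
To make the variable-clause graph concrete, here is a minimal sketch in plain Python (no GNN library); the clause encoding follows the common DIMACS convention, and the helper name is only illustrative.

```python
# Build a bipartite variable-clause graph for SAT.
# Clauses use DIMACS-style integers: -3 means "not x3".

def cnf_to_graph(clauses):
    """Return the variables and the bipartite variable-clause edges."""
    edges = []
    variables = set()
    for clause_idx, clause in enumerate(clauses):
        for literal in clause:
            var = abs(literal)            # the variable behind the literal
            variables.add(var)
            edges.append((f"x{var}", f"c{clause_idx}"))
    return sorted(variables), edges

# (x1 or not x2) and (x2 or x3)
variables, edges = cnf_to_graph([[1, -2], [2, 3]])
print(variables)  # [1, 2, 3]
print(edges)      # [('x1', 'c0'), ('x2', 'c0'), ('x2', 'c1'), ('x3', 'c1')]
```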

The core idea of neurosymbolic reasoning is simple: the neural part proposes, ranks, translates, or guides; the symbolic part represents, executes, constrains, or verifies. The neural system is allowed to be approximate because its output is not trusted directly. The symbolic system supplies the hard boundary between plausible and correct. In code, an LLM may propose a patch, but the compiler, tests, static analyzer, or verifier decide whether the patch is acceptable. In SAT, a neural model may suggest which variable to branch on, but the SAT solver performs exact search and a proof checker verifies the result. In theorem proving, a model may suggest a Lean tactic, but Lean checks whether the proof step is valid.
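
A minimal sketch of that propose-then-verify loop for code repair, assuming hypothetical `propose_patch` (the neural proposer) and `run_tests` (the symbolic checker) callables supplied by the caller; neither is a real API.

```python
# The neural proposer is untrusted; only output that the symbolic checker
# accepts is ever returned.

def neurosymbolic_repair(propose_patch, run_tests, source, failing_test, max_attempts=5):
    """propose_patch and run_tests are caller-supplied stand-ins for a
    neural model and an exact checker (compiler, test suite, verifier)."""
    for _ in range(max_attempts):
        candidate = propose_patch(source, failing_test)  # neural: approximate guess
        if run_tests(candidate):                          # symbolic: exact accept/reject
            return candidate                              # only verified output escapes the loop
    return None                                           # no verified fix found
```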

The reason this combination matters is that pure neural systems and pure symbolic systems have opposite strengths. Neural systems handle ambiguity and large messy inputs, but they can hallucinate or skip conditions. Symbolic systems are exact and compositional, but they often face enormous search spaces and require humans to formalize the problem precisely. Neurosymbolic reasoning is useful when a problem is both messy and exact: messy enough that learned guidance helps, but exact enough that unchecked guesses are dangerous.

Formal methods are a central use case because their hard part is usually not checking a finished proof. The hard part is finding the right specification, invariant, lemma, induction variable, proof tactic, or decomposition. A proof assistant can mechanically verify a proof, but it may not know which theorem to apply next. A SAT solver can prove unsatisfiability, but it may drown in bad branching choices. A verifier can check loop invariants, but someone must often invent those invariants. Neural networks help by searching this space of useful intermediate ideas.

SAT illustrates the division cleanly. For a satisfiable formula, the certificate is an assignment of truth values that makes every clause true. For an unsatisfiable formula, the certificate is a proof of contradiction, often in a format such as a resolution-style proof. A neural network can suggest a promising variable, a likely unsat core, or a useful learned clause. But the final answer must still be checked by a symbolic solver or proof checker. The neural model does not make SAT “true” or “false”; it helps navigate the search.
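
Checking the satisfiable-side certificate is a small, purely symbolic computation. A sketch using DIMACS-style integer literals (3 means x3 is true, -3 means x3 is false):

```python
def check_assignment(clauses, assignment):
    """Return True iff `assignment` (dict var -> bool) makes every clause true."""
    for clause in clauses:
        if not any(assignment[abs(lit)] == (lit > 0) for lit in clause):
            return False   # this clause has no true literal
    return True

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
clauses = [[1, -2], [2, 3], [-1, -3]]
print(check_assignment(clauses, {1: True, 2: True, 3: True}))   # False: last clause fails
print(check_assignment(clauses, {1: True, 2: True, 3: False}))  # True: a valid certificate
```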

This does not mean neural SAT solvers are on a direct path to proving P = NP. P = NP would require a uniform polynomial-time method for solving every SAT instance in the worst case. Neural guidance can make many real instances dramatically easier, especially when they contain recurring structure from hardware, software, scheduling, planning, or verification problems. But worst-case SAT includes adversarial formulas designed to defeat heuristics. Better guidance can move the practical frontier without changing the worst-case complexity frontier.

The deeper promise is not that neural networks replace logic. The promise is that they make symbolic reasoning more usable. They can translate informal intent into candidate formal specifications, suggest missing invariants, rank lemmas, choose solver strategies, identify useful graph structure, and repair failed proof attempts. The symbolic system then checks whether these guesses actually satisfy the rules. This gives a division of labor: neural networks provide search intelligence; symbolic systems provide correctness.

There are two meanings of “neurosymbolic” that should be kept separate. The first is external neurosymbolic reasoning, where a neural model calls or guides explicit tools such as SAT solvers, proof assistants, compilers, planners, or databases. This is the practical and trustworthy version because the symbolic tool can reject invalid output. The second is internal symbolic representation, where researchers ask whether neural networks themselves learn vector representations that behave like variables, rules, types, objects, or relations. That is important for interpretability, but it is harder to trust because the “symbols” are implicit and distributed inside activations.

The main risk is not usually that the symbolic checker accepts an invalid proof. A good checker should catch that. The larger risk is proving or checking the wrong thing. A program can be verified against an incomplete specification. A sorter can be proved to return an ordered list while still dropping all input elements. A security function can be proved to require authentication while forgetting tenant isolation. Neurosymbolic systems therefore still depend on good specifications, not just good proof search.

Neural networks are useful for proposing promising moves in huge ambiguous spaces; symbolic systems are useful for exact manipulation and verification; neurosymbolic reasoning connects them by letting neural models guide search while symbolic tools enforce correctness; GNNs are relevant because many formal objects are graphs rather than plain text. The frontier is making formal reasoning scalable by wrapping exact checkers in learned search, translation, and repair loops.

Recent work has used graph neural networks to predict branching orders or guide branching decisions. One 2026 paper studies GNN-predicted initial branching orders for CDCL solvers, while earlier NeuroBack-style work plugged a neural heuristic into Kissat and reported solving more SAT Competition problems than the base solver on SATCOMP-2022 and SATCOMP-2023 sets.

Anthropic: Activations to Interpretable Features with Monosemanticity

The Anthropic papers “Towards Monosemanticity” and “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” demonstrate how sparse autoencoders can extract interpretable features from large language models, converting polysemantic neuron activations into monosemantic representations that map directly to identifiable concepts and behaviors. In this writeup I try to explain the core concepts in this research.

A sparse autoencoder is a neural network designed to learn a compact, interpretable representation of input data by enforcing sparsity on its hidden layer activations. It is “sparse” because a constraint applied during training ensures that, for any given input, only a small subset of the hidden (latent) units is active (nonzero). This is achieved by adding a sparsity penalty to the loss function, commonly an L1 regularization or KL-divergence term, which discourages most activations from deviating far from zero. The encoded representation is therefore sparse: only a few features are used to reconstruct the input, which yields greater interpretability and the extraction of meaningful features. It is an “autoencoder” because the full model is trained end-to-end to reconstruct its own input: the encoder maps the input to a latent code, and the decoder maps the code back to a reconstruction. The central training objective is to minimize reconstruction error, so the network learns to reproduce its input as closely as possible. What distinguishes it from other autoencoder types (e.g., vanilla, denoising, variational) is specifically the sparsity constraint on the hidden code.
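
A minimal sketch of that objective in PyTorch: reconstruction error plus an L1 penalty on the hidden code. The dimensions and penalty weight are illustrative, not taken from any particular paper.

```python
import torch

x = torch.randn(32, 512)                            # a batch of inputs (e.g. activations)
W_enc = torch.randn(512, 4096, requires_grad=True)  # encoder weights
W_dec = torch.randn(4096, 512, requires_grad=True)  # decoder weights

code = torch.relu(x @ W_enc)                        # sparse hidden code
recon = code @ W_dec                                # reconstruction of the input
l1_weight = 1e-3                                    # strength of the sparsity penalty
loss = ((recon - x) ** 2).mean() + l1_weight * code.abs().mean()
loss.backward()                                     # gradients for W_enc and W_dec
```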

An activation is the output value of a neuron or unit in a neural network layer after applying an activation function to a weighted sum of inputs. Mathematically, for a neuron receiving inputs x_1, x_2, …, x_n with weights w_1, w_2, …, w_n, the activation is a = f(w_1·x_1 + w_2·x_2 + ⋯ + w_n·x_n + b), where f is the activation function (such as ReLU, sigmoid, or tanh) and b is a bias term.

The idea is to view activations as superpositions of underlying features and to use a neural network to reverse the mapping from the activations to the features. This is peering into the workings of an LLM with another neural network to see what the activations mean.

So in the monosemanticity quest, the activations are seen as a superposition of underlying features. A sparse autoencoder decomposes model activations into interpretable features by expressing each activation vector as a sparse linear combination of learned feature directions. Given an activation vector x_j, the decomposition is x_j ≈ b + Σ_i f_i(x_j) d_i, where f_i(x_j) is the activation (magnitude) of feature i, d_i is a unit vector representing the direction of feature i in activation space, and b is a bias term. The feature activations are computed by the encoder as f_i(x) = ReLU(W_e(x − b_d) + b_e)_i, where W_e is the encoder weight matrix and b_d, b_e are the pre-encoder and encoder biases. The feature directions are the columns of the decoder weight matrix W_d. This formulation is dictionary learning: each activation is reconstructed from a sparse set of learned basis vectors scaled by their respective feature activations.
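
A sketch of this decomposition as a PyTorch module, following the equations above (pre-encoder bias b_d, encoder bias b_e, decoder columns as feature directions d_i); the sizes are toy values, and this is not Anthropic’s actual training code.

```python
import torch
import torch.nn as nn

class DictionarySAE(nn.Module):
    def __init__(self, activation_dim, num_features):
        super().__init__()
        self.W_e = nn.Linear(activation_dim, num_features, bias=False)  # encoder weights
        self.b_e = nn.Parameter(torch.zeros(num_features))              # encoder bias
        self.W_d = nn.Linear(num_features, activation_dim, bias=False)  # columns = feature directions d_i
        self.b_d = nn.Parameter(torch.zeros(activation_dim))            # pre-encoder / decoder bias

    def forward(self, x):
        f = torch.relu(self.W_e(x - self.b_d) + self.b_e)  # f_i(x): feature activations
        x_hat = self.b_d + self.W_d(f)                      # x ≈ b + Σ_i f_i(x) d_i
        return x_hat, f

sae = DictionarySAE(activation_dim=512, num_features=4096)
x = torch.randn(8, 512)                        # a batch of activations (toy sizes)
x_hat, f = sae(x)
print(f.shape, (f > 0).float().mean().item())  # feature activations and fraction active
```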

In Anthropic’s figure of a sparse autoencoder at work, “acts” is short for activations.

Does the SAE look at all the activations or only certain layers?

Sparse autoencoders are typically trained on activations from specific layers rather than all layers simultaneously. In practice, a separate SAE is trained for each layer or location in the model where one wishes to analyze or intervene on activations.​ In Anthropic’s “Scaling Monosemanticity” paper specifically, the SAE was trained only on activations from the residual stream at the middle layer (halfway through Claude 3 Sonnet). This choice was made for several reasons: the residual stream is smaller than the MLP layer, making training and inference computationally cheaper; focusing on the residual stream mitigates “cross-layer superposition,” which refers to neurons whose activations depend on combinations of information across multiple layers; and the middle layer likely contains more interesting and abstract features compared to early layers (which capture basic patterns) or final layers (which may be too task-specific).
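
In code, training on a single location usually just means capturing activations at one module with a forward hook and fitting the SAE on what was captured. A sketch on a stand-in model; in practice the hook would sit on the residual stream of the chosen transformer block.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),   # pretend this is the "middle layer"
    nn.Linear(64, 64),
)

captured = []

def save_activations(module, inputs, output):
    captured.append(output.detach())      # stash this layer's output for SAE training

middle_layer = model[2]                   # hook only the location we care about
handle = middle_layer.register_forward_hook(save_activations)

with torch.no_grad():
    model(torch.randn(16, 64))            # run some data through the model

handle.remove()
print(captured[0].shape)                  # torch.Size([16, 64]): activations for the SAE
```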

Motivation and Definitions

  • Large language models (LLMs) typically exhibit polysemantic neurons, which activate in response to numerous, often unrelated, concepts, impeding interpretability and safe control.
  • Monosemanticity refers to representations where each learned feature corresponds to a single, easily identifiable concept, thus improving transparency in model operations.
  • Sparse autoencoders (SAEs) are employed to learn dictionary-like decompositions of hidden activations, aiming for each basis vector (feature) to align with a distinct semantic unit rather than mixed signals.

Methods and Techniques

  • The approach uses SAEs to project model activations into higher-dimensional, sparse spaces where individual features become interpretable.
  • Dictionary learning is central: activations from a given layer are encoded by the SAE so that each dictionary element ideally corresponds to a unique concept or pattern.
  • Anthropic scales this method from small, shallow models to large networks by training SAEs on billions of activations from state-of-the-art LLMs (e.g., Claude 3 Sonnet).
  • Modifying feature coefficients within the SAE’s learned space causes proportional, causal shifts in the model’s reconstructed activation, allowing direct steering of outputs at runtime.
  • Feature steering leverages these interpretable directions to alter specific model behaviors (e.g., changing model goals, tone, biases, or inducing controlled errors) by adjusting activation values during inference; a minimal sketch follows below.
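
A minimal sketch of such an intervention, reusing the DictionarySAE sketched earlier: encode an activation, clamp one feature’s coefficient to a chosen value, and decode. The feature index and steering coefficient are arbitrary.

```python
import torch

def steer(sae, x, feature_idx, coefficient):
    """Encode, clamp one feature's activation, and decode the steered result."""
    with torch.no_grad():
        f = torch.relu(sae.W_e(x - sae.b_d) + sae.b_e)  # feature activations f_i(x)
        f[:, feature_idx] = coefficient                  # clamp one interpretable feature
        return sae.b_d + sae.W_d(f)                      # steered activation, fed back into the model

# e.g. steered = steer(sae, residual_activations, feature_idx=123, coefficient=10.0)
```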

Results and Empirical Findings

  • The method yields dictionaries where a substantial portion of features (by human evaluation, approximately 70%) are monosemantic—associated with singular, nameable concepts such as DNA motifs or language script.
  • Quantitative validation includes human raters agreeing with feature names, decoder-row alignment (cosine similarity > 0.86 between encoder and decoder vectors), and strong compositionality in steering outcomes.
  • Scaling up the size of the SAE dictionary increases the proportion of monosemantic features and the precision of behavioral interventions.
  • Interventions using these features show robust control over model outputs, evidenced by targeted behavioral scores and ability to suppress or augment specific behaviors with tunable steering coefficients.

Conceptual Advances

  • The work empirically supports the superposition hypothesis: raw neurons entangle multiple meanings, but sparse dictionary learning untangles these into separately addressable features.
  • The method demonstrates that high-dimensional, sparsely coded representations can be extracted at scale without significant algorithmic changes, opening new paths for mechanistic interpretability and control tools in LLMs.
  • These advances suggest dictionary learning could, in future, replace large fine-tuning campaigns for behavioral adjustments, increase safety monitoring, and allow new forms of user-customized steering.

Activation Steering and Implications

  • Steering methods operate by selecting, amplifying, or suppressing identified sparse features using signed, tunable coefficients (λ), with each adjustment reflected directly and causally in output behavior.
  • The process is mathematically tractable because the SAE remains linear; interventions can be analyzed for causal effects and compositional interactions, which is not feasible in the dense activation spaces of standard LLMs.
  • This enables multifaceted interventions and targeted control: steering vectors can increase or decrease model propensities for specific behaviors, factuality, style, or compliance in a transparent manner.

Summary Table: Key Terms

Term | Definition
Polysemantic neuron | Neural unit that activates for multiple, unrelated concepts
Monosemantic feature | Basis vector representing a single interpretable concept
Sparse autoencoder | Neural model learning an overcomplete, interpretable dictionary
Dictionary learning | Decomposition of activations into a set of sparse, meaningful vectors
Activation | Output value of a neuron or unit in a neural network layer after applying an activation function to a weighted sum of inputs
Activation steering | Modifying activations using interpretable features to control outputs

This research establishes scalable techniques for extracting and manipulating interpretable features in large LLMs, enabling precise behavioral steering and laying groundwork for safer, more controllable AI deployments.

The sparse autoencoder (SAE) in Anthropic’s “Scaling Monosemanticity” paper was trained at three different scales on activations from Claude 3 Sonnet: approximately 1 million (1,048,576), 4 million (4,194,304), and 34 million (33,554,432) features. For the largest run, the 34M-feature SAE, the number of active (nonzero) features for any given token was typically fewer than 300, showing high sparsity.

The paper emphasizes that many extracted features are relevant to AI safety, such as features for security vulnerabilities, code backdoors, bias (overt and subtle), deception (including power-seeking and treacherous turns), sycophancy, and the generation of dangerous or criminal content. However, the authors note that the detection of such features is preliminary and should not be over-interpreted: knowing about harmful behaviors is distinct from enacting them. The presence of potentially dangerous features suggests the model could represent these concepts internally, warranting deeper investigation. The interpretability gained through the SAE allows for the identification and possible intervention on such features but does not automatically ensure safe model behavior without further work and robust evaluation.

The authors compare their feature-extraction approach to previous interpretability and model-steering methods:

  • Unlike neuron-centric methods, which often yield tangled, polysemantic activations, SAEs learn overcomplete, sparse dictionaries that approximate monosemantic features.
  • Their approach leverages scaling laws to optimize both the number of features and training steps, showing that larger SAEs provide more granular, precise, and interpretable decompositions than smaller or denser models.
  • The SAE-based approach allows for explicit, steerable interventions by clamping or zeroing specific features, something not possible with conventional dense neuron manipulation.
  • The paper positions this technique as extensible, mechanistically transparent, and a foundation for scalable model interpretability—offering capabilities not matched by most prior strategies.

These results highlight that scalable, sparse autoencoders produce directly actionable, interpretable features offering new tools for AI safety and more precise model control compared to traditional neuron or layerwise interpretability approaches.

An argument on the urgency of interpretability: https://www.darioamodei.com/post/the-urgency-of-interpretability

Neel Nanda’s replication of these results includes a notebook for going deeper: https://www.alignmentforum.org/posts/fKuugaxt2XLTkASkk/open-source-replication-and-commentary-on-anthropic-s