Toy models of superposition – Anthropic paper summary

The paper studies a deliberately simple model in order to isolate one question: how many features a network can represent when space is limited. The authors use a small ReLU network trained on synthetic data constructed from independent features. There are n possible features, but each data point activates only a sparse subset of them, with sparsity controlled explicitly. The network compresses these inputs into a lower-dimensional hidden space of size m and then reconstructs or predicts from that compressed representation. The ReLU nonlinearity is essential, because it can act as a gate that suppresses inactive features and reduces interference when multiple features are mapped into the same dimension.
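A minimal sketch of this setup in PyTorch may help make it concrete. The dimensions, sparsity level, and the omission of the paper's per-feature importance weighting are my own simplifications, not the paper's exact configuration:

```python
import torch

def sample_batch(batch_size: int, n: int, sparsity: float) -> torch.Tensor:
    """Synthetic data: each of the n features is independently zero with
    probability `sparsity`, otherwise uniform in [0, 1]."""
    values = torch.rand(batch_size, n)
    active = torch.rand(batch_size, n) >= sparsity
    return values * active

class ToyModel(torch.nn.Module):
    """Compress n features into m < n hidden dimensions with a linear map W,
    then reconstruct with the transposed map, a bias, and a ReLU:
    x_hat = ReLU(W^T W x + b)."""
    def __init__(self, n: int, m: int):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(m, n) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W.T                         # (batch, m): compressed representation
        return torch.relu(h @ self.W + self.b)   # (batch, n): reconstruction
```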

A simple example, where five features are forced into two dimensions, illustrates the central phenomenon. When features are dense, the model behaves like PCA and keeps only the most important directions while discarding the rest. When features are sparse, the model instead preserves more features by allowing them to coexist in the same dimensions. This coexistence is what the paper calls superposition.
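Continuing the sketch above, one illustrative way to see this contrast (with hyperparameters chosen arbitrarily, not taken from the paper) is to train the five-into-two model at both extremes and compare how many feature directions in W end up with non-negligible length:

```python
def train(model: "ToyModel", sparsity: float,
          steps: int = 10_000, batch_size: int = 1024) -> "ToyModel":
    """Plain reconstruction training at a fixed feature sparsity."""
    n = model.W.shape[1]                     # number of features the model reconstructs
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        x = sample_batch(batch_size, n, sparsity)
        loss = ((x - model(x)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model

for sparsity in (0.0, 0.99):                 # dense vs. very sparse features
    model = train(ToyModel(n=5, m=2), sparsity)
    # Length of each feature's learned direction: roughly two clearly nonzero
    # in the dense case (PCA-like), more than two in the sparse case (superposition).
    print(sparsity, model.W.detach().norm(dim=0))
```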

The authors show that superposition is not a gradual effect but appears as a regime change. As feature sparsity increases, or as the ratio between features and dimensions changes, the optimal solution flips. In one regime, a small number of features are represented in nearly orthogonal directions, while others are dropped. In the other regime, many features are represented in fewer dimensions, with interference that remains tolerable because features are rarely active at the same time. The paper argues that this transition follows directly from the optimization problem and should be expected whenever sparse features compete for limited representational capacity.
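A crude proxy for this transition, using the same sketch (the paper itself uses a more careful per-feature dimensionality measure), is to sweep sparsity at fixed n and m and count how many features retain a sizeable direction:

```python
def represented_features(model: "ToyModel", threshold: float = 0.5) -> int:
    """Count feature directions in W whose norm exceeds a (somewhat arbitrary) threshold."""
    return int((model.W.detach().norm(dim=0) > threshold).sum())

n, m = 20, 5
for sparsity in (0.0, 0.5, 0.9, 0.99, 0.999):
    model = train(ToyModel(n=n, m=m), sparsity)
    print(f"sparsity={sparsity}: {represented_features(model)} of {n} features represented")
# Expected pattern: about m features at low sparsity, then a jump well past m
# once features are rarely active together and interference becomes cheap.
```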

The geometry of these representations is not arbitrary. In symmetric toy setups, the learned feature directions arrange themselves into regular geometric configurations such as pairs, triangles, pentagons, or tetrahedra. These arrangements resemble classical sphere-packing or code-design solutions, where vectors are placed to minimize mutual interference. The contribution here is to show that such structured packings emerge naturally from gradient descent in a simple neural network, without being imposed by design.
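With the same sketch, the geometry can be read off the learned weights directly. For instance, in the five-features-in-two-dimensions case, a pentagon shows up as angles near 72 degrees between neighbouring feature directions:

```python
import torch

model = train(ToyModel(n=5, m=2), sparsity=0.99)
W = model.W.detach()                              # shape (2, 5): one 2-D direction per feature
directions = W / W.norm(dim=0, keepdim=True)      # unit vectors (a dropped feature would give NaNs)
cosines = directions.T @ directions               # Gram matrix of feature directions
angles = torch.rad2deg(torch.acos(cosines.clamp(-1.0, 1.0)))
print(angles.round())                             # pentagon-like solutions: neighbours near 72 degrees
```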

A common concern is that superposition might support storage but not computation. The paper addresses this by demonstrating simple computations, such as absolute value, that can be carried out while features remain in superposition. This leads to the view that real networks may behave like noisy simulations of larger sparse networks, preserving the same set of features but compressing them into fewer dimensions and tolerating some interference. From an interpretability perspective, this suggests that understanding a model requires recovering the underlying feature basis, rather than expecting individual neurons to align cleanly with single concepts.
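A hypothetical variant of the toy model makes this concrete. This is my sketch of the idea rather than the paper's exact architecture: a hidden ReLU layer narrower than the number of features is trained to output the absolute value of every feature, so the computation has to run on superposed representations:

```python
import torch

class AbsModel(torch.nn.Module):
    """Narrow hidden ReLU layer (m < n) that must compute y_i = |x_i| for every
    feature, forcing the computation to happen on features stored in superposition."""
    def __init__(self, n: int, m: int):
        super().__init__()
        self.W1 = torch.nn.Parameter(torch.randn(m, n) * 0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n, m) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(x @ self.W1.T)              # (batch, m): superposed hidden layer
        return torch.relu(h @ self.W2.T + self.b)  # (batch, n): predicted |x|

def sample_signed_batch(batch_size: int, n: int, sparsity: float) -> torch.Tensor:
    """Features uniform in [-1, 1], zeroed with probability `sparsity`, so |x| is nontrivial."""
    values = torch.rand(batch_size, n) * 2 - 1
    active = torch.rand(batch_size, n) >= sparsity
    return values * active

n, m = 20, 10
model = AbsModel(n, m)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10_000):
    x = sample_signed_batch(1024, n, sparsity=0.99)
    loss = ((x.abs() - model(x)) ** 2).mean()      # target is the absolute value of each feature
    opt.zero_grad(); loss.backward(); opt.step()
```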

The authors connect this picture to ideas from compressed sensing, where sparse signals can be recovered from low-dimensional projections under appropriate incoherence conditions and where phase transitions are also common. They also speculate about links to adversarial vulnerability and training dynamics such as grokking, though these connections are presented as early evidence rather than established theory.
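The compressed sensing connection can be illustrated with a standard sparse-recovery example, which is textbook material rather than anything from the paper: a sparse vector pushed through a random low-dimensional projection can be recovered almost exactly by an L1-penalised solver.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 50, 5                        # 200 features, 50 measurements, 5 active

x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.uniform(0.5, 1.0, size=k)  # sparse signal
A = rng.normal(size=(m, n)) / np.sqrt(m)    # random (incoherent) projection
y = A @ x                                   # low-dimensional observation

# L1-regularised regression recovers the sparse signal from the projection.
lasso = Lasso(alpha=1e-4, max_iter=50_000, fit_intercept=False)
lasso.fit(A, y)
print("max recovery error:", np.abs(lasso.coef_ - x).max())
```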

The paper naturally aligns with earlier work by Olshausen and Field on sparse coding in vision. In that work, sparsity applies to activity: each image is represented using only a small number of active coefficients, which allows an overcomplete dictionary to exist without excessive interference. In the superposition setting, sparsity applies instead to feature occurrence: most features are inactive most of the time, so collisions in shared dimensions are rare enough to be acceptable. Both frameworks rely on the same trade-off between utility and interference. Sparse coding seeks a dictionary that makes coefficients mostly independent, while superposition accepts interference in a compressed representation and relies on nonlinear gating to suppress it when features are absent. Both lead to the same interpretive conclusion: meaningful structure lives in the right representational basis, not in individual neurons.
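One rough way to line the two up (my paraphrase, not notation from either paper) is to compare the objectives: sparse coding penalises the coefficients directly, while the toy model's sparsity lives in the data distribution and the ReLU handles interference suppression.

```latex
% Sparse coding: overcomplete dictionary D, explicit sparsity penalty on the
% coefficients a used to represent each input x.
\min_{D,\,a}\; \lVert x - D a \rVert_2^2 \;+\; \lambda \lVert a \rVert_1

% Superposition toy model: compressed map W (m < n), no sparsity penalty;
% x itself is sparse, and the ReLU suppresses interference from absent features.
\min_{W,\,b}\; \mathbb{E}_x \,\bigl\lVert x - \mathrm{ReLU}(W^\top W x + b) \bigr\rVert_2^2
```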

In short, Olshausen and Field show that sparsity in activity enables the learning of a structured dictionary. The superposition paper shows that sparsity in feature occurrence enables packing a larger dictionary into fewer dimensions.
