Learned Representations in Neural Networks


Neural networks transform raw inputs — pixels, text, audio — into internal descriptions built layer by layer through learned weights and nonlinearities. The core mechanism is hierarchical composition: early layers detect local patterns like edges or n-gram features, while deeper layers combine these into abstract structures like object parts, semantic concepts, or reasoning patterns. Rather than relying on hand-engineered features, the network discovers whatever internal geometry best serves its training objective.
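As a rough sketch of what this layer-by-layer composition looks like in code (the architecture, layer names, and sizes below are illustrative, not drawn from any particular model), a small convolutional network makes the hierarchy explicit and exposes its intermediate representations:

```python
import torch
import torch.nn as nn

# Illustrative sketch: a tiny conv net whose stages compose local patterns
# (edge-like filters) into progressively more abstract, task-oriented features.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.early = nn.Sequential(          # local patterns
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.mid = nn.Sequential(            # compositions of local patterns
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.late = nn.Sequential(           # abstract, task-level features
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, x, return_features: bool = False):
        h1 = self.early(x)
        h2 = self.mid(h1)
        out = self.late(h2)
        if return_features:
            return out, (h1, h2)             # expose the intermediate representations
        return out

x = torch.randn(1, 3, 32, 32)
logits, (h1, h2) = TinyConvNet()(x, return_features=True)
print(h1.shape, h2.shape, logits.shape)
```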

Representation spaces are not mere lookup tables; they are high-dimensional manifolds with structure that can be analyzed with the tools of differential geometry and information geometry. The Fisher information metric, for instance, naturally measures distances between probability distributions that a network implicitly encodes, connecting the curvature of representation space to the model’s sensitivity and generalization behavior.
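For reference, the Fisher information metric on a parametric family of distributions p_θ(x) is

$$ g_{ij}(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ \frac{\partial \log p_\theta(x)}{\partial \theta_i}\,\frac{\partial \log p_\theta(x)}{\partial \theta_j} \right], $$

so directions in which the encoded distribution changes quickly are "long" under the metric, which is one way to make the sensitivity claim above precise.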

More visibly, semantic relationships in language models manifest as linear directions in activation space, enabling vector arithmetic over meaning. This regularity reflects the network solving a smooth optimization problem in which nearby inputs on the data manifold are mapped to nearby points in representation space.
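A toy illustration of that vector arithmetic, using random placeholder embeddings (with embeddings from a trained model, the analogy direction below typically retrieves "queen"):

```python
import numpy as np

# Toy sketch of linear structure in embedding space. The vectors here are
# random stand-ins; only with real learned embeddings does the analogy work.
rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "apple"]
emb = {w: rng.standard_normal(300) for w in words}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" - "man" + "woman": move along a (hypothetical) gender direction.
query = emb["king"] - emb["man"] + emb["woman"]
candidates = {w: v for w, v in emb.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best, cosine(query, emb[best]))
```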


A critical consequence of this structure is transferability. Representations learned on large datasets tend to capture the intrinsic geometry of the data distribution itself, making them reusable across tasks. This underpins the modern pretrain-and-adapt paradigm: a foundation model distills general representational structure from vast data, and fine-tuning merely redirects it.
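A minimal sketch of that recipe, assuming nothing about the encoder beyond a fixed feature map (the encoder below is a random stand-in): freeze it, reuse its representation space as-is, and train only a small task head.

```python
import torch
import torch.nn as nn

# Hedged sketch of pretrain-and-adapt. `pretrained_encoder` stands in for any
# pretrained feature extractor; only the task head on top of its (frozen)
# representations is trained.
pretrained_encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
for p in pretrained_encoder.parameters():
    p.requires_grad = False                # keep the learned geometry fixed

head = nn.Linear(256, 10)                  # the only trainable parameters
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
with torch.no_grad():
    features = pretrained_encoder(x)       # fixed representation of the inputs
loss = nn.functional.cross_entropy(head(features), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```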
Interpretability research has complicated this picture. Networks appear to use superposition, encoding more features than they have dimensions by distributing concepts across overlapping, near-orthogonal directions rather than isolated neurons. This is geometrically efficient — nearly orthogonal vectors in high dimensions allow exponentially many features to coexist — but it makes the representation space harder to read.

Understanding a model now requires studying directions, circuits, and geodesics in activation space, not individual units. This is the project of mechanistic interpretability: recovering the internal computational geometry that produces a model’s behavior.
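A quick numerical check of the geometric claim behind superposition: random unit vectors in a few hundred dimensions overlap only slightly, so many more feature directions than dimensions can share the space with little interference. The dimensions and feature counts below are arbitrary.

```python
import numpy as np

# Random unit vectors in high dimensions are nearly orthogonal, so a model can
# pack many more "feature directions" than it has dimensions while keeping
# interference between them small.
rng = np.random.default_rng(0)
d, n_features = 512, 2048                   # 4x more features than dimensions
F = rng.standard_normal((n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

overlaps = F @ F.T                          # pairwise cosine similarities
off_diag = np.abs(overlaps[~np.eye(n_features, dtype=bool)])
print(f"mean |cosine| between feature directions: {off_diag.mean():.3f}")  # roughly 0.035
```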

Three frontiers concentrate current research. First, what geometric properties of a representation predict its generalizability — smoothness, dimensionality, curvature of the learned manifold? Second, how do large language models encode causal relations, abstractions, and multi-step reasoning, and does this reflect genuine geometric structure or brittle surface statistics? Third, can training objectives be designed to produce representations that are sparse, disentangled, or causally structured by construction — making the geometry legible from the start rather than reverse-engineered after the fact? This last question connects representation learning directly to AI safety: systems whose internal geometry can be inspected and tested are systems whose behavior can actually be understood.

Concrete examples of each of these three frontiers:

1) Generalization of representations
The clearest example is CLIP, which learns a joint image-text embedding by aligning representations across modalities. Its learned geometry transfers remarkably well to tasks it never saw — zero-shot classification, image retrieval, robotic perception — suggesting it captured something close to the intrinsic manifold of visual concepts rather than task-specific shortcuts. Studying why it transfers (low intrinsic dimensionality? smooth curvature? alignment with human semantic structure?) is an open and active question. (A toy version of the zero-shot recipe appears in the first sketch after this list.)
2) Reasoning structure in language models
Anthropic’s sparse autoencoder work (“Towards Monosemanticity” and “Scaling Monosemanticity”), along with follow-on mechanistic interpretability research, has found evidence that models trained purely on next-token prediction develop internal representations of entity states, spatial relations, and multi-step dependencies — structures that look suspiciously like world models. The cleaner controlled example is Othello-GPT (Nanda et al.), where a transformer trained only on legal move sequences was shown to linearly represent the board state internally, a demonstration that reasoning-like geometric structure emerges without explicit supervision. (The second sketch after this list outlines the linear-probe setup used to show this.)
3) More interpretable representations
β-VAEs are the canonical attempt: weighting the KL term more heavily pushes the latent space toward an axis-aligned, disentangled geometry where individual dimensions correspond to independent generative factors. The result is representations where traversing a single latent direction changes exactly one attribute — pose, lighting, shape — leaving others fixed. The limitation is that disentanglement defined this way is coordinate-dependent and doesn’t guarantee causal structure, which has pushed more recent work toward causal representation learning (Schölkopf et al.) as the right geometric target. (The third sketch after this list shows the modified objective.)
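A toy version of the zero-shot recipe from the first example, with placeholder functions standing in for CLIP's image and text encoders (the real model is trained so that matching image-text pairs have high cosine similarity in the shared space):

```python
import numpy as np

# Hypothetical stand-ins for CLIP's image and text towers: any functions
# mapping inputs into the same d-dimensional space would do here.
rng = np.random.default_rng(0)
d = 64
def encode_image(image) -> np.ndarray:       # placeholder image encoder
    return rng.standard_normal(d)
def encode_text(prompt: str) -> np.ndarray:  # placeholder text encoder
    return rng.standard_normal(d)

def zero_shot_classify(image, class_names):
    """Pick the class whose text embedding lies closest to the image embedding."""
    img = encode_image(image)
    img /= np.linalg.norm(img)
    scores = {}
    for name in class_names:
        txt = encode_text(f"a photo of a {name}")
        scores[name] = float(img @ txt / np.linalg.norm(txt))
    return max(scores, key=scores.get)

print(zero_shot_classify(image=None, class_names=["dog", "cat", "car"]))
```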
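For the second example, the key experimental move is a linear probe: fit a linear map from intermediate activations to the board state and measure how well the state can be read off. The sketch below uses random stand-ins for both activations and labels; in the actual work the activations come from a transformer trained on Othello move sequences.

```python
import numpy as np

# Linear-probe sketch: can the board state be recovered from activations with
# a single linear map? Data here are random placeholders, so accuracy is at
# chance; for the real trained model the probe reads the board off accurately.
rng = np.random.default_rng(0)
n_positions, d_model, n_squares = 5000, 256, 64
acts = rng.standard_normal((n_positions, d_model))          # stand-in residual-stream activations
board = rng.integers(0, 3, size=(n_positions, n_squares))   # 0 = empty, 1 = mine, 2 = theirs

# One least-squares linear probe per board square.
W, *_ = np.linalg.lstsq(acts, board.astype(float), rcond=None)
pred = np.rint(acts @ W).clip(0, 2)
accuracy = (pred == board).mean()
print(f"probe accuracy on stand-in data: {accuracy:.2f}")
```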
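For the third example, the β-VAE modification amounts to a single weighting in the objective: scale the KL term by β > 1 to push the posterior toward the factorized prior, trading reconstruction fidelity for a more axis-aligned latent geometry. A minimal sketch of the loss, with the encoder and decoder omitted:

```python
import torch

def beta_vae_loss(x, x_recon, mu, log_var, beta: float = 4.0):
    """β-VAE objective: reconstruction error plus β-weighted KL to N(0, I).

    With beta = 1 this is the ordinary VAE ELBO; beta > 1 strengthens the
    pressure toward a factorized, disentangled latent code.
    """
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl

# Shapes are illustrative: a batch of 8 inputs with a 10-dimensional latent.
x, x_recon = torch.rand(8, 784), torch.rand(8, 784)
mu, log_var = torch.zeros(8, 10), torch.zeros(8, 10)
print(beta_vae_loss(x, x_recon, mu, log_var).item())
```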
