Information geometry, Cramér–Rao/Fisher metrics, and Laplace approximations are mainly about (i) how distributions change, (ii) how efficiently you can estimate parameters, and (iii) how training behaves locally near a solution. The appearance of intelligent behavior is more about the representational capacity of deep architectures, the inductive biases introduced by attention and composition across layers, and the structure of the data and objective. Geometry can describe the landscape you move on; it does not by itself generate the algorithms the network learns.
On “train twice, get different weights”: this is expected because the parameterization has many symmetries and flat directions (different \theta can implement nearly the same function), and SGD noise plus nonconvexity selects different minima. The information-geometric way to make this consistent is to stop treating “the weights” as the primary object and instead treat the induced distribution p_\theta(\cdot) (or the function it computes) as primary. Two runs can yield very different weights but still end up close in distribution space, meaning a small KL divergence or a small Fisher–Rao distance. In other words, consistency is better assessed in the space Rao cares about: the manifold of model distributions, not Euclidean parameter space.
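As a concrete check of this, one can compare two runs directly in output space. A minimal sketch, assuming two PyTorch classifiers model_a and model_b (hypothetical names) that return logits over the same label set for inputs drawn from a probe loader:

```python
import torch
import torch.nn.functional as F

def mean_output_kl(model_a, model_b, probe_loader):
    """Average KL(p_a || p_b) over a probe set: a function-space comparison
    that ignores how differently the two runs' weights happen to be laid out."""
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in probe_loader:
            log_pa = F.log_softmax(model_a(x), dim=-1)
            log_pb = F.log_softmax(model_b(x), dim=-1)
            # KL(p_a || p_b) = sum_i p_a(i) * (log p_a(i) - log p_b(i))
            kl = (log_pa.exp() * (log_pa - log_pb)).sum(dim=-1).mean()
            total += kl.item()
            count += 1
    return total / max(count, 1)
```

A small value here, despite a large Euclidean distance between the two weight vectors, is exactly the “different weights, nearly the same distribution” situation described above.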
If you want training to be more consistent across runs, there are geometry-aligned options: use constraints measured in KL (trust-region style updates), use natural-gradient/Fisher-preconditioned updates (or approximations such as K-FAC), and evaluate or regularize solutions by function-space distances (e.g., output KL on a probe set) rather than weight distances. These methods do not force identical weights, but they make the learning dynamics and the endpoint more invariant to reparameterization, which is the core “Rao compatibility” criterion.
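As one concrete (and deliberately crude) instance of the second option, here is a sketch of a Fisher-preconditioned update using a single-batch diagonal estimate. It is a stand-in for natural gradient or K-FAC, not a faithful implementation of either, and it assumes a hypothetical loss_fn(model, batch) that returns the mean negative log-likelihood:

```python
import torch

def diag_fisher_step(model, loss_fn, batch, lr=1e-2, damping=1e-3):
    """One SGD-like step in which each gradient coordinate is rescaled by a
    damped diagonal Fisher proxy (the squared batch gradient), so the step is
    measured more nearly in distribution space than in raw weight space."""
    model.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            fisher_diag = p.grad.pow(2)  # coarse single-batch diagonal Fisher proxy
            p -= lr * p.grad / (fisher_diag + damping)
    return loss.item()
```

Proper natural-gradient methods estimate the Fisher over many samples and handle off-diagonal structure (K-FAC does so blockwise per layer); the point of the sketch is only that the preconditioner is derived from the output distribution’s sensitivity, not from the coordinate system.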
Conceptual summary ladder
1 Measurement produces probabilistic data.
2 Estimating physical quantities is an inference problem.
3 Fisher information quantifies distinguishability.
4 The Cramér–Rao bound (CRB) limits estimation precision (a worked example follows this ladder).
5 Quantum mechanics constrains probability models via wavefunctions.
6 Fourier duality links position and momentum information.
7 The Heisenberg principle is a CRB under quantum constraints.
8 The Quantum CRB is the sharpest possible version of this limit.
9 Rao–Blackwellization explains why optimal measurements achieve these bounds.
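To make steps 3–4 concrete, here is the standard textbook example (added for illustration, not taken from the ladder above): estimating the mean \mu of n i.i.d. samples x_i \sim \mathcal{N}(\mu, \sigma^2) with \sigma known.

\[
I_n(\mu) = -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial \mu^2}\log p(x_{1:n}\mid \mu)\right] = \frac{n}{\sigma^2},
\qquad
\operatorname{Var}(\hat{\mu}) \ge \frac{1}{I_n(\mu)} = \frac{\sigma^2}{n}.
\]

The sample mean attains this bound, so it is efficient: the more distinguishable nearby distributions are (larger I_n), the tighter the achievable precision, which is exactly the reading of steps 3–4.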
A transformer is fundamentally a conditional probability machine, not just a matrix stack. Every layer ultimately serves the objective p_\theta(x_t \mid x_{<t}), and training minimizes KL/cross-entropy between the model distribution and the data distribution. Practically, this means many seemingly different tricks (temperature, label smoothing, RLHF KL penalties, calibration, beam search, speculative decoding) are all manipulating or constraining probability distributions. When debugging or improving a model, think first in terms of “what distribution is this network assigning probability mass to?” rather than “what are the raw weights doing?” A small temperature-scaling sketch after these paragraphs illustrates the point.

The Fisher/Hessian viewpoint gives a much better mental model of training dynamics than raw Euclidean gradients. Two parameter changes of equal size can alter the model’s behavior by vastly different amounts. Fisher information measures how sensitive the output distribution is to parameter changes. In practice, curvature-aware methods (natural-gradient ideas, K-FAC, Shampoo, Adam-style preconditioning, trust-region/KL-constrained updates) work because they partially respect the geometry of the model distribution rather than blindly following coordinate gradients. A useful practical heuristic: compare models and updates in function/output space (KL, logits, predictions), not by weight-space distance; the perturbation sketch below makes this concrete.

The internal structure of neural nets is highly non-unique, so interpretability should focus on stable functional/computational patterns, not exact neurons or weights. Training twice often yields different weights because many parameter settings implement nearly the same distribution/function. Attention heads, residual streams, and MLP neurons are therefore best viewed as approximate computational subcircuits rather than fixed semantic modules. Practically, when “hacking” or steering models, interventions that act on representations, activations, logits, or attention patterns are often more robust than interventions tied to exact parameter identities; the logit-bias sketch at the end is the simplest example.
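A minimal sketch of the first point, treating temperature purely as an operation on the conditional distribution (logits is assumed to be a tensor of next-token logits; the helper name is hypothetical):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.7):
    """Temperature rescales p_theta(x_t | x_<t): T < 1 sharpens the distribution,
    T > 1 flattens it, and T -> 0 approaches greedy argmax. The weights are
    untouched; only the distribution being sampled from changes."""
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```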
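To see the Fisher point empirically, perturb the weights by the same Euclidean norm along different directions and compare the resulting output KL; the spread across directions is the anisotropy the Fisher metric describes. A rough sketch, assuming model maps a probe batch probe_x to logits:

```python
import torch
import torch.nn.functional as F

def kl_along_direction(model, probe_x, direction, eps=1e-3):
    """Perturb the weights by Euclidean norm eps along `direction` (a list of
    tensors shaped like model.parameters()) and return the mean
    KL(p_original || p_perturbed) on the probe batch."""
    params = list(model.parameters())
    norm = torch.sqrt(sum((d ** 2).sum() for d in direction))
    with torch.no_grad():
        base = F.log_softmax(model(probe_x), dim=-1)
        for p, d in zip(params, direction):
            p += eps * d / norm          # apply the perturbation
        pert = F.log_softmax(model(probe_x), dim=-1)
        for p, d in zip(params, direction):
            p -= eps * d / norm          # restore the original weights
        kl = (base.exp() * (base - pert)).sum(dim=-1).mean()
    return kl.item()
```

Calling this with a random direction versus, say, the gradient of the loss can give KLs that differ by orders of magnitude for the same eps, which is why weight-space step size alone says little about behavioral change.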
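Finally, the simplest distribution-level intervention from the last paragraph is a logit bias: it steers behavior without knowing which particular weights implement it. A toy sketch (token ids and bias value are placeholders):

```python
import torch

def bias_logits(logits, boosted_token_ids, bias=2.0):
    """Add a constant bias to selected tokens' logits before sampling.
    The intervention lives in output-distribution space, so it transfers
    across runs whose weights differ but whose distributions are close."""
    steered = logits.clone()
    steered[..., boosted_token_ids] += bias
    return steered
```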