Information geometry, Cramér–Rao/Fisher metrics, and Laplace approximations are mainly about (i) how distributions change, (ii) how efficiently you can estimate parameters, and (iii) how training behaves locally near a solution. The appearance of intelligent behavior is more about the representational capacity of deep architectures, the inductive biases introduced by attention and composition across layers, and the structure of the data and objective. Geometry can describe the landscape you move on; it does not by itself generate the algorithms the network learns.
On “train twice, get different weights”: this is expected because the parameterization has many symmetries and flat directions (different \theta can implement nearly the same function), and SGD noise plus nonconvexity selects different minima. The information-geometric way to make this consistent is to stop treating “the weights” as the primary object and instead treat the induced distribution p_\theta(\cdot) (or the function it computes) as primary. Two runs can yield very different weights but still end up close in distribution space, meaning a small KL divergence or a small Fisher–Rao distance. In other words, consistency is better assessed in the space Rao cares about: the manifold of model distributions, not Euclidean parameter space.
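As a concrete check of this, one can compare two runs directly in output space. A minimal sketch, assuming two PyTorch classifiers model_a and model_b (hypothetical names) that return logits over the same label set for inputs drawn from a probe loader:

```python
import torch
import torch.nn.functional as F

def mean_output_kl(model_a, model_b, probe_loader):
    """Average KL(p_a || p_b) over a probe set: a function-space comparison
    that ignores how differently the two runs' weights happen to be laid out."""
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in probe_loader:
            log_pa = F.log_softmax(model_a(x), dim=-1)
            log_pb = F.log_softmax(model_b(x), dim=-1)
            # KL(p_a || p_b) = sum_i p_a(i) * (log p_a(i) - log p_b(i))
            kl = (log_pa.exp() * (log_pa - log_pb)).sum(dim=-1).mean()
            total += kl.item()
            count += 1
    return total / max(count, 1)
```

A small value here, despite a large Euclidean distance between the two weight vectors, is exactly the “different weights, nearly the same distribution” situation described above.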
If you want training to be more consistent across runs, there are geometry-aligned options: use constraints measured in KL (trust-region style updates), use natural-gradient/Fisher-preconditioned updates (or approximations such as K-FAC), and evaluate or regularize solutions by function-space distances (e.g., output KL on a probe set) rather than weight distances. These methods do not force identical weights, but they make the learning dynamics and the endpoint more invariant to reparameterization, which is the core “Rao compatibility” criterion.
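As one concrete (and deliberately crude) instance of the second option, here is a sketch of a Fisher-preconditioned update using a single-batch diagonal estimate. It is a stand-in for natural gradient or K-FAC, not a faithful implementation of either, and it assumes a hypothetical loss_fn(model, batch) that returns the mean negative log-likelihood:

```python
import torch

def diag_fisher_step(model, loss_fn, batch, lr=1e-2, damping=1e-3):
    """One SGD-like step in which each gradient coordinate is rescaled by a
    damped diagonal Fisher proxy (the squared batch gradient), so the step is
    measured more nearly in distribution space than in raw weight space."""
    model.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            fisher_diag = p.grad.pow(2)  # coarse single-batch diagonal Fisher proxy
            p -= lr * p.grad / (fisher_diag + damping)
    return loss.item()
```

Proper natural-gradient methods estimate the Fisher over many samples and handle off-diagonal structure (K-FAC does so blockwise per layer); the point of the sketch is only that the preconditioner is derived from the output distribution’s sensitivity, not from the coordinate system.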
Conceptual summary ladder
1 Measurement produces probabilistic data.
2 Estimating physical quantities is an inference problem.
3 Fisher information quantifies distinguishability.
4 The Cramér–Rao bound (CRB) limits estimation precision (a worked example follows this ladder).
5 Quantum mechanics constrains probability models via wavefunctions.
6 Fourier duality links position and momentum information.
7 The Heisenberg principle is a CRB under quantum constraints.
8 The Quantum CRB is the sharpest possible version of this limit.
9 Rao–Blackwellization explains why optimal measurements achieve these bounds.
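To make steps 3–4 concrete, here is the standard textbook example (added for illustration, not taken from the ladder above): estimating the mean \mu of n i.i.d. samples x_i \sim \mathcal{N}(\mu, \sigma^2) with \sigma known.

\[
I_n(\mu) = -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial \mu^2}\log p(x_{1:n}\mid \mu)\right] = \frac{n}{\sigma^2},
\qquad
\operatorname{Var}(\hat{\mu}) \ge \frac{1}{I_n(\mu)} = \frac{\sigma^2}{n}.
\]

The sample mean attains this bound, so it is efficient: the more distinguishable nearby distributions are (larger I_n), the tighter the achievable precision, which is exactly the reading of steps 3–4.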
A transformer is fundamentally a conditional probability machine, not just a matrix stack. Every layer ultimately serves the objective p_\theta(x_t \mid x_{<t}), and training minimizes KL/cross-entropy between the model distribution and the data distribution. Practically, this means many seemingly different tricks (temperature, label smoothing, RLHF KL penalties, calibration, beam search, speculative decoding) are all manipulating or constraining probability distributions. When debugging or improving a model, think first in terms of “what distribution is this network assigning probability mass to?” rather than “what are the raw weights doing?” A small temperature-scaling sketch after these paragraphs illustrates the point.

The Fisher/Hessian viewpoint gives a much better mental model of training dynamics than raw Euclidean gradients. Two parameter changes of equal size can alter the model’s behavior by vastly different amounts. Fisher information measures how sensitive the output distribution is to parameter changes. In practice, curvature-aware methods (natural-gradient ideas, K-FAC, Shampoo, Adam-style preconditioning, trust-region/KL-constrained updates) work because they partially respect the geometry of the model distribution rather than blindly following coordinate gradients. A useful practical heuristic: compare models and updates in function/output space (KL, logits, predictions), not by weight-space distance; the perturbation sketch below makes this concrete.

The internal structure of neural nets is highly non-unique, so interpretability should focus on stable functional/computational patterns, not exact neurons or weights. Training twice often yields different weights because many parameter settings implement nearly the same distribution/function. Attention heads, residual streams, and MLP neurons are therefore best viewed as approximate computational subcircuits rather than fixed semantic modules. Practically, when “hacking” or steering models, interventions that act on representations, activations, logits, or attention patterns are often more robust than interventions tied to exact parameter identities; the logit-bias sketch at the end is the simplest example.
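A minimal sketch of the first point, treating temperature purely as an operation on the conditional distribution (logits is assumed to be a tensor of next-token logits; the helper name is hypothetical):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.7):
    """Temperature rescales p_theta(x_t | x_<t): T < 1 sharpens the distribution,
    T > 1 flattens it, and T -> 0 approaches greedy argmax. The weights are
    untouched; only the distribution being sampled from changes."""
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```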
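To see the Fisher point empirically, perturb the weights by the same Euclidean norm along different directions and compare the resulting output KL; the spread across directions is the anisotropy the Fisher metric describes. A rough sketch, assuming model maps a probe batch probe_x to logits:

```python
import torch
import torch.nn.functional as F

def kl_along_direction(model, probe_x, direction, eps=1e-3):
    """Perturb the weights by Euclidean norm eps along `direction` (a list of
    tensors shaped like model.parameters()) and return the mean
    KL(p_original || p_perturbed) on the probe batch."""
    params = list(model.parameters())
    norm = torch.sqrt(sum((d ** 2).sum() for d in direction))
    with torch.no_grad():
        base = F.log_softmax(model(probe_x), dim=-1)
        for p, d in zip(params, direction):
            p += eps * d / norm          # apply the perturbation
        pert = F.log_softmax(model(probe_x), dim=-1)
        for p, d in zip(params, direction):
            p -= eps * d / norm          # restore the original weights
        kl = (base.exp() * (base - pert)).sum(dim=-1).mean()
    return kl.item()
```

Calling this with a random direction versus, say, the gradient of the loss can give KLs that differ by orders of magnitude for the same eps, which is why weight-space step size alone says little about behavioral change.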
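Finally, the simplest distribution-level intervention from the last paragraph is a logit bias: it steers behavior without knowing which particular weights implement it. A toy sketch (token ids and bias value are placeholders):

```python
import torch

def bias_logits(logits, boosted_token_ids, bias=2.0):
    """Add a constant bias to selected tokens' logits before sampling.
    The intervention lives in output-distribution space, so it transfers
    across runs whose weights differ but whose distributions are close."""
    steered = logits.clone()
    steered[..., boosted_token_ids] += bias
    return steered
```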