Loss functions and optimizers

I was reading this excellent explanation of the Adam optimizer – https://towardsdatascience.com/understanding-deep-learning-optimizers-momentum-adagrad-rmsprop-adam-e311e377e9c2/ – and, while looking for the latest work on understanding the derivatives and curvature of loss functions, stumbled upon the paper “Towards Quantifying the Hessian Structure of Neural Networks”, which I think is incisive. Here’s a brief summary.

Modern large language models have staggering vocabularies that fundamentally shape their optimization landscape. Llama 2 uses 32,000 tokens, while Llama 3 and DeepSeek-V3 scale up to roughly 128,000. This massive number of output classes $C$ creates a remarkable mathematical property: the loss Hessian becomes very nearly block-diagonal. The theoretical guarantee is that the ratio of off-diagonal to diagonal block magnitudes scales as $\frac{\|H_{\text{off}}\|_F}{\|H_{\text{diag}}\|_F} = O(1/C) = O(10^{-4})$, meaning the off-diagonal coupling is roughly four orders of magnitude smaller than the diagonal curvature. This prediction aligns with empirical observations across multiple architectures, including GPT-2, other Transformer variants, and OPT-125M, confirming that the theoretical framework captures something fundamental about how LLMs are structured.
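To make the block structure concrete, here is a tiny check of my own (not an experiment from the paper) using plain softmax regression, where the Hessian block for class pair $(c, c')$ has the closed form $H_{c,c'} = \sum_i p_{ic}(\delta_{cc'} - p_{ic'})\, x_i x_i^\top$. The sizes and names below are arbitrary choices for illustration; the point is just to watch the off-diagonal-to-diagonal Frobenius ratio shrink as the number of classes grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def off_to_diag_block_ratio(C, d=8, n=64, seed=0):
    """Frobenius-norm ratio of off-diagonal vs. diagonal Hessian blocks
    for softmax regression with C classes, d features, n samples."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    W = 0.1 * rng.normal(size=(C, d))
    P = softmax(X @ W.T)                      # (n, C) predicted probabilities

    diag_sq = off_sq = 0.0
    for c in range(C):
        for cp in range(C):
            # Block (c, c') of the cross-entropy Hessian:
            # sum_i p_ic (delta_{c,c'} - p_ic') x_i x_i^T
            coef = P[:, c] * ((c == cp) - P[:, cp])
            B = (X * coef[:, None]).T @ X     # d x d block
            if c == cp:
                diag_sq += np.sum(B ** 2)
            else:
                off_sq += np.sum(B ** 2)
    return np.sqrt(off_sq / diag_sq)

for C in (4, 16, 64, 256):
    print(f"C={C:4d}  off/diag Frobenius ratio = {off_to_diag_block_ratio(C):.4f}")
```

Even in this toy convex model the cross-class coupling visibly fades as $C$ grows; the paper’s contribution is quantifying this effect for actual Transformer architectures.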

The effectiveness of Adam, arguably the most widely used optimizer for training large models, becomes transparent when viewed through this lens. Adam’s update rule approximates a diagonal preconditioner: $\theta_{t+1} = \theta_t - \alpha \, \frac{m_t}{\sqrt{v_t} + \epsilon}$, where $v_t \approx \operatorname{diag}(\nabla^2 L)$ estimates only the diagonal of the Hessian. For a typical dense matrix, ignoring everything except the diagonal would be catastrophic. But when the Hessian is block-diagonal with many blocks (large $C$), the situation changes dramatically. The full Hessian can be approximated as $H \approx \operatorname{block\text{-}diag}(H_1, \ldots, H_C, H_{C+1}, \ldots, H_{C+m})$, and the error from using only the diagonal becomes $\frac{\|H - \operatorname{diag}(H)\|_F}{\|H\|_F} = O(1/C)$, which for modern LLMs is negligible. Zhang et al. (2024) made this connection explicit through empirical analysis of blockwise Hessian spectra in Transformer models, and leveraged the insight to design Adam-mini, an optimizer that maintains training quality while cutting optimizer memory by about 50%, keeping a single second-moment estimate per block rather than per parameter.
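To spell out what the diagonal preconditioner means operationally, here is a minimal NumPy sketch of a standard Adam step next to a blockwise variant in the spirit of Adam-mini, where an entire parameter block shares one second-moment scalar. This is my schematic illustration of the idea, not the released Adam-mini code; `block_slices` and the default hyperparameters are assumptions for the example.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam: v holds one second-moment entry per parameter,
    i.e. a diagonal preconditioner."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def blockwise_adam_step(theta, g, m, v_blocks, t, block_slices,
                        lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-mini-style sketch: each block shares a single second-moment
    scalar (the mean squared gradient over that block), so the optimizer
    stores one float per block instead of one per parameter."""
    m = b1 * m + (1 - b1) * g
    theta = theta.copy()
    for i, sl in enumerate(block_slices):
        v_blocks[i] = b2 * v_blocks[i] + (1 - b2) * np.mean(g[sl] ** 2)
        m_hat = m[sl] / (1 - b1 ** t)
        v_hat = v_blocks[i] / (1 - b2 ** t)
        theta[sl] = theta[sl] - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v_blocks
```

The blockwise version only pays off when parameters inside a block genuinely share similar curvature, which is exactly what the near-block-diagonal Hessian result suggests.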

The Muon optimizer, developed by Jordan et al. (2024), exploits the block-diagonal structure more aggressively through block-wise orthogonalization applied to matrix parameters: $W_{t+1} = W_t - \alpha \cdot \operatorname{orthogonalize}(\nabla_W L)$. The theoretical justification is elegant: because each weight matrix experiences approximately independent curvature thanks to the block-diagonal Hessian structure, the orthogonalization acts as a Newton-like step applied independently to each block. This geometric insight has proven remarkably effective in practice, with Muon used to train large models including Moonlight, Kimi-K2, and GLM-4.5. Recent convergence analysis by An et al. (2025) confirms that Muon’s performance gains are specifically driven by the combination of low-rank weight matrices and block-diagonal Hessian structure, showing that the optimizer isn’t merely exploiting an empirical trick but is leveraging a profound structural property of the loss landscape that emerges from the number of output classes.
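For intuition, here is a minimal sketch of a Muon-style step for a single weight matrix. The orthogonalization below is the exact version of the operation (with $G = U S V^\top$, keep $U V^\top$, i.e. set all singular values to 1); the actual Muon implementation approximates this polar factor with a Newton-Schulz iteration for GPU efficiency and includes details I am glossing over, so treat the hyperparameters here as placeholders.

```python
import numpy as np

def orthogonalize(G):
    """Nearest semi-orthogonal matrix to G: with G = U S V^T, return U V^T
    (all singular values replaced by 1). Muon approximates this with a
    Newton-Schulz iteration rather than an explicit SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def muon_like_step(W, grad, buf, lr=0.02, beta=0.95):
    """Sketch of a Muon-style update for one 2-D weight matrix:
    momentum on the raw gradient, then orthogonalize the update direction."""
    buf = beta * buf + grad                 # momentum buffer
    return W - lr * orthogonalize(buf), buf
```

Because each matrix-shaped block of the Hessian is approximately decoupled from the rest, applying this whitening per matrix is what makes the step behave like an independent Newton-like update on each block.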
