Loss functions and optimizers

I was reading this excellent explanation of the Adam optimizer – https://towardsdatascience.com/understanding-deep-learning-optimizers-momentum-adagrad-rmsprop-adam-e311e377e9c2/ – and, while looking for the latest work on understanding the derivatives and curvature of loss functions, stumbled upon the paper “Towards Quantifying the Hessian Structure of Neural Networks”, which I think is incisive. Here’s a brief summary.

Modern large language models have staggering vocabularies that fundamentally shape their optimization landscape. Llama 2 uses 32,000 tokens, while Llama 3 and DeepSeek-V3 scale up to roughly 128,000. This massive number of output classes $C$ creates a remarkable mathematical property: the loss Hessian becomes very nearly block-diagonal. The theoretical guarantee is that the ratio of off-diagonal to diagonal block magnitudes scales as $\frac{\|H_{\text{off}}\|_F}{\|H_{\text{diag}}\|_F} = O(1/C) = O(10^{-4})$, meaning the off-diagonal coupling is roughly four orders of magnitude smaller than the diagonal curvature. This prediction aligns with empirical observations across multiple architectures, including GPT-2, other Transformer variants, and OPT-125M, confirming that the theoretical framework captures something fundamental about how LLMs are structured.
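To make the block structure concrete, here is a tiny check of my own (not an experiment from the paper) using plain softmax regression, where the Hessian block for class pair $(c, c')$ has the closed form $H_{c,c'} = \sum_i p_{ic}(\delta_{cc'} - p_{ic'})\, x_i x_i^\top$. The sizes and names below are arbitrary choices for illustration; the point is just to watch the off-diagonal-to-diagonal Frobenius ratio shrink as the number of classes grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def off_to_diag_block_ratio(C, d=8, n=64, seed=0):
    """Frobenius-norm ratio of off-diagonal vs. diagonal Hessian blocks
    for softmax regression with C classes, d features, n samples."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    W = 0.1 * rng.normal(size=(C, d))
    P = softmax(X @ W.T)                      # (n, C) predicted probabilities

    diag_sq = off_sq = 0.0
    for c in range(C):
        for cp in range(C):
            # Block (c, c') of the cross-entropy Hessian:
            # sum_i p_ic (delta_{c,c'} - p_ic') x_i x_i^T
            coef = P[:, c] * ((c == cp) - P[:, cp])
            B = (X * coef[:, None]).T @ X     # d x d block
            if c == cp:
                diag_sq += np.sum(B ** 2)
            else:
                off_sq += np.sum(B ** 2)
    return np.sqrt(off_sq / diag_sq)

for C in (4, 16, 64, 256):
    print(f"C={C:4d}  off/diag Frobenius ratio = {off_to_diag_block_ratio(C):.4f}")
```

Even in this toy convex model the cross-class coupling visibly fades as $C$ grows; the paper’s contribution is quantifying this effect for actual Transformer architectures.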

The effectiveness of Adam, arguably the most widely used optimizer for training large models, becomes transparent when viewed through this lens. Adam’s update rule approximates a diagonal preconditioner: $\theta_{t+1} = \theta_t - \alpha \, \frac{m_t}{\sqrt{v_t} + \epsilon}$, where $v_t \approx \operatorname{diag}(\nabla^2 L)$ estimates only the diagonal of the Hessian. For a typical dense matrix, ignoring everything except the diagonal would be catastrophic. But when the Hessian is block-diagonal with many blocks (large $C$), the situation changes dramatically. The full Hessian can be approximated as $H \approx \operatorname{block\text{-}diag}(H_1, \ldots, H_C, H_{C+1}, \ldots, H_{C+m})$, and the error from using only the diagonal becomes $\frac{\|H - \operatorname{diag}(H)\|_F}{\|H\|_F} = O(1/C)$, which for modern LLMs is negligible. Zhang et al. (2024) made this connection explicit through empirical analysis of blockwise Hessian spectra in Transformer models, and leveraged the insight to design Adam-mini, an optimizer that maintains training quality while cutting optimizer memory by about 50%, keeping a single second-moment estimate per block rather than per parameter.
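To spell out what the diagonal preconditioner means operationally, here is a minimal NumPy sketch of a standard Adam step next to a blockwise variant in the spirit of Adam-mini, where an entire parameter block shares one second-moment scalar. This is my schematic illustration of the idea, not the released Adam-mini code; `block_slices` and the default hyperparameters are assumptions for the example.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam: v holds one second-moment entry per parameter,
    i.e. a diagonal preconditioner."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def blockwise_adam_step(theta, g, m, v_blocks, t, block_slices,
                        lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-mini-style sketch: each block shares a single second-moment
    scalar (the mean squared gradient over that block), so the optimizer
    stores one float per block instead of one per parameter."""
    m = b1 * m + (1 - b1) * g
    theta = theta.copy()
    for i, sl in enumerate(block_slices):
        v_blocks[i] = b2 * v_blocks[i] + (1 - b2) * np.mean(g[sl] ** 2)
        m_hat = m[sl] / (1 - b1 ** t)
        v_hat = v_blocks[i] / (1 - b2 ** t)
        theta[sl] = theta[sl] - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v_blocks
```

The blockwise version only pays off when parameters inside a block genuinely share similar curvature, which is exactly what the near-block-diagonal Hessian result suggests.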

The Muon optimizer, developed by Jordan et al. (2024), exploits the block-diagonal structure more aggressively through block-wise orthogonalization applied to matrix parameters: $W_{t+1} = W_t - \alpha \cdot \operatorname{orthogonalize}(\nabla_W L)$. The theoretical justification is elegant: because each weight matrix experiences approximately independent curvature thanks to the block-diagonal Hessian structure, the orthogonalization acts as a Newton-like step applied independently to each block. This geometric insight has proven remarkably effective in practice, with Muon used to train large models including Moonlight, Kimi-K2, and GLM-4.5. Recent convergence analysis by An et al. (2025) confirms that Muon’s performance gains are specifically driven by the combination of low-rank weight matrices and block-diagonal Hessian structure, showing that the optimizer isn’t merely exploiting an empirical trick but is leveraging a profound structural property of the loss landscape that emerges from the number of output classes.
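For intuition, here is a minimal sketch of a Muon-style step for a single weight matrix. The orthogonalization below is the exact version of the operation (with $G = U S V^\top$, keep $U V^\top$, i.e. set all singular values to 1); the actual Muon implementation approximates this polar factor with a Newton-Schulz iteration for GPU efficiency and includes details I am glossing over, so treat the hyperparameters here as placeholders.

```python
import numpy as np

def orthogonalize(G):
    """Nearest semi-orthogonal matrix to G: with G = U S V^T, return U V^T
    (all singular values replaced by 1). Muon approximates this with a
    Newton-Schulz iteration rather than an explicit SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def muon_like_step(W, grad, buf, lr=0.02, beta=0.95):
    """Sketch of a Muon-style update for one 2-D weight matrix:
    momentum on the raw gradient, then orthogonalize the update direction."""
    buf = beta * buf + grad                 # momentum buffer
    return W - lr * orthogonalize(buf), buf
```

Because each matrix-shaped block of the Hessian is approximately decoupled from the rest, applying this whitening per matrix is what makes the step behave like an independent Newton-like update on each block.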
