Preprint
- Mana Sakai and Masaaki Imaizumi. (2026). Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers. arXiv:2605.07297.
TL;DR: We derive post hoc generalization bounds for deep Transformers that adapt to the learned singular-value spectra of each layer, yielding slower growth with depth and width than existing norm-based bounds.
Abstract
Understanding why trained Transformers generalize well is a fundamental problem in modern machine learning theory, and complexity-based generalization bounds provide a principled way to study this question. While existing norm-based bounds for Transformers remove the explicit polynomial dependence on the hidden dimension, they typically impose fixed norm constraints specified a priori and can exhibit unfavorable exponential dependence on depth. In this paper, we derive spectrum-adaptive post hoc generalization bounds for multi-layer Transformers. Under layerwise spectral norm control, the bounds are expressed in terms of layerwise Schatten quantities of the query-key, value, and feedforward weight matrices. Since the Schatten indices need not be fixed a priori and can instead be selected after training, separately for each matrix type and layer, the bounds adaptively trade off spectral complexity against the dimension- and depth-dependent factors according to the learned singular-value profiles. Empirical comparisons of BERT-adapted proxies for the leading complexity factors suggest that the proxies induced by our bounds grow more slowly with depth and hidden dimension than the corresponding norm-based proxies. Overall, our results provide a complexity-based perspective on how the spectral structure of trained Transformers is reflected in generalization analyses.
Publications
Mana Sakai, Ryo Karakida, and Masaaki Imaizumi. (2025). Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs. In Advances in Neural Information Processing Systems (NeurIPS), Volume 38, pp. 35630-35664.
[Paper, arXiv, Code, Poster]
TL;DR: We rigorously identify the infinite-width limit distribution of neurons within a single attention layer under realistic architectural dimensionality.
Abstract
In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or tailored scaling schemes. In this paper, leveraging the Tensor Programs framework, we rigorously identify the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensionality and standard \(1/\sqrt{n}\)-scaling, where \(n\) is the dimensionality. We derive the exact form of this limit law without resorting to infinite-head approximations or tailored scalings, demonstrating that it departs fundamentally from Gaussianity. The limiting distribution is non-Gaussian because of its hierarchical structure: it is Gaussian conditional on the random similarity scores. Numerical experiments validate our theoretical predictions, confirming that the theory is effective at finite width and accurately describes finite-head attention. Beyond characterizing a standalone attention layer, our findings lay the groundwork for developing a unified theory of deep Transformer architectures in the infinite-width regime.
Mana Sakai, Takeru Matsuda, and Tatsuya Kubokawa. (2025). Priors for second-order unbiased Bayes estimators. Biometrika, 112(4), asaf068.
[Paper, arXiv, Code, Slides (in Japanese), Poster]
TL;DR: We derive priors for second-order unbiased Bayes estimators in non-i.i.d. models.
Abstract
Asymptotically unbiased priors, introduced by Hartigan (1965), are designed to achieve second-order unbiasedness of Bayes estimators. This paper extends Hartigan’s framework to non-i.i.d. models by deriving a system of partial differential equations that characterizes asymptotically unbiased priors. Furthermore, we establish a necessary and sufficient condition for the existence of such priors and propose a simple procedure for constructing them.
The proposed method is applied to several examples, including the linear regression model and the nested error regression (NER) model (also known as the random effects model). Simulation studies evaluate the frequentist properties of the Bayes estimator under the asymptotically unbiased prior for the NER model, highlighting its effectiveness in small-sample settings.
[1] Hartigan, J. A. (1965). The asymptotically unbiased prior distribution. The Annals of Mathematical Statistics 36(4), 1137–1152.
International Presentations
- Mana Sakai, Ryo Karakida, and Masaaki Imaizumi. Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs. The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, USA, 12/2–7, 2025.
- Mana Sakai, Takeru Matsuda, and Tatsuya Kubokawa. Priors for second-order unbiased Bayes estimators. Objective Bayes Methodology Conference 2025, Athens, Greece, 6/8–12, 2025.
Domestic Presentations
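- Mana Sakai, Ryo Karakida, and Masaaki Imaizumi. Analysis of the Infinite-Width Limit of a Single Attention Layer via Tensor Programs. The 28th Information-Based Induction Sciences Workshop (IBIS2025), Naha Cultural Arts Theater NAHArT, 11/12–15, 2025.
- Mana Sakai, Ryo Karakida, and Masaaki Imaizumi. Analysis of the Infinite-Width Limit of a Single Attention Layer via Tensor Programs. Japanese Joint Statistical Meeting 2025, Kansai University, 9/7–9/11, 2025.
- Mana Sakai, Ryo Karakida, and Masaaki Imaizumi. Asymptotic Analysis of the Infinite-Width Limit of the Attention Mechanism via an Extension of Tensor Programs. Japanese Joint Statistical Meeting 2024, Tokyo University of Science, 9/1–9/5, 2024.
- Mana Sakai, Takeru Matsuda, and Tatsuya Kubokawa. Derivation of Second-Order Unbiased Bayes Estimators and Their Properties. Japanese Joint Statistical Meeting 2024, Tokyo University of Science, 9/1–9/5, 2024.
- Mana Sakai, Takeru Matsuda, and Tatsuya Kubokawa. Derivation of Second-Order Unbiased Bayes Estimators and Their Properties. The 38th Conference of the Japanese Society of Computational Statistics, Yamagata, 5/23–5/25, 2024.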
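- Mana Sakai, Takeru Matsuda, and Tatsuya Kubokawa. Derivation of Second-Order Unbiased Bayes Estimators and Their Properties. 2024 Annual Meeting of the Japanese Society of Applied Statistics, Kyushu University, 5/23–5/9, 2024.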