
Abstract

In this article, we review the literature on statistical theories of neural networks from three perspectives: approximation, training dynamics, and generative models. In the first part, we review results on the excess risk of neural networks within the nonparametric regression framework. These results rely on explicit constructions of neural networks and lead to fast convergence rates for the excess risk. Nonetheless, the underlying analysis applies only to the global minimizer in the highly nonconvex landscape of deep neural networks. This motivates the second part, in which we review the training dynamics of neural networks. Specifically, we review articles that attempt to answer the question of how a neural network trained via gradient-based methods finds a solution that generalizes well to unseen data. In particular, two well-known paradigms are reviewed: the neural tangent kernel paradigm and the mean-field paradigm. Last, we review the most recent theoretical advances in generative models, including generative adversarial networks, diffusion models, and in-context learning in large language models, from two of the same perspectives: approximation and training dynamics.
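For orientation, the following is a minimal sketch, in standard notation, of the regression setup and the neural tangent kernel linearization that the abstract alludes to. It is not drawn from the article's full text (which is not loaded on this page), and the symbols f*, \hat{f}_n, and \theta are generic placeholders.

% Nonparametric regression: i.i.d. observations with additive noise.
\[
  Y_i = f^*(X_i) + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i \mid X_i] = 0, \qquad i = 1, \dots, n.
\]
% An estimator \hat{f}_n (e.g., an empirical risk minimizer over a class of
% neural networks) is assessed through its excess risk under squared loss:
\[
  \mathcal{E}(\hat{f}_n) \;=\; R(\hat{f}_n) - R(f^*)
  \;=\; \mathbb{E}\bigl[(\hat{f}_n(X) - f^*(X))^2\bigr],
  \qquad R(f) = \mathbb{E}\bigl[(Y - f(X))^2\bigr].
\]
% The neural tangent kernel paradigm analyzes gradient-based training through a
% first-order expansion of the network f(x; \theta) around its initialization \theta_0:
\[
  f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0),
\]
% so that, in the infinite-width limit, training behaves like kernel regression with
% kernel K(x, x') = \langle \nabla_\theta f(x;\theta_0), \nabla_\theta f(x';\theta_0) \rangle.

In these terms, the first part of the review concerns bounds on the excess risk of the global empirical risk minimizer, while the second part concerns whether gradient-based training actually reaches a solution with small excess risk.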

