
Abstract

We expose the statistical foundations of deep learning with the goal of facilitating conversation between the deep learning and statistics communities. We highlight core themes at the intersection; summarize key neural models, such as feedforward neural networks, sequential neural networks, and neural latent variable models; and link these ideas to their roots in probability and statistics. We also highlight research directions in deep learning where there are opportunities for statistical contributions.

/content/journals/10.1146/annurev-statistics-032921-013738
2023-03-09
2024-06-22
  • Article Type: Review Article