
Abstract

By linking conceptual theories with observed data, generative models can support reasoning in complex situations. They have come to play a central role both within and beyond statistics, providing the basis for power analysis in molecular biology, theory building in particle physics, and resource allocation in epidemiology, for example. We introduce the probabilistic and computational concepts underlying modern generative models and then analyze how they can be used to inform experimental design, iterative model refinement, goodness-of-fit evaluation, and agent-based simulation. We emphasize a modular view of generative mechanisms and discuss how they can be flexibly recombined in new problem contexts. We provide practical illustrations throughout, and code for reproducing all examples is available at . Finally, we observe how research in generative models is currently split across several islands of activity, and we highlight opportunities lying at disciplinary intersections.

/content/journals/10.1146/annurev-statistics-033121-110134
2023-03-09
2024-05-23

  • Article Type: Review Article