Abstract

Following the widespread use of deep learning in genomics, deep generative modeling is also becoming a viable methodology for the field. Deep generative models (DGMs) can learn the complex structure of genomic data and allow researchers to generate novel genomic instances that retain the characteristics of the original dataset. Aside from data generation, DGMs can also be used for dimensionality reduction, by mapping the data space to a latent space, and for prediction tasks, either by exploiting this learned mapping or through supervised or semi-supervised DGM designs. In this review, we briefly introduce generative modeling and two currently prevailing architectures, present conceptual applications along with notable examples in functional and evolutionary genomics, and provide our perspective on potential challenges and future directions.

/content/journals/10.1146/annurev-biodatasci-020722-115651
2023-08-10
2024-05-04

  • Article Type: Review Article