Statistical methods in integrative genomics aim to answer important biological questions by jointly analyzing multiple types of genomic data (vertical integration) or by aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods for integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with summary points and future research directions.






  • Article Type: Review Article