An important data analysis task in statistical genomics involves the integration of genome-wide gene-level measurements with preexisting data on the same genes. A wide variety of statistical methodologies and computational tools have been developed for this general task. We emphasize one particular distinction among methodologies, namely whether they process gene sets one at a time (uniset) or simultaneously via some multiset technique. Owing to the complexity of collections of gene sets, the multiset approach offers some advantages, as it naturally accommodates set-size variations and among-set overlaps. However, this approach presents both computational and inferential challenges. After reviewing some statistical issues that arise in uniset analysis, we examine two model-based multiset methods for gene list data.
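As a minimal sketch of what processing gene sets one at a time can look like, the following scores each set separately with a hypergeometric upper-tail probability. The abstract does not name a specific uniset test, so the choice of test, the function names, and the toy gene identifiers are all illustrative assumptions.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the gene set under consideration."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def uniset_enrichment(gene_list, gene_sets, universe):
    """Score every gene set independently (hypothetical helper).

    gene_list: genes flagged by the experiment
    gene_sets: dict mapping set name -> iterable of member genes
    universe:  all genes measured
    """
    universe = set(universe)
    selected = set(gene_list) & universe
    N, n = len(universe), len(selected)
    scores = {}
    for name, members in gene_sets.items():
        members = set(members) & universe
        k = len(members & selected)  # overlap between the set and the gene list
        scores[name] = hypergeom_upper_tail(N, len(members), n, k)
    return scores


# Toy data: two overlapping sets of equal size (illustrative only).
universe = [f"g{i}" for i in range(10)]
gene_list = ["g0", "g1", "g2"]
gene_sets = {
    "A": ["g0", "g1", "g2", "g3"],
    "B": ["g2", "g3", "g7", "g8"],
}
scores = uniset_enrichment(gene_list, gene_sets, universe)
```

Each set is scored in isolation here, so the gene shared by sets A and B contributes to both scores. A multiset method would instead model that overlap jointly.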






  • Article Type: Review Article