Statistical genetics is undergoing the same transition to big data that all branches of applied statistics are experiencing, and this transition is only accelerating with the advent of inexpensive DNA sequencing technology. This brief review highlights some modern techniques with recent successes in statistical genetics. These include (a) Lasso penalized regression for association mapping, (b) ethnic admixture estimation, (c) matrix completion for genotype and sequence imputation, (d) the fused Lasso for discovery of copy number variation, (e) haplotyping, (f) relatedness estimation, (g) variance components models, and (h) rare variant testing. For more than a century, genetics has been both a driver and beneficiary of statistical theory and practice. This symbiotic relationship will persist for the foreseeable future.






  • Article Type: Review Article