Abstract

The field of data science currently enjoys a broad definition that includes a wide array of activities which borrow from many other established fields of study. Having such a vague characterization of a field in the early stages might be natural, but over time maintaining such a broad definition becomes unwieldy and impedes progress. In particular, the teaching of data science is hampered by the seeming need to cover many different points of interest. Data scientists must ultimately identify the core of the field by determining what makes the field unique and what it means to develop new knowledge in data science. In this review we attempt to distill some core ideas from data science by focusing on the iterative process of data analysis and develop some generalizations from past experience. Generalizations of this nature could form the basis of a theory of data science and would serve to unify and scale the teaching of data science to large audiences.

/content/journals/10.1146/annurev-statistics-040220-013917
2022-03-07
2024-06-17
  • Article Type: Review Article