Abstract

This article discusses the role of data visualization in the process of analyzing big data. We describe the historical origins of statistical graphics, from the birth of exploratory data analysis to the impacts of statistical graphics on practice today. We present examples of contemporary data visualizations in the process of exploring airline traffic, global standardized test scores, election monitoring, Wikipedia edits, the housing crisis as observed in San Francisco, and the mining of credit card databases. We provide a review of recent literature. Good data visualization yields better models and predictions and allows for the discovery of the unexpected.

Associated Article

There are media items related to this article:
Data Visualization and Statistical Graphics in Big Data Analysis: Video 1

Data Visualization and Statistical Graphics in Big Data Analysis: Video 2

Data Visualization and Statistical Graphics in Big Data Analysis: Figure 9

/content/journals/10.1146/annurev-statistics-041715-033420
2016-06-01
2024-03-29


Supplemental Material

    Still image of video showing plane movements across the United States on a normal day of operations, January 19, 2006. The video includes red-eye planes leaving the West Coast for the East Coast, the East Coast waking up, and sporadic delayed flights. The code (including links to the data) is available at . An accompanying video at shows operations during a northeastern snow day, March 13, 1993. Used with the permission of Heike Hofmann.

Supplementary Data

    Use of interactive graphics to explore rankings of statistics departments in the United States. The plots show rating variables as side-by-side dotplots, a cluster analysis, a scatterplot of 5th-percentile rank computed using the S (vertical) and R (horizontal) methods, and an institution name lookup. Selecting an institution highlights its values in each of the other plots. Cornell University is highlighted: We can see that its rank by the two methods differs substantially, with a good R rank (around 5) but not such a good S rank (around 30). On the ranking criteria, the department is around the middle of the pack: It is average in terms of number of publications and citations, has few women faculty and students, and accepts students with lower GRE scores than most statistics departments.

  • Article Type: Review Article