Social life increasingly occurs in digital environments and continues to be mediated by digital systems. Big data represents the data being generated by the digitization of social life, which we break down into three domains: digital life, digital traces, and digitalized life. We argue that there is enormous potential in using big data to study a variety of phenomena that remain difficult to observe. However, there are some recurring vulnerabilities that should be addressed. We also outline the role institutions must play in clarifying the ethical rules of the road. Finally, we conclude by pointing to a number of nascent but important trends in the use of big data.




  • Article Type: Review Article