Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges

John Wilkerson; Andreu Casas

doi:10.1146/annurev-polisci-052615-025542

Annual Review of Political Science

Volume 20, 2017

Review Article


Affiliations: Department of Political Science, University of Washington, Seattle, Washington 98195; email: [email protected]
Vol. 20:529-544 (Volume publication date May 2017) https://doi.org/10.1146/annurev-polisci-052615-025542
© Annual Reviews

Abstract

Text has always been an important data source in political science. What has changed in recent years is the feasibility of investigating large amounts of text quantitatively. The internet provides political scientists with more data than their mentors could have imagined, and the research community is providing accessible text analysis software packages, along with training and support. As a result, text-as-data research is becoming mainstream in political science. Scholars are tapping new data sources, they are employing more diverse methods, and they are becoming critical consumers of findings based on those methods. In this article, we first describe the four stages of a typical text-as-data project. We then review recent political science applications and explore one important methodological challenge—topic model instability—in greater detail.

Keywords: automatic coding, computational social sciences, machine learning, text as data

