Abstract

The current data revolution is changing the conduct of social science research as increasing amounts of digital and administrative data become accessible for use. This new data landscape has created significant tension around data privacy and confidentiality. The risk–utility theory and models underpinning statistical disclosure limitation may be too restrictive for providing data confidentiality owing to the growing volumes and varieties of data and the evolving privacy policies. Science and society need to move to a trust-based approach from which both researchers and participants benefit. This review discusses the explosive growth of the new data sources and the parallel evolution of privacy policy and governance, with a focus on access to data for research. We provide a history of privacy policy, statistical disclosure limitation research, and record linkage in the context of this brave new world of data.

DOI: 10.1146/annurev-statistics-041715-033453
2016-06-01
2024-05-26
  • Article Type: Review Article