Abstract

Data sharing and access are venerable problems embedded in a rapidly changing milieu. Pressure points include the increasingly data-driven nature of science; the volume, complexity, and distributed nature of data; new concerns regarding privacy and confidentiality; and rising attention to reproducibility of research. In the context of research data, this review surveys extant technologies, articulates a number of identified and emerging issues, and outlines one path for the future. Recognizing that data availability is a public good, research data archives can provide economic and scientific value to both data generators and data consumers in a way that engenders trust. The overall framework is statistical: the use of data for inference.

/content/journals/10.1146/annurev-statistics-041715-033438
2016-06-01
2024-04-18
  • Article Type: Review Article