Data Sharing and Access

Alan F. Karr

doi:10.1146/annurev-statistics-041715-033438

Annual Review of Statistics and Its Application

Volume 3, 2016

Review Article

Free


Affiliations: RTI International, Research Triangle Park, North Carolina 27709
Vol. 3:113-132 (Volume publication date June 2016) https://doi.org/10.1146/annurev-statistics-041715-033438
© Annual Reviews

Abstract

Data sharing and access are venerable problems embedded in a rapidly changing milieu. Pressure points include the increasingly data-driven nature of science; the volume, complexity, and distributed nature of data; new concerns regarding privacy and confidentiality; and rising attention to reproducibility of research. In the context of research data, this review surveys extant technologies, articulates a number of identified and emerging issues, and outlines one path for the future. Recognizing that data availability is a public good, research data archives can provide economic and scientific value to both data generators and data consumers in a way that engenders trust. The overall framework is statistical: the use of data for inference.


