Abstract

Demand for access to data, especially data collected using public funds, is ever growing. At the same time, concerns about disclosing the identities of respondents and the sensitive information they provide are leading data collectors to limit access. Synthetic data sets, generated to emulate certain key information found in the actual data and to support valid statistical inferences, offer an attractive framework for providing widespread access to data for analysis while mitigating privacy and confidentiality concerns. The goal of this article is to review various approaches for generating and analyzing synthetic data sets, their inferential justification, their limitations, and directions for future research.

/content/journals/10.1146/annurev-statistics-040720-031848
2021-03-07
2024-06-17
  • Article Type: Review Article