Methods Based on Semiparametric Theory for Analysis in the Presence of Missing Data

Marie Davidian

doi:10.1146/annurev-statistics-040120-025906

Annual Review of Statistics and Its Application

Volume 9, 2022

Review Article

Affiliation: Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA
Vol. 9:167-196 (Volume publication date March 2022) https://doi.org/10.1146/annurev-statistics-040120-025906
First published as a Review in Advance on August 20, 2021
Copyright © 2022 by Annual Reviews. All rights reserved

Abstract

A statistical model is a class of probability distributions assumed to contain the true distribution generating the data. In parametric models, the distributions are indexed by a finite-dimensional parameter characterizing the scientific question of interest. Semiparametric models describe the distributions in terms of a finite-dimensional parameter and an infinite-dimensional component, offering more flexibility. Ordinarily, the statistical model represents distributions for the full data intended to be collected. When elements of these full data are missing, the goal is to make valid inference on the full-data-model parameter using the observed data. In a series of fundamental works, Robins, Rotnitzky, and colleagues derived the class of observed-data estimators under a semiparametric model assuming that the missingness mechanism is at random, which leads to practical, robust methodology for many familiar data-analytic challenges. This article reviews semiparametric theory and the key steps in this derivation.

Keywords: augmentation, double robustness, influence function, inverse probability weighting of complete cases, missing at random, monotone missingness
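
As a rough illustration of two ideas named in the keywords, inverse probability weighting of complete cases and augmentation, the following Python sketch estimates a population mean when the outcome is missing at random given a covariate. It is not taken from the article: the simulated data-generating model, the function names `expit` and `simulate_and_estimate`, the use of the true (known) response probability, and the linear outcome regression are all assumptions made for this example.

```python
import math
import random


def expit(z):
    """Logistic function, used here as the (assumed) response probability model."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_and_estimate(n=20000, seed=0):
    """Simulate (X, Y, R) under missing at random and estimate E[Y] three ways.

    Hypothetical setup: Y = 1 + 2X + noise, so the true full-data mean is 1.
    R = 1 means Y is observed; P(R = 1 | X) depends on X only (MAR).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)
        p = expit(0.5 + x)  # true response probability, assumed known here
        r = 1 if rng.random() < p else 0
        data.append((x, y, r, p))

    # Complete cases: the only units whose Y would be seen in practice.
    observed = [(x, y) for x, y, r, _ in data if r == 1]
    n_obs = len(observed)

    # Naive complete-case mean: ignores that observed units differ in X.
    complete_case = sum(y for _, y in observed) / n_obs

    # Inverse probability weighted complete-case estimator.
    # Terms with r == 0 contribute zero, so unobserved Y values are never used.
    ipw = sum(r * y / p for _, y, r, p in data) / n

    # Outcome regression m(X): least squares of Y on X among complete cases,
    # which is valid under MAR because E[Y | X, R = 1] = E[Y | X].
    x_bar = sum(x for x, _ in observed) / n_obs
    y_bar = complete_case
    s_xx = sum((x - x_bar) ** 2 for x, _ in observed)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in observed)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # Augmented IPW estimator: the IPW term minus an augmentation term
    # (R - pi) / pi * m(X), which has mean zero when pi is correct.
    aipw = sum(
        r * y / p - (r - p) / p * (intercept + slope * x)
        for x, y, r, p in data
    ) / n

    return {"complete_case": complete_case, "ipw": ipw, "aipw": aipw}


if __name__ == "__main__":
    for name, value in simulate_and_estimate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the complete-case mean is biased, because units with larger X are more likely to be observed, whereas the weighted and augmented estimators target the full-data mean. The augmented estimator remains consistent if either the response probability or the outcome regression is correctly specified, which is the double robustness property. In practice the response probability would usually be estimated rather than known.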
