Endogenous selection bias is a central problem for causal inference, yet it can be difficult to recognize in practice. This article introduces a purely graphical way of characterizing endogenous selection bias and of understanding its consequences (Hernán et al. 2004). We use causal graphs (directed acyclic graphs, or DAGs) to highlight that endogenous selection bias stems from conditioning (e.g., controlling, stratifying, or selecting) on a so-called collider variable, i.e., a variable that is itself caused by two other variables, one that is (or is associated with) the treatment and another that is (or is associated with) the outcome. Endogenous selection bias can result from direct conditioning on the outcome variable, a post-outcome variable, a post-treatment variable, or even a pre-treatment variable. We highlight the differences among endogenous selection bias, common-cause confounding, and overcontrol bias and discuss numerous examples from social stratification, cultural sociology, social network analysis, political sociology, social demography, and the sociology of education.
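
The collider mechanism can be illustrated with a small simulation. The following is a minimal sketch, not taken from the article: the variable names (`t`, `y`, `c`), the linear collider equation, the noise terms, and the selection threshold `c > 1` are all illustrative assumptions. Treatment and outcome are generated independently, so any association that appears after selecting on the collider is due to the selection alone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n=20000, seed=0):
    """Draw independent T and Y, then a collider C caused by both (assumed linear)."""
    rng = random.Random(seed)
    t = [rng.gauss(0, 1) for _ in range(n)]  # treatment
    y = [rng.gauss(0, 1) for _ in range(n)]  # outcome, generated independently of T
    c = [ti + yi + rng.gauss(0, 1) for ti, yi in zip(t, y)]  # collider: T -> C <- Y
    return t, y, c


t, y, c = simulate()

# Association between T and Y in the full (unselected) sample.
r_full = pearson(t, y)

# Condition on the collider by keeping only units with C above a threshold.
selected = [(ti, yi) for ti, yi, ci in zip(t, y, c) if ci > 1.0]
r_selected = pearson([s[0] for s in selected], [s[1] for s in selected])

print(f"full sample:      n={len(t)}, corr(T, Y)={r_full:.3f}")
print(f"selected on C>1:  n={len(selected)}, corr(T, Y)={r_selected:.3f}")
```

Under these assumptions, the correlation in the full sample should be close to zero, whereas the correlation among the selected units should be clearly negative, even though the treatment has no effect on the outcome.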




Literature Cited

  1. Alderman H, Behrman J, Kohler H, Maluccio JA, Watkins SC. 2001. Attrition in longitudinal household survey data: some tests for three developing-country samples. Demogr. Res. 5:79–124
  2. Allen MP, Lincoln A. 2004. Critical discourse and the cultural consecration of American films. Soc. Forces 82(3):871–94
  3. Alwin DH, Hauser RM. 1975. The decomposition of effects in path analysis. Am. Sociol. Rev. 40:37–47
  4. Amin V. 2011. Returns to education: evidence from UK twins: comment. Am. Econ. Rev. 101(4):1629–35
  5. Angrist JD, Imbens GW, Rubin DB. 1996. Identification of causal effects using instrumental variables. J. Am. Stat. Assoc. 91:444–55
  6. Angrist JD, Krueger AB. 1999. Empirical strategies in labor economics. In Handbook of Labor Economics, Vol. 3, ed. O Ashenfelter, D Card, pp. 1277–366. Amsterdam: Elsevier
  7. Bareinboim E, Pearl J. 2012. Controlling selection bias in causal inference. UCLA Cogn. Syst. Lab., Tech. Rep. R-381. Proc. 15th Int. Conf. Artif. Intell. Stat. (AISTATS), April 21–23, 2012, La Palma, Canary Islands, ed. N Lawrence, M Girolami, 22:100–8. Brookline, MA: Microtome. http://ftp.cs.ucla.edu/pub/stat_ser/r381.pdf
  8. Baron RM, Kenny DA. 1986. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J. Personal. Soc. Psychol. 51:1173–82
  9. Behr A, Bellgardt E, Rendtel U. 2005. Extent and determinants of panel attrition in the European Community Household Panel. Eur. Sociol. Rev. 21(5):489–512
  10. Berk RA. 1983. An introduction to sample selection bias in sociological data. Am. Sociol. Rev. 48(3):386–98
  11. Berkson J. 1946. Limitations of the application of fourfold tables to hospital data. Biometr. Bull. 2(3):47–53
  12. Blalock H. 1964. Causal Inferences in Nonexperimental Research. Chapel Hill: Univ. N.C. Press
  13. Bollen KA. 1989. Structural Equations with Latent Variables. New York: Wiley
  14. Bollen KA, Pearl J. 2013. Eight myths about causality and structural equation models. In Handbook of Causal Analysis for Social Research, ed. SL Morgan, pp. 301–28. Dordrecht, Neth: Springer
  15. Christofides LN, Li Q, Liu Z, Min I. 2003. Recent two-stage sample selection procedures with an application to the gender wage gap. J. Bus. Econ. Stat. 21:396–405
  16. Cole SR, Hernán MA. 2002. Fallibility in the estimation of direct effects. Int. J. Epidemiol. 31:163–65
  17. Coleman JS, Hoffer T, Kilgore S. 1982. High School Achievement: Public, Catholic, and Private Schools Compared. New York: Basic Books
  18. Duncan OD. 1966. Path analysis: sociological examples. Am. J. Sociol. 72(1):1–16
  19. Elwert F. 2013. Graphical causal models. In Handbook of Causal Analysis for Social Research, ed. SL Morgan, pp. 245–73. Dordrecht, Neth: Springer
  20. Elwert F, Christakis NA. 2008. Wives and ex-wives: a new test for homogamy bias in the widowhood effect. Demography 45(4):851–73
  21. Farr W. 1858. Influence of marriage on the mortality of the French people. In Transactions of the National Association for the Promotion of Social Science, ed. GW Hastings, pp. 504–13. London: John W. Parker & Son
  22. Finn JD, Gerber SB, Boyd-Zaharias J. 2005. Small classes in the early grades, academic achievement, and graduating from high school. J. Educ. Psychol. 97(2):214–23
  23. Frangakis CE, Rubin DB. 2002. Principal stratification in causal inference. Biometrics 58:21–29
  24. Fu V, Winship C, Mare R. 2004. Sample selection bias models. In Handbook of Data Analysis, ed. M Hardy, A Bryman, pp. 409–30. London: Sage
  25. Gangl M, Ziefle A. 2009. Motherhood, labor force behavior, and women's careers: an empirical assessment of the wage penalty for motherhood in Britain, Germany, and the United States. Demography 46(2):341–69
  26. Glymour MM, Greenland S. 2008. Causal diagrams. In Modern Epidemiology, ed. KJ Rothman, S Greenland, T Lash, pp. 183–209. Philadelphia: Lippincott, 3rd ed.
  27. Grasdal A. 2001. The performance of sample selection estimators to control for attrition bias. Health Econ. 10(5):385–98
  28. Greenland S. 2003. Quantifying biases in causal models: classical confounding versus collider-stratification bias. Epidemiology 14:300–6
  29. Greenland S, Pearl J, Robins JM. 1999a. Causal diagrams for epidemiologic research. Epidemiology 10:37–48
  30. Greenland S, Robins JM. 1986. Identifiability, exchangeability and epidemiological confounding. Int. J. Epidemiol. 15:413–19
  31. Greenland S, Robins JM, Pearl J. 1999b. Confounding and collapsibility in causal inference. Stat. Sci. 14:29–46
  32. Griliches Z, Mason WM. 1972. Education, income, and ability. J. Polit. Econ. 80(3):S74–103
  33. Gronau R. 1974. Wage comparisons—a selectivity bias. J. Polit. Econ. 82:1119–44
  34. Gullickson A. 2006. Education and black-white interracial marriage. Demography 43(4):673–89
  35. Hausman JA, Wise DA. 1977. Social experimentation, truncated distributions and efficient estimation. Econometrica 45:919–38
  36. Hausman JA, Wise DA. 1981. Stratification on endogenous variables and estimation. In The Econometrics of Discrete Data, ed. C Manski, D McFadden, pp. 365–91. Cambridge, MA: MIT Press
  37. Heckman JJ. 1974. Shadow prices, market wages and labor supply. Econometrica 42(4):679–94
  38. Heckman JJ. 1976. The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. Ann. Econ. Soc. Meas. 5:475–92
  39. Heckman JJ. 1979. Selection bias as a specification error. Econometrica 47:153–61
  40. Hernán MA, Hernández-Diaz S, Robins JM. 2004. A structural approach to selection bias. Epidemiology 15:615–25
  41. Hernán MA, Hernández-Diaz S, Werler MM, Robins JM, Mitchell AA. 2002. Causal knowledge as a prerequisite of confounding evaluation: an application to birth defects epidemiology. Am. J. Epidemiol. 155(2):176–84
  42. Hill DH. 1997. Adjusting for attrition in event-history analysis. Sociol. Methodol. 27:393–416
  43. Holland PW. 1986. Statistics and causal inference (with discussion). J. Am. Stat. Assoc. 81:945–70
  44. Holland PW. 1988. Causal inference, path analysis, and recursive structural equation models. Sociol. Methodol. 18:449–84
  45. Hudson JI, Javaras KN, Laird NM, VanderWeele TJ, Pope HG, et al. 2008. A structural approach to the familial coaggregation of disorders. Epidemiology 19:431–39
  46. Imai K, Keele L, Yamamoto T. 2010. Identification, inference, and sensitivity analysis for causal mediation effects. Stat. Sci. 25:151–71
  47. Kaufman S, Kaufman J, MacLehose R. 2009. Analytic bounds on causal risk differences in directed acyclic graphs involving three observed binary variables. J. Stat. Plan. Inference 139:3473–87
  48. Kim JH, Pearl J. 1983. A computational model for combined causal and diagnostic reasoning in inference systems. Proc. 8th Int. Jt. Conf. Artif. Intell. (IJCAI-83), Karlsruhe, FRG, Aug. 8–12, 1983, ed. A Bundy, pp. 190–93. San Francisco: Morgan Kaufmann
  49. Leigh A, Ryan C. 2008. Estimating returns to education using different natural experiment techniques. Econ. Educ. Rev. 27(2):149–60
  50. Lin I, Schaeffer NC, Seltzer JA. 1999. Causes and effects of nonparticipation in a child support survey. J. Off. Stat. 15(2):143–66
  51. Manski C. 2003. Partial Identification of Probability Distributions. New York: Springer
  52. Morgan SL, Winship C. 2007. Counterfactuals and Causal Inference: Methods and Principles for Social Research. New York: Cambridge Univ. Press
  53. O'Malley AJ, Elwert F, Rosenquist JN, Zaslavsky AM, Christakis NA. 2014. Estimating peer effects in longitudinal dyadic data using instrumental variables. Biometrics In press. doi: 10.1111/biom.12172
  54. Pagan A, Ullah A. 1997. Nonparametric Econometrics. Cambridge, UK: Cambridge Univ. Press
  55. Pearl J. 1988. Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann
  56. Pearl J. 1995. Causal diagrams for empirical research. Biometrika 82(4):669–710
  57. Pearl J. 1998. Graphs, causality, and structural equation models. Sociol. Methods Res. 27(2):226–84
  58. Pearl J. 2001. Direct and indirect effects. Proc. 17th Conf. Uncertain. Artif. Intell., Aug. 2–5, 2001, Seattle, WA, ed. J Breese, D Koller, pp. 411–20. San Francisco: Morgan Kaufmann
  59. Pearl J. 2009. Causality: Models, Reasoning, and Inference. New York: Cambridge Univ. Press, 2nd ed.
  60. Pearl J. 2010. The foundations of causal inference. Sociol. Methodol. 40:75–149
  61. Pearl J. 2012. The causal mediation formula—a guide to the assessment of pathways and mechanisms. Prev. Sci. 13(4):426–36
  62. Pearl J, Robins JM. 1995. Probabilistic evaluation of sequential plans from causal models with hidden variables. In Uncertainty in Artificial Intelligence 11, ed. P Besnard, S Hanks, pp. 444–53. San Francisco: Morgan Kaufmann
  63. Raymo JM, Iwasawa M. 2005. Marriage market mismatches in Japan: an alternative view of the relationship between women's education and marriage. Am. Sociol. Rev. 70(5):801–22
  64. Robins JM. 1986. A new approach to causal inference in mortality studies with a sustained exposure period: application to the health worker survivor effect. Math. Model. 7:1393–512
  65. Robins JM. 1989. The control of confounding by intermediate variables. Stat. Med. 8:679–701
  66. Robins JM. 1994. Correcting for non-compliance in randomized trials using structural nested mean models. Commun. Stat.-Theory Methods 23:2379–412
  67. Robins JM. 1999. Association, causation, and marginal structural models. Synthese 121:151–79
  68. Robins JM. 2001. Data, design, and background knowledge in etiologic inference. Epidemiology 12(3):313–20
  69. Robins JM. 2003. Semantics of causal DAG models and the identification of direct and indirect effects. In Highly Structured Stochastic Systems, ed. P Green, NL Hjort, S Richardson, pp. 70–81. New York: Oxford Univ. Press
  70. Robins JM, Greenland S. 1992. Identifiability and exchangeability for direct and indirect effects. Epidemiology 3:143–55
  71. Robins JM, Wasserman L. 1999. On the impossibility of inferring causation from association without background knowledge. In Computation, Causation, and Discovery, ed. CN Glymour, GG Cooper, pp. 305–21. Cambridge, MA: AAAI/MIT Press
  72. Rosenbaum PR. 1984. The consequences of adjustment for a concomitant variable that has been affected by the treatment. J. R. Stat. Soc. Ser. A 147(5):656–66
  73. Rothman KJ, Greenland S, Lash TL. 2008. Case-control studies. In Modern Epidemiology, ed. KJ Rothman, S Greenland, TL Lash, pp. 111–27. Philadelphia, PA: Lippincott, 3rd ed.
  74. Rubin DB. 1974. Estimating causal effects of treatments in randomized and non-randomized studies. J. Educ. Psychol. 66:688–701
  75. Schmutz V. 2005. Retrospective cultural consecration in popular music. Am. Behav. Sci. 48(11):1510–23
  76. Schmutz V, Faupel A. 2010. Gender and cultural consecration in popular music. Soc. Forces 89(2):685–708
  77. Shalizi CR, Thomas AC. 2011. Homophily and contagion are generically confounded in observational social network studies. Sociol. Methods Res. 40:211–39
  78. Sharkey P, Elwert F. 2011. The legacy of disadvantage: multigenerational neighborhood effects on cognitive ability. Am. J. Sociol. 116(6):1934–81
  79. Shpitser I, VanderWeele TJ. 2011. A complete graphical criterion for the adjustment formula in mediation analysis. Int. J. Biostat. 7:16
  80. Smith HL. 1990. Specification problems in experimental and nonexperimental social research. Sociol. Methodol. 20:59–91
  81. Sobel ME. 2008. Identification of causal parameters in randomized studies with mediating variables. J. Educ. Behav. Stat. 33(2):230–51
  82. Spirtes P, Glymour CN, Scheines R. 2000. Causation, Prediction, and Search. New York: Springer, 2nd ed.
  83. Steiner PM, Cook TD, Shadish WR, Clark MH. 2010. The importance of covariate selection in controlling for selection bias in observational studies. Psychol. Methods 15(3):250–67
  84. Stolzenberg RM, Relles DA. 1997. Tools for intuition about sample selection bias and its correction. Am. Sociol. Rev. 62(3):494–507
  85. VanderWeele TJ. 2008a. Simple relations between principal stratification and direct and indirect effects. Stat. Probab. Lett. 78:2957–62
  86. VanderWeele TJ. 2008b. The sign of the bias of unmeasured confounding. Biometrics 64:702–6
  87. VanderWeele TJ. 2009a. Marginal structural models for the estimation of direct and indirect effects. Epidemiology 20:18–26
  88. VanderWeele TJ. 2009b. Mediation and mechanism. Eur. J. Epidemiol. 24:217–24
  89. VanderWeele TJ. 2010. Bias formulas for sensitivity analysis for direct and indirect effects. Epidemiology 21:540–51
  90. VanderWeele TJ. 2011a. Causal mediation analysis with survival data. Epidemiology 22:582–85
  91. VanderWeele TJ. 2011b. Sensitivity analysis for contagion effects in social networks. Sociol. Methods Res. 40:240–55
  92. VanderWeele TJ, An W. 2013. Social networks and causal inference. In Handbook of Causal Analysis for Social Research, ed. SL Morgan, pp. 353–74. Dordrecht, Neth: Springer
  93. VanderWeele TJ, Hernán MA, Robins JM. 2008. Causal directed acyclic graphs and the direction of unmeasured confounding bias. Epidemiology 19:720–28
  94. VanderWeele TJ, Robins JM. 2007a. Directed acyclic graphs, sufficient causes, and the properties of conditioning on a common effect. Am. J. Epidemiol. 166(9):1096–104
  95. VanderWeele TJ, Robins JM. 2007b. Four types of effect modification: a classification based on directed acyclic graphs. Epidemiology 18(5):561–68
  96. VanderWeele TJ, Robins JM. 2009a. Minimal sufficient causation and directed acyclic graphs. Ann. Stat. 37:1437–65
  97. VanderWeele TJ, Robins JM. 2009b. Properties of monotonic effects on directed acyclic graphs. J. Mach. Learn. Res. 10:699–718
  98. VanderWeele TJ, Shpitser I. 2011. A new criterion for confounder selection. Biometrics 67:1406–13
  99. Vella F. 1998. Estimating models with sample selection bias: a survey. J. Hum. Resour. 33:127–69
  100. Ver Steeg G, Galstyan A. 2011. A sequence of relaxations constraining hidden variable models. Proc. 27th Conf. Uncertain. Artif. Intell. (UAI2011), July 14–17, 2011, Barcelona, Spain, ed. FG Cozman, A Pfeffer, pp. 717–26. Corvallis, OR: AUAI Press
  101. Verma T, Pearl J. 1988. Causal networks: semantics and expressiveness. Proc. 4th Workshop Uncertain. Artif. Intell., pp. 352–59. Minneapolis, MN/Mountain View, CA: AUAI Press
  102. Weinberg CR. 1993. Towards a clearer definition of confounding. Am. J. Epidemiol. 137:1–8
  103. Winship C, Korenman S. 1997. Does staying in school make you smarter? The effect of education on IQ in The Bell Curve. In Intelligence, Genes, and Success: Scientists Respond to The Bell Curve, ed. B Devlin, SE Fienberg, DP Resnick, K Roeder, pp. 215–34. New York: Springer
  104. Winship C, Mare RD. 1992. Models for sample selection bias. Annu. Rev. Sociol. 18:327–50
  105. Wodtke G, Harding D, Elwert F. 2011. Neighborhood effects in temporal perspective: the impact of long-term exposure to concentrated disadvantage on high school graduation. Am. Sociol. Rev. 76:713–36
  106. Wooldridge J. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press
  107. Wooldridge J. 2005. Violating ignorability of treatment by controlling for too many factors. Econ. Theory 21:1026–28
  108. Wright S. 1934. The method of path coefficients. Ann. Math. Stat. 5(3):161–215


  • Article Type: Review Article