Annual Review of Statistics and Its Application - Volume 9, 2022
Perspective on Data Science
Vol. 9 (2022), pp. 1–20
The field of data science currently enjoys a broad definition that includes a wide array of activities that borrow from many other established fields of study. Having such a vague characterization of a field in the early stages might be natural, but over time maintaining such a broad definition becomes unwieldy and impedes progress. In particular, the teaching of data science is hampered by the seeming need to cover many different points of interest. Data scientists must ultimately identify the core of the field by determining what makes the field unique and what it means to develop new knowledge in data science. In this review we attempt to distill some core ideas from data science by focusing on the iterative process of data analysis and to develop some generalizations from past experience. Generalizations of this nature could form the basis of a theory of data science and would serve to unify and scale the teaching of data science to large audiences.
Is There a Cap on Longevity? A Statistical Review
Vol. 9 (2022), pp. 21–45
There is sustained and widespread interest in understanding the limit, if there is any, to the human life span. Apart from its intrinsic and biological interest, changes in survival in old age have implications for the sustainability of social security systems. A central question is whether the endpoint of the underlying lifetime distribution is finite. Recent analyses of data on the oldest human lifetimes have led to competing claims about survival and to some controversy, due in part to incorrect statistical analysis. This article discusses the particularities of such data, outlines correct ways of handling them, and presents suitable models and methods for their analysis. We provide a critical assessment of some earlier work and illustrate the ideas through reanalysis of semisupercentenarian lifetime data. Our analysis suggests that remaining life length after age 109 is exponentially distributed and that any upper limit lies well beyond the highest lifetime yet reliably recorded. Lower limits to 95% confidence intervals for the human life span are about 130 years, and point estimates typically indicate no upper limit at all.
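
A minimal sketch, on purely synthetic data, of the kind of threshold-exceedance analysis the abstract describes: fitting an exponential model to remaining life beyond a high age. The threshold, sample size, and mean excess life below are illustrative assumptions, not the semisupercentenarian data analyzed in the article.

```python
import numpy as np

# Synthetic illustration only: simulate "ages at death" of very old individuals
# so that life length beyond age 109 is exponential with mean 1.4 years.
rng = np.random.default_rng(1)
threshold = 109.0
excess = rng.exponential(scale=1.4, size=200)      # remaining life after the threshold
ages_at_death = threshold + excess

# MLE of the exponential mean for the exceedances, with a large-sample 95% CI.
n = excess.size
mean_hat = excess.mean()                            # MLE of the mean excess life
se = mean_hat / np.sqrt(n)
ci = (mean_hat - 1.96 * se, mean_hat + 1.96 * se)

print(f"estimated mean remaining life after {threshold:.0f}: "
      f"{mean_hat:.2f} years, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
# An exponential tail has an infinite endpoint, consistent with the article's
# finding that point estimates typically indicate no finite upper limit.
```
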
A Practical Guide to Family Studies with Lifetime Data
Vol. 9 (2022), pp. 47–69
Familial aggregation refers to the fact that a particular disease may be overrepresented in some families due to genetic or environmental factors. When studying such phenomena, it is clear that one important aspect is the age of onset of the disease in question, and in addition, the data will typically be right-censored. Therefore, one must apply lifetime data methods to quantify such dependence and to separate it into different sources using polygenic modeling. Another important point is that the occurrence of a particular disease can be prevented by death—that is, competing risks—and therefore, the familial aggregation should be studied in a model that allows for both death and the occurrence of the disease. We demonstrate here how polygenic modeling can be done for both survival data and competing risks data subject to right-censoring. The competing risks modeling that we focus on is closely related to the liability threshold model.
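
This is not the polygenic or liability-threshold machinery of the article, only a minimal sketch of how a shared family-level frailty induces dependence between siblings' (possibly censored) lifetimes. All parameter values, the gamma frailty, and the censoring time are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_fam, theta, base_rate = 5000, 0.8, 0.05

# Family-level frailty with mean 1 and variance theta; larger Z means a higher
# hazard for every member of that family (a crude stand-in for shared genes
# and environment).
Z = rng.gamma(shape=1 / theta, scale=theta, size=n_fam)

# Two siblings per family: exponential lifetimes given the shared frailty.
t1 = rng.exponential(1.0 / (Z * base_rate))
t2 = rng.exponential(1.0 / (Z * base_rate))

# Administrative right-censoring at 40 time units.
c = 40.0
obs1, obs2 = np.minimum(t1, c), np.minimum(t2, c)

# Familial aggregation shows up as positive correlation of sibling lifetimes;
# with zero frailty variance it would vanish.
print("corr(sibling lifetimes, uncensored):", np.corrcoef(t1, t2)[0, 1].round(3))
print("corr(censored observations):       ", np.corrcoef(obs1, obs2)[0, 1].round(3))
```
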
Sibling Comparison Studies
Vol. 9 (2022), pp. 71–94
Unmeasured confounding is one of the main sources of bias in observational studies. A popular way to reduce confounding bias is to use sibling comparisons, which implicitly adjust for several factors in the early environment or upbringing without requiring them to be measured or known. In this article we provide a broad exposition of the statistical analysis methods for sibling comparison studies. We further discuss a number of methodological challenges that arise in sibling comparison studies.
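
A minimal simulated sketch of the core idea: within-pair differencing removes any confounder shared by the siblings, whereas a naive pooled analysis does not. The linear model, effect sizes, and confounder strength below are illustrative assumptions, not the article's methods catalogue.

```python
import numpy as np

rng = np.random.default_rng(3)
n_pairs = 20000

# Unmeasured family-level confounder U affects both exposure and outcome.
U = rng.normal(size=n_pairs)
x1 = 0.8 * U + rng.normal(size=n_pairs)          # sibling 1 exposure
x2 = 0.8 * U + rng.normal(size=n_pairs)          # sibling 2 exposure
beta = 0.5                                        # true exposure effect
y1 = beta * x1 + 1.0 * U + rng.normal(size=n_pairs)
y2 = beta * x2 + 1.0 * U + rng.normal(size=n_pairs)

# Naive pooled regression ignoring family: biased by the shared confounder.
x = np.concatenate([x1, x2]); y = np.concatenate([y1, y2])
naive = np.cov(x, y)[0, 1] / np.var(x)

# Sibling comparison: differencing within the pair cancels everything shared.
dx, dy = x1 - x2, y1 - y2
within = np.cov(dx, dy)[0, 1] / np.var(dx)

print(f"true effect {beta}, naive estimate {naive:.3f}, "
      f"sibling-comparison estimate {within:.3f}")
```
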
Value of Information Analysis in Models to Inform Health Policy
Vol. 9 (2022), pp. 95–118
Value of information (VoI) is a decision-theoretic approach to estimating the expected benefits from collecting further information of different kinds, in scientific problems based on combining one or more sources of data. VoI methods can assess the sensitivity of models to different sources of uncertainty and help to set priorities for further data collection. They have been widely applied in healthcare policy making, but the ideas are general to a range of evidence synthesis and decision problems. This article gives a broad overview of VoI methods, explaining the principles behind them, the range of problems that can be tackled with them, and how they can be implemented, and discusses the ongoing challenges in the area.
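
A minimal Monte Carlo sketch of one standard VoI quantity, the expected value of perfect information (EVPI), for a two-option decision. The net-benefit model, the distribution placed on the uncertain parameter, and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_sim = 100_000

# Uncertain incremental net benefit of a new intervention versus current care.
theta = rng.normal(loc=500, scale=2000, size=n_sim)

nb = np.column_stack([np.zeros(n_sim),   # net benefit of current care (reference)
                      theta])            # net benefit of the new intervention

value_current_info = nb.mean(axis=0).max()      # choose the option best on average
value_perfect_info = nb.max(axis=1).mean()      # choose the best option for each theta
evpi = value_perfect_info - value_current_info

print(f"EVPI per decision: {evpi:.1f} (monetary units)")
# Multiplying by the population affected gives an upper bound on what further
# research aimed at resolving the uncertainty could be worth.
```
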
Recent Challenges in Actuarial Science
Vol. 9 (2022), pp. 119–140
For centuries, mathematicians and, later, statisticians have found natural research and employment opportunities in the realm of insurance. By definition, insurance offers financial cover against unforeseen events that involve an important component of randomness, and consequently, probability theory and mathematical statistics enter insurance modeling in a fundamental way. In recent years, a data deluge, coupled with ever-advancing information technology and the birth of data science, has revolutionized or is about to revolutionize most areas of actuarial science as well as insurance practice. We discuss parts of this evolution and, in the case of non-life insurance, show how a combination of classical statistical tools, such as generalized linear models, and newer tools, such as neural networks, contributes to a better understanding and analysis of actuarial data. We further review areas of actuarial science where the cross-fertilization between stochastics and insurance holds promise for both sides. Of course, the vastness of the field of insurance limits our choice of topics; we mainly focus on topics closer to our main areas of research.
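
A minimal sketch of the classical non-life workhorse mentioned in the abstract: a Poisson generalized linear model for claim frequency with an exposure offset, fitted here to simulated policies. The rating factors, coefficients, and portfolio are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000

# Simulated motor portfolio: driver age and vehicle power drive claim frequency.
age = rng.uniform(18, 80, n)
power = rng.uniform(40, 200, n)
exposure = rng.uniform(0.2, 1.0, n)                      # policy years

true_rate = np.exp(-2.0 - 0.015 * (age - 40) + 0.004 * (power - 100))
claims = rng.poisson(true_rate * exposure)

# Poisson GLM with log link and log-exposure offset.
X = sm.add_constant(np.column_stack([age - 40, power - 100]))
glm = sm.GLM(claims, X, family=sm.families.Poisson(), offset=np.log(exposure)).fit()
print(glm.summary())
# A neural network, as discussed in the article, would replace the linear
# predictor with a flexible learned function of the same rating factors.
```
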
Risk Measures: Robustness, Elicitability, and Backtesting
Vol. 9 (2022), pp. 141–166
Risk measures are used not only for financial institutions’ internal risk management but also for external regulation (e.g., in the Basel Accord for calculating the regulatory capital requirements for financial institutions). Though risk measures are fundamental to risk management, how to select a good risk measure is a controversial issue. We review the literature on risk measures, particularly on issues such as subadditivity, robustness, elicitability, and backtesting. We also aim to clarify some misconceptions and confusions in the literature. In particular, we argue that, despite lacking some mathematical convenience, the median shortfall—that is, the median of the tail loss distribution—is a better option than the expected shortfall for setting the Basel Accord's capital requirements due to statistical and economic considerations such as capturing tail risk, robustness, elicitability, backtesting, and surplus invariance.
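
A minimal numerical sketch comparing the expected shortfall and the median shortfall (the median of the tail loss distribution) on simulated heavy-tailed losses. The Student-t loss model, confidence level, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated heavy-tailed losses; all numbers are illustrative.
losses = rng.standard_t(df=3, size=1_000_000)

alpha = 0.99
var_alpha = np.quantile(losses, alpha)                  # value-at-risk at level alpha
tail = losses[losses > var_alpha]                       # the tail loss distribution

expected_shortfall = tail.mean()                        # mean of the tail losses
median_shortfall = np.median(tail)                      # median of the tail losses
# Equivalently, the median shortfall is the VaR at level (1 + alpha) / 2.
print(f"VaR at {alpha}: {var_alpha:.2f}")
print(f"expected shortfall: {expected_shortfall:.2f}")
print(f"median shortfall:   {median_shortfall:.2f} "
      f"(check: VaR at {(1 + alpha) / 2} = {np.quantile(losses, (1 + alpha) / 2):.2f})")
```
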
Methods Based on Semiparametric Theory for Analysis in the Presence of Missing Data
Vol. 9 (2022), pp. 167–196
A statistical model is a class of probability distributions assumed to contain the true distribution generating the data. In parametric models, the distributions are indexed by a finite-dimensional parameter characterizing the scientific question of interest. Semiparametric models describe the distributions in terms of a finite-dimensional parameter and an infinite-dimensional component, offering more flexibility. Ordinarily, the statistical model represents distributions for the full data intended to be collected. When elements of these full data are missing, the goal is to make valid inference on the full-data-model parameter using the observed data. In a series of fundamental works, Robins, Rotnitzky, and colleagues derived the class of observed-data estimators under a semiparametric model assuming that data are missing at random, which leads to practical, robust methodology for many familiar data-analytic challenges. This article reviews semiparametric theory and the key steps in this derivation.
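
A minimal sketch of one well-known member of this class of observed-data estimators: the augmented inverse-probability-weighted (doubly robust) estimator of a mean when the outcome is missing at random given a covariate. The simulated data, the logistic propensity model, and the linear outcome model are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 10000

# Full data: covariate X and outcome Y; Y is missing at random given X.
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)                  # true mean of Y is 1.0
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * X)))          # missingness depends on X only
R = rng.binomial(1, p_obs)                               # R = 1 means Y is observed

# Working models: logistic regression for the propensity, OLS for the outcome.
Xd = sm.add_constant(X)
pi_hat = sm.Logit(R, Xd).fit(disp=0).predict(Xd)
m_hat = sm.OLS(Y[R == 1], Xd[R == 1]).fit().predict(Xd)

# Augmented inverse-probability-weighted (doubly robust) estimator of E[Y],
# using only the observed outcomes.
Y_obs = np.where(R == 1, Y, 0.0)
aipw = np.mean(R * Y_obs / pi_hat + (1 - R / pi_hat) * m_hat)

print(f"complete-case mean: {Y[R == 1].mean():.3f} (biased)")
print(f"AIPW estimate:      {aipw:.3f} (target 1.0)")
```
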
Current Advances in Neural Networks
Vol. 9 (2022), pp. 197–222
This article reviews current advances and developments in neural networks. This requires recalling some of the earlier work in the field. We emphasize Bayesian approaches and their benefits compared to more standard maximum likelihood treatments. Several representative experiments using varied modern neural architectures are presented.
Framing Causal Questions in Life Course Epidemiology
Vol. 9 (2022), pp. 223–248
We describe the principles of counterfactual thinking in providing more precise definitions of causal effects and some of the implications of this work for the way in which causal questions in life course research are framed and evidence evaluated. Terminology is explained and examples of common life course analyses are discussed that focus on the timing of exposures, the mediation of their effects, observed and unobserved confounders, and measurement error. The examples are illustrated by analyses using singleton and twin cohort data.
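
A minimal simulated sketch of two issues the abstract highlights, confounding and measurement error: adjusting for a noisily measured early-life confounder only partially removes confounding bias. The linear data-generating model and all coefficients are illustrative assumptions, unrelated to the cohort data analyzed in the article.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 50000

# Early-life confounder C affects both a later exposure A and an outcome Y.
C = rng.normal(size=n)
A = 0.7 * C + rng.normal(size=n)
Y = 0.3 * A + 0.8 * C + rng.normal(size=n)           # true causal effect of A is 0.3
C_noisy = C + rng.normal(scale=1.0, size=n)           # confounder measured with error

def exposure_slope(design):
    """OLS coefficient of A (second column after the constant)."""
    return sm.OLS(Y, sm.add_constant(design)).fit().params[1]

print("unadjusted:           %.3f" % exposure_slope(A))
print("adjusted for true C:  %.3f" % exposure_slope(np.column_stack([A, C])))
print("adjusted for noisy C: %.3f" % exposure_slope(np.column_stack([A, C_noisy])))
# Measurement error in the confounder leaves residual confounding, one of the
# issues the article discusses in the life course setting.
```
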
Causality and the Cox Regression Model
Vol. 9 (2022), pp. 249–259
This article surveys results concerning the interpretation of the Cox hazard ratio in connection to causality in a randomized study with a time-to-event response. The Cox model is assumed to be correctly specified, and we investigate whether the typical end product of such an analysis, the estimated hazard ratio, has a causal interpretation as a hazard ratio. It has been pointed out that this is not possible due to selection. We provide more insight into the interpretation of hazard ratios and differences, investigating what can be learned about a treatment effect from the hazard ratio approaching unity after a certain period of time. The conclusion is that the Cox hazard ratio is not causally interpretable as a hazard ratio unless there is no treatment effect or an untestable and unrealistic assumption holds. We give a hazard ratio that has a causal interpretation and study its relationship to the Cox hazard ratio.
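
A minimal simulation of the selection effect the abstract refers to: even with a constant individual-level hazard ratio in a randomized study, unobserved heterogeneity (here a gamma frailty) makes the marginal hazard ratio drift toward one over time. The frailty distribution, baseline hazard, and treatment effect are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000

# Randomized treatment halves each individual's hazard, but hazards also vary
# between individuals through an unobserved frailty Z.
treat = rng.binomial(1, 0.5, n)
Z = rng.gamma(shape=1.0, scale=1.0, size=n)           # mean 1, substantial spread
rate = Z * 0.10 * np.where(treat == 1, 0.5, 1.0)      # individual-level hazard
T = rng.exponential(1.0 / rate)

# Empirical hazard ratio among survivors in successive time windows.
edges = np.arange(0, 45, 5)
for lo, hi in zip(edges[:-1], edges[1:]):
    at_risk = T >= lo
    event = at_risk & (T < hi)
    h1 = event[treat == 1].sum() / at_risk[treat == 1].sum()
    h0 = event[treat == 0].sum() / at_risk[treat == 0].sum()
    print(f"[{lo:2d},{hi:2d}): empirical hazard ratio {h1 / h0:.2f}")
# The ratio starts near the individual-level value 0.5 and drifts toward 1,
# because high-frailty individuals are selected out of the control arm faster.
```
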
Effects of Causes and Causes of Effects
Vol. 9 (2022), pp. 261–287
We describe and contrast two distinct problem areas for statistical causality: studying the likely effects of an intervention (effects of causes) and studying whether there is a causal link between the observed exposure and outcome in an individual case (causes of effects). For each of these, we introduce and compare various formal frameworks that have been proposed for that purpose, including the decision-theoretic approach, structural equations, structural and stochastic causal models, and potential outcomes. We argue that counterfactual concepts are unnecessary for studying effects of causes but are needed for analyzing causes of effects. They are, however, subject to a degree of arbitrariness, which can be reduced, though not in general eliminated, by taking account of additional structure in the problem.
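
A toy "causes of effects" calculation, using standard bounds on the probability that an observed exposure was necessary for an observed outcome, computed under an exogeneity assumption from hypothetical (made-up) risks. It is meant only to give a flavor of the individual-case question the abstract distinguishes from "effects of causes."

```python
# Hypothetical experimental risks (illustrative numbers only).
p_y_given_exposed = 0.30      # P(outcome | exposed)
p_y_given_unexposed = 0.10    # P(outcome | unexposed)

rr = p_y_given_exposed / p_y_given_unexposed
# Standard bounds on the probability of causation under exogeneity:
lower = max(0.0, 1.0 - 1.0 / rr)                              # excess-risk-ratio bound
upper = min(1.0, (1.0 - p_y_given_unexposed) / p_y_given_exposed)

print(f"risk ratio: {rr:.2f}")
print(f"probability of causation bounded between {lower:.2f} and {upper:.2f}")
# The width of this interval reflects the residual arbitrariness of
# counterfactual quantities that the article discusses.
```
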
Granger Causality: A Review and Recent Advances
Ali Shojaie and Emily B. Fox
Vol. 9 (2022), pp. 289–319
Introduced more than a half-century ago, Granger causality has become a popular tool for analyzing time series data in many application domains, from economics and finance to genomics and neuroscience. Despite this popularity, the validity of this framework for inferring causal relationships among time series has remained the topic of continuous debate. Moreover, while the original definition was general, limitations in computational tools have constrained the applications of Granger causality to primarily simple bivariate vector autoregressive processes. Starting with a review of early developments and debates, this article discusses recent advances that address various shortcomings of the earlier approaches, from models for high-dimensional time series to more recent developments that account for nonlinear and non-Gaussian observations and allow for subsampled and mixed-frequency time series.
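
A minimal sketch of the classical bivariate setup mentioned in the abstract: an F-test of whether lagged values of one series improve the prediction of another beyond its own lags, on simulated data. The one-lag specification and simulated coefficients are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
T = 500

# Simulate a bivariate system in which x Granger-causes y but not vice versa.
x = np.zeros(T); y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.3 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()

# Does adding lagged x improve the one-step prediction of y?  (one lag, OLS)
Y = y[1:]
Z_r = np.column_stack([np.ones(T - 1), y[:-1]])            # restricted: own lag only
Z_u = np.column_stack([np.ones(T - 1), y[:-1], x[:-1]])    # unrestricted: plus lag of x

rss = lambda Z: np.sum((Y - Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]) ** 2)
rss_r, rss_u = rss(Z_r), rss(Z_u)

q, k = 1, Z_u.shape[1]                                     # restrictions, parameters
F = ((rss_r - rss_u) / q) / (rss_u / (len(Y) - k))
p_value = stats.f.sf(F, q, len(Y) - k)
print(f"F = {F:.1f}, p = {p_value:.2g}: lagged x helps predict y (Granger causality)")
```
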
Score-Driven Time Series Models
Vol. 9 (2022), pp. 321–342
The construction of score-driven filters for nonlinear time series models is described, and they are shown to apply over a wide range of disciplines. Their theoretical and practical advantages over other methods are highlighted. Topics covered include robust time series modeling, conditional heteroscedasticity, count data, dynamic correlation and association, censoring, circular data, and switching regimes.
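
A minimal sketch of a score-driven volatility filter for conditional heteroscedasticity. In the Gaussian case with a standard choice of scaling, the score-driven update for the variance takes the GARCH(1,1)-like form used below; the parameter values are illustrative, and in practice they would be estimated by maximum likelihood rather than assumed known.

```python
import numpy as np

rng = np.random.default_rng(11)
T = 2000
omega, alpha, beta = 0.05, 0.10, 0.85        # illustrative parameter values

# Simulate returns whose conditional variance f_t follows the score-driven
# recursion itself, then run the same filter on the simulated series.
y = np.zeros(T)
f_true = np.zeros(T); f_true[0] = omega / (1 - beta)
for t in range(T - 1):
    y[t] = np.sqrt(f_true[t]) * rng.normal()
    # Gaussian score step: (y_t^2 - f_t) is the scaled score for the variance.
    f_true[t + 1] = omega + alpha * (y[t] ** 2 - f_true[t]) + beta * f_true[t]
y[-1] = np.sqrt(f_true[-1]) * rng.normal()

# Filter the observed series with the same recursion and known parameters.
f_filt = np.zeros(T); f_filt[0] = np.var(y[:50])
for t in range(T - 1):
    f_filt[t + 1] = omega + alpha * (y[t] ** 2 - f_filt[t]) + beta * f_filt[t]

print("correlation of filtered and true variance paths:",
      round(np.corrcoef(f_true, f_filt)[0, 1], 3))
```
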
A Variational View on Statistical Multiscale Estimation
Vol. 9 (2022), pp. 343–372
We present a unifying view on various statistical estimation techniques including penalization, variational, and thresholding methods. These estimators are analyzed in the context of statistical linear inverse problems, with nonparametric regression, change point regression, and high-dimensional linear models as examples. Our approach reveals many seemingly unrelated estimation schemes as special instances of a general class of variational multiscale estimators, called MIND (multiscale Nemirovskii–Dantzig). These estimators result from minimizing certain regularization functionals under convex constraints that can be seen as multiple statistical tests for local hypotheses. For computational purposes, we recast MIND in terms of simpler unconstrained optimization problems via Lagrangian penalization as well as Fenchel duality. Performance of several MINDs is demonstrated on numerical examples.
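
This is not the MIND estimator itself, only the simplest member of the penalized/thresholding family the article unifies: soft thresholding as the closed-form solution of an l1-penalized (Lagrangian) least-squares problem, applied to a synthetic sparse signal. The universal threshold and the signal model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(12)
n, sigma = 1000, 1.0

# Sparse "truth" observed in Gaussian noise.
theta = np.zeros(n)
theta[rng.choice(n, size=30, replace=False)] = rng.normal(scale=5.0, size=30)
y = theta + rng.normal(scale=sigma, size=n)

# Soft thresholding solves  min_x  0.5 * ||y - x||^2 + lam * ||x||_1
# coordinate-wise; the universal threshold below is a standard default choice.
lam = sigma * np.sqrt(2 * np.log(n))
x_hat = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print("MSE of raw observations:", round(np.mean((y - theta) ** 2), 3))
print("MSE after thresholding: ", round(np.mean((x_hat - theta) ** 2), 3))
```
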
Basis-Function Models in Spatial Statistics
Vol. 9 (2022), pp. 373–400
Spatial statistics is concerned with the analysis of data that have spatial locations associated with them, and those locations are used to model statistical dependence between the data. The spatial data are treated as a single realization from a probability model that encodes the dependence through both fixed effects and random effects, where randomness is manifest in the underlying spatial process and in the noisy, incomplete measurement process. The focus of this review article is on the use of basis functions to provide an extremely flexible and computationally efficient way to model spatial processes that are possibly highly nonstationary. Several examples of basis-function models are provided to illustrate how they are used in Gaussian, non-Gaussian, multivariate, and spatio-temporal settings, with applications in geophysics. Our aim is to emphasize the versatility of these spatial-statistical models and to demonstrate that they are now center-stage in a number of application domains. The review concludes with a discussion and illustration of software currently available to fit spatial-basis-function models and implement spatial-statistical prediction.
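
A minimal sketch of the basis-function idea on simulated data: represent a spatial surface as a linear combination of a modest number of radial basis functions and fit the coefficients with a ridge-type penalty. The Gaussian bases, grid of centers, bandwidth, and penalty are illustrative assumptions, a crude stand-in for the fixed-rank spatial random-effects models reviewed in the article.

```python
import numpy as np

rng = np.random.default_rng(13)
n = 800

# Synthetic spatial data on the unit square with a smooth latent surface.
s = rng.uniform(0, 1, size=(n, 2))
truth = np.sin(3 * s[:, 0]) * np.cos(2 * s[:, 1])
z = truth + rng.normal(scale=0.2, size=n)

# Gaussian radial basis functions centered on a coarse grid.
gx, gy = np.meshgrid(np.linspace(0, 1, 6), np.linspace(0, 1, 6))
centers = np.column_stack([gx.ravel(), gy.ravel()])
bandwidth = 0.25

def basis(locs):
    d2 = ((locs[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

Phi = basis(s)                                   # n x K matrix of basis evaluations
# Ridge-type fit of the basis coefficients.
tau = 1e-2
coef = np.linalg.solve(Phi.T @ Phi + tau * np.eye(Phi.shape[1]), Phi.T @ z)

pred = Phi @ coef
print("RMSE of basis-function prediction vs truth:",
      round(np.sqrt(np.mean((pred - truth) ** 2)), 3))
```
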
Measure Transportation and Statistical Decision Theory
Vol. 9 (2022), pp. 401–424
Unlike the real line, the real space, in dimension d ≥ 2, is not canonically ordered. As a consequence, extending to a multivariate context fundamental univariate statistical tools such as quantiles, signs, and ranks is anything but obvious. Tentative definitions have been proposed in the literature but do not enjoy the basic properties (e.g., distribution-freeness of ranks, their independence with respect to the order statistic, their independence with respect to signs) they are expected to satisfy. Based on measure transportation ideas, new concepts of distribution and quantile functions, ranks, and signs have been proposed recently that, unlike previous attempts, do satisfy these properties. These ranks, signs, and quantiles have been used, quite successfully, in several inference problems and have triggered, in a short span of time, a number of applications: fully distribution-free testing for multiple-output regression, MANOVA, and VAR models; R-estimation for VARMA parameters; distribution-free testing for vector independence; multiple-output quantile regression; nonlinear independent component analysis; and so on.
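
A minimal sketch of how empirical center-outward ranks and signs based on measure transportation can be computed in practice: couple the sample with a regular grid in the unit disk by solving a least-squared-cost assignment problem. The grid construction and sample distribution below are illustrative, and details of the canonical grid are simplified.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(14)

# Reference grid in the unit disk: R rings of radii j/(R+1) times S equally
# spaced unit vectors (a simplified version of the standard construction).
R, S = 10, 20
radii = np.arange(1, R + 1) / (R + 1)
angles = 2 * np.pi * np.arange(S) / S
grid = np.array([[r * np.cos(a), r * np.sin(a)] for r in radii for a in angles])

n = R * S
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 2.0]], size=n)

# Empirical center-outward ranks/signs: the least-squared-cost coupling of the
# sample with the grid, computed as an assignment problem.
cost = ((X[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
rows, cols = linear_sum_assignment(cost)
F_plus = np.empty_like(X)
F_plus[rows] = grid[cols]          # F_plus[i] plays the role of the
                                   # center-outward rank/sign of X[i]

# Each grid point is used exactly once, so the transported values are exactly
# uniform over the grid: the distribution-freeness exploited for inference.
print("first observation:", X[0].round(2), "-> center-outward rank:", F_plus[0].round(2))
```
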
Discrete Latent Variable Models
Vol. 9 (2022), pp. 425–452
We review the discrete latent variable approach, which is very popular in statistics and related fields. It allows us to formulate interpretable and flexible models that can be used to analyze complex datasets in the presence of articulated dependence structures among variables. Specific models including discrete latent variables are illustrated, such as finite mixture, latent class, hidden Markov, and stochastic block models. Algorithms for maximum likelihood and Bayesian estimation of these models are reviewed, focusing, in particular, on the expectation–maximization algorithm and the Markov chain Monte Carlo method with data augmentation. Model selection, particularly concerning the number of support points of the latent distribution, is discussed. The approach is illustrated by summarizing applications available in the literature; a brief review of the main software packages to handle discrete latent variable models is also provided. Finally, some possible developments in this literature are suggested.
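
A minimal sketch of the expectation–maximization algorithm for the simplest discrete latent variable model mentioned in the abstract, a two-component Gaussian mixture fitted to simulated data. Starting values, component parameters, and the number of iterations are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(15)

# Synthetic data from a two-component Gaussian mixture (weight 0.3 on the
# component centered at 4, weight 0.7 on the component centered at 0).
z = rng.binomial(1, 0.3, size=2000)
x = np.where(z == 1, rng.normal(4.0, 1.0, 2000), rng.normal(0.0, 1.0, 2000))

# EM for the mixture with component-specific means and variances.
pi, mu, sd = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    # E-step: posterior probability that each point belongs to component 1.
    d0 = (1 - pi) * norm.pdf(x, mu[0], sd[0])
    d1 = pi * norm.pdf(x, mu[1], sd[1])
    w = d1 / (d0 + d1)
    # M-step: update weight, means, and standard deviations.
    pi = w.mean()
    mu = np.array([np.sum((1 - w) * x) / np.sum(1 - w), np.sum(w * x) / np.sum(w)])
    sd = np.sqrt(np.array([np.sum((1 - w) * (x - mu[0]) ** 2) / np.sum(1 - w),
                           np.sum(w * (x - mu[1]) ** 2) / np.sum(w)]))

print(f"estimated weight {pi:.2f}, means {mu.round(2)}, sds {sd.round(2)}")
```
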
Vine Copula Based Modeling
Vol. 9 (2022), pp. 453–477
With the availability of massive multivariate data comes a need to develop flexible multivariate distribution classes. The copula approach allows marginal models to be constructed for each variable separately and joined with a dependence structure characterized by a copula. The class of multivariate copulas was limited for a long time to elliptical (including the Gaussian and t-copula) and Archimedean families (such as Clayton and Gumbel copulas). Both classes are rather restrictive with regard to symmetry and tail dependence properties. The class of vine copulas overcomes these limitations by building a multivariate model using only bivariate building blocks. This gives rise to highly flexible models that still allow for computationally tractable estimation and model selection procedures. These features made vine copula models quite popular among applied researchers in numerous areas of science. This article reviews the basic ideas underlying these models, presents estimation and model selection approaches, and discusses current developments and future directions.
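
A worked instance of the pair-copula construction the abstract describes, written out for three dimensions in one particular (D-vine) ordering chosen here for illustration: the joint density is assembled from the margins and bivariate copula densities only.

```latex
% Three-dimensional pair-copula (D-vine) decomposition of a joint density:
% only the bivariate copula densities c_{12}, c_{23}, and c_{13|2} are needed.
f(x_1,x_2,x_3)
  = f_1(x_1)\, f_2(x_2)\, f_3(x_3)\,
    c_{12}\bigl(F_1(x_1),F_2(x_2)\bigr)\,
    c_{23}\bigl(F_2(x_2),F_3(x_3)\bigr)\,
    c_{13|2}\bigl(F_{1|2}(x_1\mid x_2),\,F_{3|2}(x_3\mid x_2)\bigr)
```

Each bivariate block can be taken from a different parametric family (Gaussian, Clayton, Gumbel, and so on), which is the source of the flexibility the article emphasizes.
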
Quantum Computing in a Statistical Context
Yazhen Wang and Hongzhi Liu
Vol. 9 (2022), pp. 479–504
Quantum computing is widely considered a frontier of interdisciplinary research and involves fields ranging from computer science to physics and from chemistry to engineering. On the one hand, the stochastic essence of quantum physics results in the random nature of quantum computing; thus, there is an important role for statistics to play in the development of quantum computing. On the other hand, quantum computing has great potential to revolutionize computational statistics and data science. This article provides an overview of the statistical aspect of quantum computing. We review the basic concepts of quantum computing and introduce quantum research topics such as quantum annealing and quantum machine learning, which require statistics to be understood.
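
A minimal sketch of the stochastic essence the abstract points to: the outcome of measuring even a single qubit is random, with probabilities given by the Born rule. The circuit (a Hadamard gate applied to the |0> state) and the number of repeated measurements are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(16)

# A single qubit starts in |0>, passes through a Hadamard gate, and is measured.
ket0 = np.array([1.0, 0.0])
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
psi = H @ ket0

# Born rule: measurement outcomes are random with probabilities |amplitude|^2,
# which is where statistics enters even this minimal quantum computation.
probs = np.abs(psi) ** 2
outcomes = rng.choice([0, 1], size=10_000, p=probs)
print("theoretical probabilities:", probs.round(3))
print("observed frequencies:     ", np.bincount(outcomes) / outcomes.size)
```
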