Annual Review of Statistics and Its Application - Volume 10, 2023
Fifty Years of the Cox Model
Vol. 10 (2023), pp. 1–23

The Cox model is now 50 years old. The seminal paper of Sir David Cox has had an immeasurable impact on the analysis of censored survival data, with applications in many different disciplines. This work has also stimulated much additional research in diverse areas and led to important theoretical and practical advances. These include semiparametric models, nonparametric efficiency, and partial likelihood. In addition to quickly becoming the go-to method for estimating covariate effects, Cox regression has been extended to a vast number of complex data structures, to all of which the central idea of sampling from the set of individuals at risk at time t can be applied. In this article, we review the Cox paper and the evolution of the ideas surrounding it. We then highlight its extensions to competing risks, with attention to models based on cause-specific hazards, and to hazards associated with the subdistribution or cumulative incidence function. We discuss their relative merits and domains of application. The analysis of recurrent events is another major topic of discussion, including an introduction to martingales and complete intensity models as well as the more practical marginal rate models. We include several worked examples to illustrate the main ideas.
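For reference, the central idea the abstract points to — comparing each failing individual with the set of individuals still at risk at the failure time — is captured by the proportional hazards model and its partial likelihood, written below in standard textbook notation (our notation, not a display taken from the article):

```latex
% Cox proportional hazards model: baseline hazard \lambda_0(t), covariates x_i, coefficients \beta
\lambda(t \mid x_i) = \lambda_0(t)\,\exp(x_i^\top \beta)

% Partial likelihood: each uncensored failure time t_i contributes the probability that
% individual i fails, given one failure among the risk set R(t_i) = \{ j : T_j \ge t_i \}
L(\beta) = \prod_{i:\,\delta_i = 1} \frac{\exp(x_i^\top \beta)}{\sum_{j \in R(t_i)} \exp(x_j^\top \beta)}
```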
High-Dimensional Survival Analysis: Methods and Applications
Stephen Salerno, and Yi Li
Vol. 10 (2023), pp. 25–49

In the era of precision medicine, time-to-event outcomes such as time to death or progression are routinely collected, along with high-throughput covariates. These high-dimensional data defy classical survival regression models, which are either infeasible to fit or likely to incur low predictability due to overfitting. To overcome this, recent emphasis has been placed on developing novel approaches for feature selection and survival prognostication. In this article, we review various cutting-edge methods that handle survival outcome data with high-dimensional predictors, highlighting recent innovations in machine learning approaches for survival prediction. We cover the statistical intuitions and principles behind these methods and conclude with extensions to more complex settings, where competing events are observed. We exemplify these methods with applications to the Boston Lung Cancer Survival Cohort study, one of the largest cancer epidemiology cohorts investigating the complex mechanisms of lung cancer.
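A common entry point to the feature-selection methods surveyed here is the lasso-penalized Cox model. The sketch below is a minimal illustration using the open-source lifelines package on simulated data; the penalty value, simulated data, and column names are arbitrary choices of ours, not material from the article.

```python
# Minimal sketch: lasso-penalized Cox regression on simulated high-dimensional data.
# Assumes the lifelines package is installed; penalizer/l1_ratio values are arbitrary.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n, p = 200, 50                              # more covariates than a classical Cox fit handles well
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -1.0, 0.5]                 # only three covariates truly matter
T = rng.exponential(1.0 / np.exp(X @ beta)) # event times from a proportional hazards model
C = rng.exponential(2.0, size=n)            # independent censoring times
df = pd.DataFrame(X, columns=[f"x{j}" for j in range(p)])
df["time"] = np.minimum(T, C)
df["event"] = (T <= C).astype(int)

# l1_ratio=1.0 gives a pure lasso penalty, shrinking most coefficients to (near) zero
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="time", event_col="event")
print(cph.params_.abs().sort_values(ascending=False).head(10))
```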
Shared Frailty Methods for Complex Survival Data: A Review of Recent Advances
Vol. 10 (2023), pp. 51–73

Dependent survival data arise in many contexts. One context is clustered survival data, where survival data are collected on clusters such as families or medical centers. Dependent survival data also arise when multiple survival times are recorded for each individual. Frailty models are one common approach to handle such data. In frailty models, the dependence is expressed in terms of a random effect, called the frailty. Frailty models have been used with both the Cox proportional hazards model and the accelerated failure time model. This article reviews recent developments in the area of frailty models in a variety of settings. In each setting we provide a detailed model description, assumptions, available estimation methods, and R packages.
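For orientation, the canonical shared frailty extension of the Cox model can be written as below; this is the standard textbook form with a gamma frailty, in our notation rather than the article's:

```latex
% Hazard for subject j in cluster i, given the cluster-level frailty \omega_i
\lambda_{ij}(t \mid \omega_i) = \omega_i\, \lambda_0(t)\, \exp(x_{ij}^\top \beta)

% A common choice: gamma frailty with mean 1 and variance \theta, so larger \theta
% corresponds to stronger within-cluster dependence
\omega_i \sim \mathrm{Gamma}\!\left(\tfrac{1}{\theta}, \tfrac{1}{\theta}\right)
```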
Surrogate Endpoints in Clinical Trials
Vol. 10 (2023), pp. 75–96

Surrogate markers are often used in clinical trial settings when obtaining a final outcome to evaluate the effectiveness of a treatment requires a long wait, is expensive to obtain, or both. Formal definitions of surrogate marker quality resulting from a large variety of estimation approaches have been proposed over the years. I review this work, with a particular focus on approaches that use the causal inference paradigm, as these conceptualize a good marker as one in the causal pathway between the treatment and outcome. I also focus on efforts to evaluate the risk of a surrogate paradox, a damaging situation where the surrogate is positively associated with the outcome, and the causal effect of the treatment on the surrogate is in a helpful direction, but the ultimate causal effect of the treatment on the outcome is harmful. I then review some recent work in robust surrogate marker estimation and conclude with a discussion and suggestions for future research.
Sustainable Statistical Capacity-Building for Africa: The Biostatistics Case
Vol. 10 (2023), pp. 97–117

Several major global challenges, including climate change and water scarcity, warrant a scientific approach to generating solutions. Developing high-quality and robust capacity in (bio)statistics is key to ensuring sound scientific solutions to these challenges, so collaboration between academic and research institutes should be high on university agendas. To strengthen capacity in the developing world, South–North partnerships should be a priority. The ideas and examples of statistical capacity-building presented in this article are the result of several monthly online discussions among a mixed group of authors having international experience and formal links with Hasselt University in Belgium. The discussion focuses on statistical capacity-building through education (teaching), research, and societal impact. We have adopted an example-based approach, and in view of the background of the authors, the examples refer mainly to biostatistical capacity-building. Although many universities worldwide have already initiated university collaborations for development, we hope and believe that our ideas and concrete examples can serve as inspiration to further strengthen South–North partnerships on statistical capacity-building.
Confidentiality Protection in the 2020 US Census of Population and Housing
Vol. 10 (2023), pp. 119–144

In an era where external data and computational capabilities far exceed statistical agencies’ own resources and capabilities, they face the renewed challenge of protecting the confidentiality of underlying microdata when publishing statistics in very granular form and ensuring that these granular data are used for statistical purposes only. Conventional statistical disclosure limitation methods are too fragile to address this new challenge. This article discusses the deployment of a differential privacy framework for the 2020 US Census that was customized to protect confidentiality, particularly the most detailed geographic and demographic categories, and deliver controlled accuracy across the full geographic hierarchy.
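To make the underlying idea concrete, the sketch below adds Laplace noise to a vector of cell counts, the textbook mechanism for ε-differential privacy. It is a toy illustration of the general framework only, not the Census Bureau's customized TopDown algorithm; the counts and privacy parameter are invented.

```python
# Toy Laplace mechanism: a textbook epsilon-differentially private release of cell counts.
# This illustrates the general DP framework, not the 2020 Census production algorithm.
import numpy as np

def laplace_release(counts, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace(sensitivity/epsilon) noise to each count.

    For counting queries, adding or removing one person changes each count by at most 1,
    so the sensitivity is 1; smaller epsilon means noisier, more private output.
    """
    rng = np.random.default_rng(seed)
    scale = sensitivity / epsilon
    return counts + rng.laplace(loc=0.0, scale=scale, size=len(counts))

# Hypothetical block-level population counts for a handful of demographic cells
true_counts = np.array([120, 43, 7, 0, 311], dtype=float)
noisy_counts = laplace_release(true_counts, epsilon=0.5, seed=1)
print(np.round(noisy_counts, 1))   # noisy, possibly negative, values to be post-processed
```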
The Role of Statistics in Promoting Data Reusability and Research Transparency
Vol. 10 (2023), pp. 145–164

The value of research data has grown as the emphasis on research transparency and data-intensive research has increased. Data sharing is now required by funders and publishers and is becoming a disciplinary expectation in many fields. However, practices promoting data reusability and research transparency are poorly understood, making it difficult for statisticians and other researchers to reframe study methods to facilitate data sharing. This article reviews the larger landscape of open research and describes contextual information that data reusers need to understand, evaluate, and appropriately analyze shared data. The article connects data reusability to statistical thinking by considering the impact of the type and quality of shared research artifacts on the capacity to reproduce or replicate studies and examining quality evaluation frameworks to understand the nature of data errors and how they can be mitigated prior to sharing. Actions statisticians can take to update their research approaches for their own and collaborative investigations are suggested.
Fair Risk Algorithms
Vol. 10 (2023), pp. 165–187

Machine learning algorithms are becoming ubiquitous in modern life. When used to help inform human decision making, they have been criticized by some for insufficient accuracy, an absence of transparency, and unfairness. Many of these concerns can be legitimate, although they are less convincing when compared with the uneven quality of human decisions. There is now a large literature in statistics and computer science offering a range of proposed improvements. In this article, we focus on machine learning algorithms used to forecast risk, such as those employed by judges to anticipate a convicted offender's future dangerousness and by physicians to help formulate a medical prognosis or ration scarce medical care. We review a variety of conceptual, technical, and practical features common to risk algorithms and offer suggestions for how their development and use might be meaningfully advanced. Fairness concerns are emphasized.
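One practical fairness check in this literature is to compare a risk algorithm's error rates across protected groups. The sketch below computes group-specific false positive rates on simulated predictions; the threshold, group labels, and data are illustrative assumptions of ours, not material from the article.

```python
# Illustrative fairness audit: false positive rates of a risk score by group.
# Data, threshold, and group labels are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
group = rng.choice(["A", "B"], size=n)              # protected attribute
y_true = rng.binomial(1, 0.3, size=n)               # observed outcome of interest
risk_score = np.clip(0.3 * y_true + rng.normal(0.3, 0.2, size=n), 0, 1)
y_pred = (risk_score >= 0.5).astype(int)            # "high risk" classification at a fixed threshold

for g in ["A", "B"]:
    negatives = (group == g) & (y_true == 0)        # true negatives in group g
    fpr = y_pred[negatives].mean()                  # fraction of negatives flagged as high risk
    print(f"group {g}: false positive rate = {fpr:.3f}")
```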
Statistical Data Privacy: A Song of Privacy and Utility
Vol. 10 (2023), pp. 189–218

To quantify trade-offs between increasing demand for open data sharing and concerns about sensitive information disclosure, statistical data privacy (SDP) methodology analyzes data release mechanisms that sanitize outputs based on confidential data. Two dominant frameworks exist: statistical disclosure control (SDC) and the more recent differential privacy (DP). Despite framing differences, both SDC and DP share the same statistical problems at their core. For inference problems, either we may design optimal release mechanisms and associated estimators that satisfy bounds on disclosure risk measures, or we may adjust existing sanitized output to create new statistically valid and optimal estimators. Regardless of design or adjustment, in evaluating risk and utility, valid statistical inferences from mechanism outputs require uncertainty quantification that accounts for the effect of the sanitization mechanism that introduces bias and/or variance. In this review, we discuss the statistical foundations common to both SDC and DP, highlight major developments in SDP, and present exciting open research problems in private inference.
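As a reference point for the DP framework mentioned above, the standard definition of ε-differential privacy is given below; this is the usual textbook statement rather than a formula quoted from the article.

```latex
% A randomized mechanism M is \varepsilon-differentially private if, for all pairs of
% neighboring datasets D and D' (differing in one record) and every measurable set S,
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[M(D') \in S].
```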
A Brief Tour of Deep Learning from a Statistical Perspective
Vol. 10 (2023), pp. 219–246

We expose the statistical foundations of deep learning with the goal of facilitating conversation between the deep learning and statistics communities. We highlight core themes at the intersection; summarize key neural models, such as feedforward neural networks, sequential neural networks, and neural latent variable models; and link these ideas to their roots in probability and statistics. We also highlight research directions in deep learning where there are opportunities for statistical contributions.
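To make the statistical framing concrete, a single-hidden-layer feedforward network is simply a parametric nonlinear regression function. The NumPy forward pass below is a minimal sketch of that view; the dimensions and activation are arbitrary choices, and no training loop is included.

```python
# A one-hidden-layer feedforward network viewed as nonlinear regression:
# f(x) = W2 * relu(W1 x + b1) + b2, to be fit by minimizing squared error (fitting not shown).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 16, 1
W1, b1 = rng.normal(size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), np.zeros(d_out)

def forward(x):
    """Forward pass: affine map, elementwise ReLU nonlinearity, affine map."""
    h = np.maximum(0.0, W1 @ x + b1)
    return W2 @ h + b2

x = rng.normal(size=d_in)
y = rng.normal()
y_hat = forward(x)
squared_error = (y - y_hat[0]) ** 2      # the loss a least-squares fit would minimize
print(y_hat, squared_error)
```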
Statistical Deep Learning for Spatial and Spatiotemporal Data
Vol. 10 (2023), pp. 247–270

Deep neural network models have become ubiquitous in recent years and have been applied to nearly all areas of science, engineering, and industry. These models are particularly useful for data that have strong dependencies in space (e.g., images) and time (e.g., sequences). Indeed, deep models have also been extensively used by the statistical community to model spatial and spatiotemporal data through, for example, the use of multilevel Bayesian hierarchical models and deep Gaussian processes. In this review, we first present an overview of traditional statistical and machine learning perspectives for modeling spatial and spatiotemporal data, and then focus on a variety of hybrid models that have recently been developed for latent process, data, and parameter specifications. These hybrid models integrate statistical modeling ideas with deep neural network models in order to take advantage of the strengths of each modeling paradigm. We conclude by giving an overview of computational technologies that have proven useful for these hybrid models, and with a brief discussion on future research directions.
Statistical Machine Learning for Quantitative Finance
Vol. 10 (2023), pp. 271–295

We survey the active interface of statistical learning methods and quantitative finance models. Our focus is on the use of statistical surrogates, also known as functional approximators, for learning input–output relationships relevant for financial tasks. Given the disparate terminology used among statisticians and financial mathematicians, we begin by reviewing the main ingredients of surrogate construction and the motivating financial tasks. We then summarize the major surrogate types, including (deep) neural networks, Gaussian processes, gradient boosting machines, smoothing splines, and Chebyshev polynomials. The second half of the article dives deeper into the major applications of statistical learning in finance, covering (a) parametric option pricing, (b) learning the implied/local volatility surface, (c) learning option sensitivities, (d) American option pricing, and (e) model calibration. We also briefly detail statistical learning for stochastic control and reinforcement learning, two areas of research exploding in popularity in quantitative finance.
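As a concrete instance of a statistical surrogate for a pricing task, the sketch below fits a Gaussian process to Black-Scholes call prices over a small grid of strikes and maturities, then predicts prices off the grid. It is a toy setup with arbitrary spot, rate, and volatility values, using scikit-learn's GP regressor rather than any implementation from the article.

```python
# Toy surrogate: Gaussian process regression of Black-Scholes call prices
# over (strike, maturity), with arbitrary spot/rate/volatility values.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call option."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

S0, r, sigma = 100.0, 0.02, 0.25
strikes = np.linspace(80, 120, 9)
maturities = np.linspace(0.1, 2.0, 8)
KK, TT = np.meshgrid(strikes, maturities)
X_train = np.column_stack([KK.ravel(), TT.ravel()])          # design points: (strike, maturity)
y_train = bs_call(S0, X_train[:, 0], X_train[:, 1], r, sigma)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=[10.0, 0.5]), normalize_y=True)
gp.fit(X_train, y_train)

X_new = np.array([[95.0, 0.75], [110.0, 1.5]])               # off-grid contracts
pred, sd = gp.predict(X_new, return_std=True)                # surrogate price and its uncertainty
print(pred, sd)                                              # compare with bs_call evaluated at X_new
```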
Models for Integer Data
Vol. 10 (2023), pp. 297–323

Over the past few years, interest has increased in models defined on positive and negative integers. Several application areas lead to data that are differences between positive integers. Some important examples are price changes measured discretely in financial applications, pre- and posttreatment measurements of discrete outcomes in clinical trials, the difference in the number of goals in sports events, and differencing of count-valued time series. This review aims at bringing together a wide range of models that have appeared in the literature in recent decades. We provide an extensive review of discrete distributions defined for integer data and then consider univariate and multivariate time-series models, including the class of autoregressive models, stochastic processes, and ARCH/GARCH-type (autoregressive/generalized autoregressive conditionally heteroskedastic) models.
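A canonical example of a distribution on the full set of integers, arising as the difference of two independent Poisson counts (e.g., a goal difference), is the Skellam distribution. Its standard probability mass function is shown below for reference; this is the textbook form, not an equation reproduced from the article.

```latex
% Z = N_1 - N_2 with N_1 ~ Poisson(\mu_1) and N_2 ~ Poisson(\mu_2) independent:
P(Z = k) \;=\; e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{k/2} I_{|k|}\!\left(2\sqrt{\mu_1 \mu_2}\right),
\qquad k \in \mathbb{Z},
% where I_{\nu} denotes the modified Bessel function of the first kind.
```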
Generative Models: An Interdisciplinary Perspective
Vol. 10 (2023), pp. 325–352

By linking conceptual theories with observed data, generative models can support reasoning in complex situations. They have come to play a central role both within and beyond statistics, providing the basis for power analysis in molecular biology, theory building in particle physics, and resource allocation in epidemiology, for example. We introduce the probabilistic and computational concepts underlying modern generative models and then analyze how they can be used to inform experimental design, iterative model refinement, goodness-of-fit evaluation, and agent-based simulation. We emphasize a modular view of generative mechanisms and discuss how they can be flexibly recombined in new problem contexts. We provide practical illustrations throughout, and code for reproducing all examples is available at https://github.com/krisrs1128/generative_review. Finally, we observe how research in generative models is currently split across several islands of activity, and we highlight opportunities lying at disciplinary intersections.
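As a small illustration of the generative workflow described here (simulate from a mechanism, then check its fit against data), the sketch below fits a Poisson model to overdispersed counts and compares the variance of simulated replicates with the observed variance, a basic goodness-of-fit check. The data and model are invented for illustration and are unrelated to the article's repository examples.

```python
# Minimal generative-model workflow: simulate replicates from a fitted Poisson mechanism
# and compare a summary statistic (the variance) of simulated data with the observed data.
import numpy as np

rng = np.random.default_rng(0)
# "Observed" counts, deliberately overdispersed so the Poisson mechanism should fail the check
observed = rng.negative_binomial(n=5, p=0.4, size=300)

lam_hat = observed.mean()                            # fitted Poisson rate
sim_vars = np.array([
    rng.poisson(lam_hat, size=observed.size).var()   # variance of one simulated replicate
    for _ in range(1000)
])

obs_var = observed.var()
p_value = (sim_vars >= obs_var).mean()               # predictive check: is the observed variance extreme?
print(f"observed var {obs_var:.2f}, mean simulated var {sim_vars.mean():.2f}, p = {p_value:.3f}")
```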
Data Integration in Bayesian Phylogenetics
Vol. 10 (2023), pp. 353–377

Researchers studying the evolution of viral pathogens and other organisms increasingly encounter and use large and complex data sets from multiple different sources. Statistical research in Bayesian phylogenetics has risen to this challenge. Researchers use phylogenetics not only to reconstruct the evolutionary history of a group of organisms, but also to understand the processes that guide its evolution and spread through space and time. To this end, it is now the norm to integrate numerous sources of data. For example, epidemiologists studying the spread of a virus through a region incorporate data including genetic sequences (e.g., DNA), time, location (both continuous and discrete), and environmental covariates (e.g., social connectivity between regions) into a coherent statistical model. Evolutionary biologists routinely do the same with genetic sequences, location, time, fossil and modern phenotypes, and ecological covariates. These complex, hierarchical models readily accommodate both discrete and continuous data and have enormous combined discrete/continuous parameter spaces including, at a minimum, phylogenetic tree topologies and branch lengths. The increased size and complexity of these statistical models have spurred advances in computational methods to make them tractable. We discuss both the modeling and computational advances, as well as unsolved problems and areas of active research.
Approximate Methods for Bayesian Computation
Radu V. Craiu, and Evgeny Levi
Vol. 10 (2023), pp. 379–399

Rich data generating mechanisms are ubiquitous in this age of information and require complex statistical models to draw meaningful inference. While Bayesian analysis has seen enormous development in the last 30 years, benefitting from the impetus given by the successful application of Markov chain Monte Carlo (MCMC) sampling, the combination of big data and complex models conspires to produce significant challenges for the traditional MCMC algorithms. We review modern algorithmic developments addressing the latter and compare their performance using numerical experiments.
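For readers new to the baseline algorithm whose scaling limits motivate this review, a random-walk Metropolis sampler for a toy one-dimensional posterior looks like the following; this is a generic textbook sketch, not one of the article's algorithms, and the data and prior are arbitrary.

```python
# Random-walk Metropolis: the baseline MCMC algorithm whose per-iteration cost scales
# with the full data set, motivating the approximate methods reviewed in the article.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)      # toy data; target is the posterior of the mean

def log_post(mu):
    """Unnormalized log posterior: N(0, 10^2) prior, normal likelihood with known unit variance."""
    return -0.5 * (mu / 10.0) ** 2 - 0.5 * np.sum((data - mu) ** 2)

n_iter, step = 5000, 0.2
chain = np.empty(n_iter)
mu, lp = 0.0, log_post(0.0)
for t in range(n_iter):
    prop = mu + step * rng.normal()                  # symmetric random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:         # Metropolis accept/reject step
        mu, lp = prop, lp_prop
    chain[t] = mu

print(chain[1000:].mean(), chain[1000:].std())       # posterior mean and sd after burn-in
```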
Simulation-Based Bayesian Analysis
Vol. 10 (2023), pp. 401–425

I consider the development of Markov chain Monte Carlo (MCMC) methods, from late-1980s Gibbs sampling to present-day gradient-based methods and piecewise-deterministic Markov processes. In parallel, I show how these ideas have been implemented in successive generations of statistical software for Bayesian inference. These software packages have been instrumental in popularizing applied Bayesian modeling across a wide variety of scientific domains. They provide an invaluable service to applied statisticians in hiding the complexities of MCMC from the user while providing a convenient modeling language and tools to summarize the output from a Bayesian model. As research into new MCMC methods remains very active, it is likely that future generations of software will incorporate new methods to improve the user experience.
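As a companion to the software history sketched here, the two-block Gibbs sampler for a normal model with conjugate priors — the kind of conditional update that early packages automated for the user — fits in a few lines. The sketch below uses standard conjugate updates with arbitrary hyperparameters and is not code from any of the packages discussed.

```python
# Two-block Gibbs sampler for y_i ~ N(mu, sigma^2) with conjugate priors
# mu ~ N(m0, s0^2) and sigma^2 ~ Inverse-Gamma(a0, b0); hyperparameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=2.0, size=200)      # toy data
n, ybar = y.size, y.mean()
m0, s0_sq, a0, b0 = 0.0, 100.0, 2.0, 2.0          # weakly informative hyperparameters

mu, sigma_sq = 0.0, 1.0
draws = []
for _ in range(5000):
    # Full conditional for mu given sigma^2: normal with precision-weighted mean
    prec = 1.0 / s0_sq + n / sigma_sq
    mean = (m0 / s0_sq + n * ybar / sigma_sq) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # Full conditional for sigma^2 given mu: inverse-gamma, sampled via 1/Gamma
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * np.sum((y - mu) ** 2)
    sigma_sq = 1.0 / rng.gamma(a_n, 1.0 / b_n)
    draws.append((mu, sigma_sq))

draws = np.array(draws)[1000:]                    # discard burn-in
print(draws.mean(axis=0))                         # posterior means of (mu, sigma^2)
```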
High-Dimensional Data Bootstrap
Vol. 10 (2023), pp. 427–449

This article reviews recent progress in high-dimensional bootstrap. We first review high-dimensional central limit theorems for distributions of sample mean vectors over the rectangles, bootstrap consistency results in high dimensions, and key techniques used to establish those results. We then review selected applications of high-dimensional bootstrap: construction of simultaneous confidence sets for high-dimensional vector parameters, multiple hypothesis testing via step-down, postselection inference, intersection bounds for partially identified parameters, and inference on best policies in policy evaluation. Finally, we also comment on a couple of future research directions.
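The simultaneous confidence sets mentioned here are typically built from a Gaussian multiplier bootstrap for the maximum coordinate of the scaled sample mean. The sketch below illustrates that construction on simulated data; it is a generic illustration in the spirit of this literature, not code from the article.

```python
# Gaussian multiplier bootstrap for the max statistic max_j |sqrt(n) * (mean_j - mu_j)|,
# yielding simultaneous confidence intervals for a high-dimensional mean vector.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.normal(size=(n, p))                 # data with true mean zero in every coordinate
xbar = X.mean(axis=0)
Xc = X - xbar                               # centered data

B, alpha = 1000, 0.05
boot_max = np.empty(B)
for b in range(B):
    e = rng.normal(size=n)                  # Gaussian multipliers
    boot_max[b] = np.max(np.abs(Xc.T @ e) / np.sqrt(n))

crit = np.quantile(boot_max, 1 - alpha)     # bootstrap critical value for the max statistic
lower = xbar - crit / np.sqrt(n)            # simultaneous (1 - alpha) confidence rectangle
upper = xbar + crit / np.sqrt(n)
print("covers the true mean vector:", bool(np.all(lower <= 0) and np.all(upper >= 0)))
```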
Innovation Diffusion Processes: Concepts, Models, and Predictions
Vol. 10 (2023), pp. 451–473

Innovation diffusion processes have attracted considerable research attention for their interdisciplinary character, which combines theories and concepts from disciplines such as mathematics, physics, statistics, social sciences, marketing, economics, and technological forecasting. The formal representation of innovation diffusion processes historically used epidemic models borrowed from biology, departing from the logistic equation, under the hypothesis that an innovation spreads in a social system through communication between people like an epidemic through contagion. This review integrates basic innovation diffusion models built upon the Bass model, primarily from the marketing literature, with a number of ideas from the epidemiological literature in order to offer a different perspective on innovation diffusion by focusing on critical diffusions, which are key for the progress of human communities. The article analyzes three key issues: barriers to diffusion, centrality of word-of-mouth, and the management of policy interventions to assist beneficial diffusions and to prevent harmful ones. We focus on deterministic innovation diffusion models described by ordinary differential equations.
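The Bass model referenced here is usually written as the ordinary differential equation below, where F(t) is the cumulative fraction of adopters, p the innovation (external influence) coefficient, and q the imitation (word-of-mouth) coefficient; this is the standard form of the model rather than a display taken from the article.

```latex
% Bass diffusion model: adoption rate driven by external influence p and imitation q
\frac{dF(t)}{dt} \;=\; \bigl(p + q\,F(t)\bigr)\bigl(1 - F(t)\bigr), \qquad F(0) = 0,
% with p > 0 the coefficient of innovation and q \ge 0 the coefficient of imitation.
```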
Graph-Based Change-Point Analysis
Vol. 10 (2023), pp. 475–499

Recent technological advances allow for the collection of massive data in the study of complex phenomena over time and/or space in various fields. Many of these data involve sequences of high-dimensional or non-Euclidean measurements, where change-point analysis is a crucial early step in understanding the data. Segmentation, or offline change-point analysis, divides data into homogeneous temporal or spatial segments, making subsequent analysis easier; its online counterpart detects changes in sequentially observed data, allowing for real-time anomaly detection. This article reviews a nonparametric change-point analysis framework that utilizes graphs representing the similarity between observations. This framework can be applied to data as long as a reasonable dissimilarity distance among the observations can be defined. Thus, it can be applied to a wide range of applications, from high-dimensional data to non-Euclidean data, such as imaging data or network data. In addition, analytic formulas can be derived to control the false discoveries, making these methods easy off-the-shelf data analysis tools.
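A stripped-down version of the graph-based idea: build a similarity graph (here, a minimum spanning tree on pairwise distances) and, for each candidate split point, count how many edges connect observations on opposite sides; unusually few cross edges suggest a change. The sketch below is a simplified illustration of the edge-count statistic only, omitting the scan standardization and the analytic p-value formulas developed in this literature.

```python
# Simplified graph-based change-point scan: count minimum-spanning-tree edges that
# cross each candidate split; a dip in cross-edge counts suggests a change point.
# Omits the proper standardization and analytic false-discovery control used in practice.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
n, d, tau = 120, 50, 60
X = rng.normal(size=(n, d))
X[tau:] += 0.6                                   # mean shift in high dimensions after time tau

D = squareform(pdist(X))                         # pairwise Euclidean distances
mst = minimum_spanning_tree(D).tocoo()           # similarity graph on the observations
edges = np.column_stack([mst.row, mst.col])      # (n - 1) edges, each a pair of node indices

candidates = range(20, n - 20)                   # avoid the very ends of the sequence
cross_counts = []
for t in candidates:
    # an edge "crosses" the split at t if exactly one endpoint has index < t
    cross_counts.append(int(np.sum((edges < t).sum(axis=1) == 1)))

t_hat = list(candidates)[int(np.argmin(cross_counts))]
print(f"estimated change point: {t_hat} (true: {tau})")
```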