Annual Review of Statistics and Its Application
Volume 11, 2024
Communication of Statistics and Evidence in Times of Crisis
Vol. 11 (2024), pp. 1–26
This review provides an overview of concepts relating to the communication of statistical and empirical evidence in times of crisis, with a special focus on COVID-19. In it, we consider topics relating to both the communication of numbers, such as the role of format, context, comparisons, and visualization, and the communication of evidence more broadly, such as evidence quality, the influence of changes in available evidence, transparency, and repeated decision-making. A central focus is on the communication of the inherent uncertainties in statistical analysis, especially in rapidly changing informational environments during crises. We present relevant literature on these topics and draw connections to the communication of statistics and empirical evidence during the COVID-19 pandemic and beyond. We finish by suggesting some considerations for those faced with communicating statistics and evidence in times of crisis.
Role of Statistics in Detecting Misinformation: A Review of the State of the Art, Open Issues, and Future Research Directions
Vol. 11 (2024), pp. 27–50
With the evolution of social media, cyberspace has become the default medium for social media users to communicate, especially during high-impact events such as pandemics, natural disasters, terrorist attacks, and periods of political unrest. However, during such events, misinformation can spread rapidly on social media, affecting decision-making and creating social unrest. Identifying and curtailing the spread of misinformation during high-impact events are significant data challenges given the scarcity and variety of the data, the speed with which misinformation can propagate, and the fairness aspects associated with this societal problem. Recent statistical machine learning advances have shown promise for misinformation detection; however, key limitations still make this a significant challenge. These limitations relate to using representative and bias-free multimodal data and to the explainability, fairness, and reliable performance of a system that detects misinformation. In this article, we critically discuss the current state-of-the-art approaches that attempt to respond to these complex requirements and present major unsolved issues; future research directions; and the synergies among statistics, data science, and other sciences for detecting misinformation.
Stochastic Models of Rainfall
Vol. 11 (2024), pp. 51–74
Rainfall is the main input to most hydrological systems. To assess flood risk for a catchment area, hydrologists use models that require long series of subdaily, perhaps even subhourly, rainfall data, ideally from locations that cover the area. If historical data are not sufficient for this purpose, an alternative is to simulate synthetic data from a suitably calibrated model. We review stochastic models that have a mechanistic structure, intended to mimic physical features of the rainfall processes, and are constructed using stationary point processes. We describe models for temporal and spatial-temporal rainfall and consider how they can be fitted to data. We provide an example application using a temporal model and an illustration of data simulated from a spatial-temporal model. We discuss how these models can contribute to the simulation of future rainfall that reflects our changing climate.
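As a concrete illustration of the mechanistic, point-process-based structure described above, the following Python sketch simulates hourly rainfall from a Neyman-Scott rectangular pulses model. The parameter values and distributional choices are illustrative assumptions, not calibrated to any catchment or taken from the article.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_nsrp(t_max, lam=0.02, mean_cells=5.0, beta=0.1, eta=0.5, xi=2.0, dt=1.0):
    """Simulate a Neyman-Scott rectangular pulses rainfall series (hours, mm/h).

    lam        : storm arrival rate (Poisson process, storms per hour)
    mean_cells : mean number of rain cells per storm (Poisson-distributed here)
    beta       : rate of the exponential cell displacement from the storm origin
    eta        : rate of the exponential cell duration
    xi         : rate of the exponential cell intensity (mean intensity 1/xi mm/h)
    """
    grid = np.arange(0.0, t_max, dt)
    intensity = np.zeros_like(grid)
    n_storms = rng.poisson(lam * t_max)
    storm_origins = rng.uniform(0.0, t_max, n_storms)
    for s in storm_origins:
        n_cells = rng.poisson(mean_cells)
        starts = s + rng.exponential(1.0 / beta, n_cells)    # cell origins after the storm origin
        durations = rng.exponential(1.0 / eta, n_cells)      # rectangular pulse lengths
        depths = rng.exponential(1.0 / xi, n_cells)          # pulse intensities
        for t0, d, x in zip(starts, durations, depths):
            intensity[(grid >= t0) & (grid < t0 + d)] += x   # pulses superpose additively
    return grid, intensity

t, rain = simulate_nsrp(t_max=24 * 30)    # one month of hourly intensities
print(f"mean intensity: {rain.mean():.3f} mm/h, wet fraction: {(rain > 0).mean():.2f}")
```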
Maps: A Statistical View
Vol. 11 (2024), pp. 75–96
Maps provide a data framework for the statistical analysis of georeferenced data observations. Since the middle of the twentieth century, the field of spatial statistics has evolved to address key inferential questions relating to spatially defined data, yet many central statistical properties do not translate to spatially indexed and spatially correlated data, and the development of statistical inference for mapped data remains an active area of research. Rather than review statistical techniques, we review the different ways the maps of georeferenced data can influence statistical analysis, focusing especially on maps as data visualization, maps as data structures, and maps as statistics themselves, i.e., summaries of underlying patterns with accompanying uncertainty. The categories provide connections to disparate literatures addressing spatial analysis including data visualization, cartography, spatial statistics, and geography. We find that maps are integral to spatial analysis, from motivating questions to informing analytic methods and providing context for results.
Interpretable Machine Learning for Discovery: Statistical Challenges and Opportunities
Vol. 11 (2024), pp. 97–121
New technologies have led to vast troves of large and complex data sets across many scientific domains and industries. People routinely use machine learning techniques not only to process, visualize, and make predictions from these big data, but also to make data-driven discoveries. These discoveries are often made using interpretable machine learning, or machine learning models and techniques that yield human-understandable insights. In this article, we discuss and review the field of interpretable machine learning, focusing especially on the techniques, as they are often employed to generate new knowledge or make discoveries from large data sets. We outline the types of discoveries that can be made using interpretable machine learning in both supervised and unsupervised settings. Additionally, we focus on the grand challenge of how to validate these discoveries in a data-driven manner, which promotes trust in machine learning systems and reproducibility in science. We discuss validation both from a practical perspective, reviewing approaches based on data-splitting and stability, as well as from a theoretical perspective, reviewing statistical results on model selection consistency and uncertainty quantification via statistical inference. Finally, we conclude by highlighting open challenges in using interpretable machine learning techniques to make discoveries, including gaps between theory and practice for validating data-driven discoveries.
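To illustrate the data-splitting idea for validating data-driven discoveries, here is a minimal Python sketch on hypothetical synthetic data: features are selected with the lasso on one half of the sample and checked for stability on the other half. It is one simple instance of the validation strategies discussed, not a complete procedure.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                    # only three truly relevant features
y = X @ beta + rng.standard_normal(n)

# Split the data: discover features on one half, check stability on the other.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)
sel1 = set(np.flatnonzero(LassoCV(cv=5).fit(X1, y1).coef_ != 0))
sel2 = set(np.flatnonzero(LassoCV(cv=5).fit(X2, y2).coef_ != 0))

# Features selected on both halves are, informally, more trustworthy discoveries.
print("selected on split 1:", sorted(sel1))
print("selected on split 2:", sorted(sel2))
print("stable across splits:", sorted(sel1 & sel2))
```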
Causal Inference in the Social Sciences
Vol. 11 (2024), pp. 123–152
Knowledge of causal effects is of great importance to decision makers in a wide variety of settings. In many cases, however, these causal effects are not known to the decision makers and need to be estimated from data. This fundamental problem has been known and studied for many years in many disciplines. In the past thirty years, however, the amount of empirical as well as methodological research in this area has increased dramatically, and so has its scope. It has become more interdisciplinary, and the focus has been more specifically on methods for credibly estimating causal effects in a wide range of both experimental and observational settings. This work has greatly impacted empirical work in the social and biomedical sciences. In this article, I review some of this work and discuss open questions.
Variable Importance Without Impossible Data
Vol. 11 (2024), pp. 153–178
The most popular methods for measuring importance of the variables in a black-box prediction algorithm make use of synthetic inputs that combine predictor variables from multiple observations. These inputs can be unlikely, physically impossible, or even logically impossible. As a result, the predictions for such cases can be based on data very unlike any the black box was trained on. We think that users cannot trust an explanation of the decision of a prediction algorithm when the explanation uses such values. Instead, we advocate a method called cohort Shapley, which is grounded in economic game theory and uses only actually observed data to quantify variable importance. Cohort Shapley works by narrowing the cohort of observations judged to be similar to a target observation on one or more features. We illustrate it on an algorithmic fairness problem where it is essential to attribute importance to protected variables that the model was not trained on.
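The following Python sketch spells out the cohort Shapley idea on a toy example: the value of a feature subset is the average response over the cohort of observations similar to the target on those features, and Shapley weights combine the resulting differences. The similarity rule (within 0.5 of the target value), the synthetic data, and the use of observed responses rather than model predictions are assumptions made for illustration; the exact enumeration is feasible only for small numbers of features.

```python
import numpy as np
from itertools import combinations
from math import comb

def cohort_shapley(X, y, target, similar):
    """Exact cohort Shapley values for one target observation (small feature counts only).

    X       : (n, d) array of observed feature values
    y       : (n,) array of observed responses or model predictions
    target  : index of the observation being explained
    similar : similar(j, values, ref) -> boolean mask of rows similar to ref on feature j
    """
    n, d = X.shape

    def value(S):
        # Cohort = observations similar to the target on every feature in S; v(S) = cohort mean.
        mask = np.ones(n, dtype=bool)
        for j in S:
            mask &= similar(j, X[:, j], X[target, j])
        return y[mask].mean()

    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for size in range(d):
            w = 1.0 / (d * comb(d - 1, size))        # Shapley weight for coalitions of this size
            for S in combinations(others, size):
                phi[j] += w * (value(S + (j,)) - value(S))
    return phi

# Toy illustration with a "within 0.5 of the target value" similarity rule (an assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, 200)
sim = lambda j, col, ref: np.abs(col - ref) <= 0.5
print(cohort_shapley(X, y, target=0, similar=sim))
```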
Bayesian Inference for Misspecified Generative Models
Vol. 11 (2024), pp. 179–202
Bayesian inference is a powerful tool for combining information in complex settings, a task of increasing importance in modern applications. However, Bayesian inference with a flawed model can produce unreliable conclusions. This review discusses approaches to performing Bayesian inference when the model is misspecified, where, by misspecified, we mean that the analyst is unwilling to act as if the model is correct. Much has been written about this topic, and in most cases we do not believe that a conventional Bayesian analysis is meaningful when there is serious model misspecification. Nevertheless, in some cases it is possible to use a well-specified model to give meaning to a Bayesian analysis of a misspecified model, and we focus on such cases. Three main classes of methods are discussed: restricted likelihood methods, which use a model based on an insufficient summary of the original data; modular inference methods, which use a model constructed from coupled submodels, with some of the submodels correctly specified; and the use of a reference model to construct a projected posterior or predictive distribution for a simplified model considered to be useful for prediction or interpretation.
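As a toy illustration of the modular ("cut") idea, the sketch below uses two conjugate Gaussian submodules: the shared parameter is learned only from the trusted module's data, and the remaining parameter is then sampled conditionally on each draw, so the suspect module cannot feed back into the trusted one. The model, priors, and contamination are invented for the example and are not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: module 1 data Z inform theta; module 2 data Y inform eta but are contaminated.
theta_true, eta_true = 1.0, 0.5
Z = rng.normal(theta_true, 1.0, size=50)
Y = rng.normal(theta_true + eta_true, 1.0, size=50)
Y[:10] += 5.0                                   # contamination: module 2 is misspecified

def normal_posterior(data, prior_var=100.0, noise_var=1.0):
    """Posterior mean and variance of a normal mean with known noise variance and N(0, prior_var) prior."""
    prec = 1.0 / prior_var + len(data) / noise_var
    return data.sum() / noise_var / prec, 1.0 / prec

# Cut (modular) inference: theta is learned from Z alone; eta is then learned from Y
# conditionally on each theta draw, so feedback from the flawed module is blocked.
m_th, v_th = normal_posterior(Z)
theta_draws = rng.normal(m_th, np.sqrt(v_th), size=5000)
eta_draws = np.empty_like(theta_draws)
for i, th in enumerate(theta_draws):
    m_eta, v_eta = normal_posterior(Y - th)
    eta_draws[i] = rng.normal(m_eta, np.sqrt(v_eta))

print(f"cut posterior means: theta = {theta_draws.mean():.2f}, eta = {eta_draws.mean():.2f}")
```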
The Role of the Bayes Factor in the Evaluation of Evidence
Vol. 11 (2024), pp. 203–226
The use of the Bayes factor as a metric for the assessment of the probative value of forensic scientific evidence is largely supported by recommended standards in different disciplines. The application of Bayesian networks enables the consideration of problems of increasing complexity. The lack of a widespread consensus concerning key aspects of evidence evaluation and interpretation, such as the adequacy of a probabilistic framework for handling uncertainty or the manner in which conclusions about the strength of the evidence should be reported to a court, has meant that the role of the Bayes factor in the administration of criminal justice has come under increasing challenge in recent years. We review the many advantages the Bayes factor has as an approach to the evaluation and interpretation of evidence.
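For two simple propositions the Bayes factor reduces to a likelihood ratio, which the short sketch below computes for an invented glass refractive index example; the means and standard deviations are purely illustrative, not real forensic reference data.

```python
from scipy.stats import norm

# Measurement on a recovered glass fragment (illustrative numbers, not casework data).
y = 1.5185    # observed refractive index

# Hp: the fragment comes from the broken window at the scene (known mean, analytical error).
# Hd: the fragment comes from the background population of glass.
lik_hp = norm.pdf(y, loc=1.5180, scale=0.0004)
lik_hd = norm.pdf(y, loc=1.5150, scale=0.0040)

# With two simple propositions the Bayes factor is the likelihood ratio.
bf = lik_hp / lik_hd
print(f"Bayes factor (Hp vs. Hd): {bf:.1f}")
# The court's posterior odds are prior odds times bf; the prior odds are not the scientist's to set.
```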
Competing Risks: Concepts, Methods, and Software
Vol. 11 (2024), pp. 227–254
The role of competing risks in the analysis of time-to-event data is increasingly acknowledged. Software is readily available. However, confusion remains regarding the proper analysis: When and how do I need to take the presence of competing risks into account? Which quantities are relevant for my research question? How can they be estimated and what assumptions do I need to make? The main quantities in a competing risks analysis are the cause-specific cumulative incidence, the cause-specific hazard, and the subdistribution hazard. We describe their nonparametric estimation, give an overview of regression models for each of these quantities, and explain their difference in interpretation. We discuss the proper analysis in relation to the type of study question, and we suggest software in R and Stata. Our focus is on competing risks analysis in medical research, but methods can equally be applied in other fields like social science, engineering, and economics.
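A minimal Python sketch of the nonparametric cause-specific cumulative incidence estimator (Aalen-Johansen type) on synthetic data is given below; in practice one would use the dedicated R or Stata packages the article discusses.

```python
import numpy as np

def cumulative_incidence(time, status, cause):
    """Nonparametric cause-specific cumulative incidence (Aalen-Johansen type).

    time   : event or censoring times
    status : 0 = censored, 1, 2, ... = cause of the observed event
    cause  : the cause whose cumulative incidence is requested
    Returns (event_times, CIF evaluated at those times).
    """
    time, status = np.asarray(time, float), np.asarray(status, int)
    times = np.unique(time[status > 0])                  # distinct event times (any cause)
    surv, cif, out = 1.0, 0.0, []
    for t in times:
        n_at_risk = np.sum(time >= t)
        d_any = np.sum((time == t) & (status > 0))       # events of any cause at t
        d_cause = np.sum((time == t) & (status == cause))
        cif += surv * d_cause / n_at_risk                # mass leaving to this cause, weighted by S(t-)
        surv *= 1.0 - d_any / n_at_risk                  # update all-cause survival
        out.append(cif)
    return times, np.array(out)

# Synthetic example: cause 1 and cause 2 events plus independent censoring.
rng = np.random.default_rng(0)
t1, t2, c = rng.exponential(5, 300), rng.exponential(10, 300), rng.exponential(8, 300)
t = np.minimum(np.minimum(t1, t2), c)
status = np.where(t == c, 0, np.where(t1 <= t2, 1, 2))
times, cif1 = cumulative_incidence(t, status, cause=1)
print(f"estimated P(cause-1 event by t=5): {cif1[times <= 5][-1]:.2f}")
```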
Making Sense of Censored Covariates: Statistical Methods for Studies of Huntington's Disease
Vol. 11 (2024), pp. 255–277
The landscape of survival analysis is constantly being revolutionized to answer biomedical challenges, most recently the statistical challenge of censored covariates rather than outcomes. There are many promising strategies to tackle censored covariates, including weighting, imputation, maximum likelihood, and Bayesian methods. Still, this is a relatively fresh area of research, different from the areas of censored outcomes (i.e., survival analysis) or missing covariates. In this review, we discuss the unique statistical challenges encountered when handling censored covariates and provide an in-depth review of existing methods designed to address those challenges. We emphasize each method's relative strengths and weaknesses, providing recommendations to help investigators pinpoint the best approach to handling censored covariates in their data.
An Update on Measurement Error Modeling
Mushan Li and Yanyuan Ma
Vol. 11 (2024), pp. 279–296
The issues caused by measurement errors have been recognized for almost 90 years, and research in this area has flourished since the 1980s. We review some of the classical methods in both density estimation and regression problems with measurement errors. In both problems, we consider cases in which the original error-free model is parametric, nonparametric, or semiparametric, in combination with different error types. We also summarize and explain some new approaches, including recent developments and challenges in the high-dimensional setting.
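The sketch below illustrates one classical correction, regression calibration, under an additive (classical) measurement error model with the error variance assumed known; the data and the known-variance assumption are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(0.0, 1.0, n)                    # true covariate (unobserved)
U = rng.normal(0.0, 0.7, n)                    # classical additive measurement error
W = X + U                                      # what is actually observed
Y = 1.0 + 2.0 * X + rng.normal(0.0, 1.0, n)

def slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Naive regression of Y on W is attenuated toward zero.
print(f"naive slope:      {slope(W, Y):.2f}")

# Regression calibration: replace W by an estimate of E[X | W], assuming the error
# variance (0.7**2 here) is known, e.g. from replicate measurements or a validation study.
sigma2_u = 0.7 ** 2
lam = 1.0 - sigma2_u / np.var(W, ddof=1)       # reliability ratio
X_hat = W.mean() + lam * (W - W.mean())
print(f"calibrated slope: {slope(X_hat, Y):.2f}  (true value 2.0)")
```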
Relational Event Modeling
Vol. 11 (2024), pp. 297–319
Advances in information technology have increased the availability of time-stamped relational data, such as those produced by email exchanges or interaction through social media. Whereas the associated information flows could be aggregated into cross-sectional panels, the temporal ordering of the events frequently contains information that requires new models for the analysis of continuous-time interactions, subject to both endogenous and exogenous influences. The introduction of the relational event model (REM) has been a major development that has stimulated new questions and led to further methodological developments. In this review, we track the intellectual history of the REM, define its core properties, and discuss why and how it has been considered useful in empirical research. We describe how the demands of novel applications have stimulated methodological, computational, and inferential advancements.
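The following sketch writes down an ordinal-timing relational event likelihood with two simple endogenous statistics, inertia and reciprocity, and maximizes it numerically. The choice of statistics, the synthetic event sequence, and the optimizer are illustrative assumptions, not the full REM machinery reviewed in the article.

```python
import numpy as np
from scipy.optimize import minimize

def rem_neg_loglik(theta, events, n_actors):
    """Negative ordinal-timing REM log-likelihood with two endogenous statistics:
    inertia (count of past s -> r events) and reciprocity (count of past r -> s events)."""
    theta = np.asarray(theta)
    past = np.zeros((n_actors, n_actors))              # accumulated event counts per dyad
    ll = 0.0
    for s, r in events:
        stats = np.stack([past, past.T], axis=-1)      # (n, n, 2): inertia, reciprocity
        logrates = stats @ theta
        np.fill_diagonal(logrates, -np.inf)            # self-loops are not in the risk set
        ll += logrates[s, r] - np.log(np.exp(logrates).sum())   # observed dyad vs. risk set
        past[s, r] += 1.0                              # update the endogenous history
    return -ll

# Tiny synthetic event sequence among 5 actors (no real interaction data is used here).
rng = np.random.default_rng(0)
events = [tuple(rng.choice(5, size=2, replace=False)) for _ in range(200)]
fit = minimize(rem_neg_loglik, x0=np.zeros(2), args=(events, 5), method="BFGS")
print("estimated (inertia, reciprocity) effects:", np.round(fit.x, 2))
```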
Distributional Regression for Data Analysis
Vol. 11 (2024), pp. 321–346
The flexible modeling of an entire distribution as a function of covariates, known as distributional regression, has seen growing interest over the past decades in both the statistics and machine learning literature. This review outlines selected state-of-the-art statistical approaches to distributional regression, complemented with alternatives from machine learning. Topics covered include the similarities and differences between these approaches, extensions, properties and limitations, estimation procedures, and the availability of software. In view of the increasing complexity and availability of large-scale data, this review also discusses the scalability of traditional estimation methods, current trends, and open challenges. Illustrations are provided using data on childhood malnutrition in Nigeria and Australian electricity prices.
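A minimal example of distributional regression is a Gaussian location-scale model in which both the mean and the log standard deviation depend on the covariate. The sketch below fits such a model by maximum likelihood on synthetic data and reads off a conditional quantile; it stands in for the richer model classes and software discussed in the review.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 1, n)
# Both the mean and the spread of the response change with the covariate.
y = 1.0 + 2.0 * x + rng.normal(0, np.exp(-1.0 + 1.5 * x))

def neg_loglik(par):
    b0, b1, g0, g1 = par
    mu = b0 + b1 * x                       # location model
    sigma = np.exp(g0 + g1 * x)            # scale model (log link keeps sigma positive)
    return -norm.logpdf(y, loc=mu, scale=sigma).sum()

fit = minimize(neg_loglik, x0=np.zeros(4), method="BFGS")
b0, b1, g0, g1 = fit.x
print(f"location: {b0:.2f} + {b1:.2f} x    log-scale: {g0:.2f} + {g1:.2f} x")
# Any conditional quantile follows from the fitted distribution, not just the mean:
q90 = norm.ppf(0.9, loc=b0 + b1 * 0.8, scale=np.exp(g0 + g1 * 0.8))
print(f"estimated 90th percentile of y at x = 0.8: {q90:.2f}")
```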
Recent Advances in Text Analysis
Vol. 11 (2024), pp. 347–372
Text analysis is an active research area in data science, with applications in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods to MADStat leads to interesting findings. For example, we identified 11 representative topics in statistics. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. Furthermore, we propose a new statistical model for ranking the citation impacts of these 11 topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of statistical research from 1975 to 2015, from a text analysis perspective.
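As a small, self-contained stand-in for the pipeline described (MADStat itself and Topic-SCORE are not reproduced here), the sketch below fits a plain latent Dirichlet allocation model to a toy corpus with scikit-learn and inspects the topic-word and document-topic weights.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus of short "abstracts" standing in for a bibliographic data set like MADStat.
docs = [
    "lasso regression variable selection high dimensional regression",
    "penalized regression sparse estimation variable selection",
    "posterior sampling markov chain monte carlo bayesian inference",
    "bayesian hierarchical model posterior computation mcmc",
    "nonparametric bayesian posterior dirichlet process mixture",
    "high dimensional sparse regression inference selection",
]

# Document-term matrix, then a topic model (plain LDA here, not Topic-SCORE).
vec = CountVectorizer()
counts = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

for k, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[-4:][::-1]]   # four highest-weight words
    print(f"topic {k}:", ", ".join(top))
print("document-topic weights for the first document:", lda.transform(counts)[0].round(2))
```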
Shape-Constrained Statistical Inference
Vol. 11 (2024), pp. 373–391
Statistical models defined by shape constraints are a valuable alternative to parametric models or nonparametric models defined in terms of quantitative smoothness constraints. While the latter two classes of models are typically difficult to justify a priori, many applications involve natural shape constraints, for instance, monotonicity of a density or regression function. We review some of the history of this subject and recent developments, with special emphasis on algorithmic aspects, adaptivity, honest confidence bands for shape-constrained curves, and distributional regression, i.e., inference about the conditional distribution of a real-valued response given certain covariates.
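A canonical shape-constrained estimator is isotonic regression, which needs no bandwidth or basis choice; the brief sketch below fits it to synthetic data with scikit-learn, whose implementation uses the pool adjacent violators algorithm.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.log1p(5 * x) + rng.normal(0, 0.3, 200)    # true regression function is increasing

# The only assumption is monotonicity: no bandwidth or smoothness parameter to tune.
fit = IsotonicRegression(increasing=True).fit(x, y)
print("fitted values at x = 0.1, 0.5, 0.9:", np.round(fit.predict([0.1, 0.5, 0.9]), 2))
```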
Manifold Learning: What, How, and Why
Marina Meilă and Hanyu Zhang
Vol. 11 (2024), pp. 393–417
Manifold learning (ML), also known as nonlinear dimension reduction, is a set of methods to find the low-dimensional structure of data. Dimension reduction for large, high-dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high-dimensional point clouds and allow one to visualize, denoise, and interpret them. This review presents the underlying principles of ML, its representative methods, and their statistical foundations, all from a practicing statistician's perspective. It describes the trade-offs and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.
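The sketch below runs one representative manifold learning method, Isomap, on the standard swiss roll example: a neighborhood graph approximates geodesic distances, which are then embedded in two dimensions. The data set and neighborhood size are illustrative choices, not recommendations from the article.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3-D points that actually live on a 2-D curled sheet (the "swiss roll").
X, t = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Isomap: build a k-nearest-neighbor graph, approximate geodesic distances along it,
# then embed those distances in 2-D with classical multidimensional scaling.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# The first embedding coordinate should track the position along the roll (t).
corr = np.corrcoef(embedding[:, 0], t)[0, 1]
print(f"correlation between first Isomap coordinate and the roll parameter: {abs(corr):.2f}")
```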
Convergence Diagnostics for Entity Resolution
Vol. 11 (2024), pp. 419–435
Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo (MCMC) sampling is the primary computational method for approximate posterior inference in this setting, but due to the high dimensionality of the space of partitions, there are no agreed upon standards for diagnosing nonconvergence of MCMC sampling. In this article, we review Bayesian entity resolution, with a focus on the specific challenges that it poses for the convergence of a Markov chain. We review prior methods for convergence diagnostics, discussing their weaknesses. We provide recommendations for using MCMC sampling for Bayesian entity resolution, focusing on the use of modern diagnostics that are commonplace in applied Bayesian statistics. Using simulated data, we find that a commonly used Gibbs sampler performs poorly compared with two alternatives.
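In this setting convergence is typically monitored through scalar summaries of the sampled partitions. The sketch below computes a split R-hat for one such summary, the number of clusters per iteration, on invented traces (not the article's simulation) to show how poor mixing across chains is flagged.

```python
import numpy as np

def split_rhat(chains):
    """Split R-hat (Gelman-Rubin) for a scalar summary, one row per chain."""
    chains = np.asarray(chains, dtype=float)
    half = chains.shape[1] // 2
    sub = np.vstack([chains[:, :half], chains[:, half:2 * half]])   # split each chain in two
    m, n = sub.shape
    chain_means = sub.mean(axis=1)
    B = n * chain_means.var(ddof=1)                                  # between-chain variance
    W = sub.var(axis=1, ddof=1).mean()                               # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

# In entity resolution the MCMC state is a partition of the records, so we monitor scalar
# summaries of it; here, the number of distinct entities (clusters) per iteration.
# Hypothetical traces from four chains, two of which have not reached the same region.
rng = np.random.default_rng(0)
good = rng.poisson(95, size=(2, 1000))
stuck = rng.poisson(120, size=(2, 1000))
print("R-hat (two well-mixed chains):", round(split_rhat(good), 3))
print("R-hat (all four chains):     ", round(split_rhat(np.vstack([good, stuck])), 3))
```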
Geometric Methods for Cosmological Data on the Sphere
Vol. 11 (2024), pp. 437–460
This review is devoted to recent developments in the statistical analysis of spherical data, strongly motivated by applications in cosmology. We start from a brief discussion of cosmological questions and motivations, arguing that most cosmological observables are spherical random fields. Then, we introduce some mathematical background on spherical random fields, including spectral representations and the construction of needlet and wavelet frames. We then focus on some specific issues, including tools and algorithms for map reconstruction (i.e., separating the different physical components that contribute to the observed field), geometric tools for testing the assumptions of Gaussianity and isotropy, and multiple testing methods to detect contamination in the field due to point sources. Although these tools are introduced in the cosmological context, they can be applied to other situations dealing with spherical data. Finally, we discuss more recent and challenging issues, such as the analysis of polarization data, which can be viewed as realizations of random fields taking values in spin fiber bundles.
Inverse Problems for Physics-Based Process Models
Vol. 11 (2024), pp. 461–482
We describe and compare two formulations of inverse problems for a physics-based process model in the context of uncertainty and random variability: the Bayesian inverse problem and the stochastic inverse problem. We describe the foundations of the two problems in order to create a context for interpreting the applicability and solutions of inverse problems important for scientific and engineering inference. We conclude by comparing them to statistical approaches to related problems, including Bayesian calibration of computer models.
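A minimal Bayesian inverse problem: recover the decay-rate parameter of a simple physics-based forward model from noisy observations. With a single unknown, the posterior can be evaluated on a grid, as in the sketch below; the forward model, prior, and noise level are illustrative assumptions, not examples from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward physics model: exponential decay y(t) = y0 * exp(-k t), observed with noise.
def forward(k, t, y0=10.0):
    return y0 * np.exp(-k * t)

t_obs = np.linspace(0.5, 5.0, 10)
k_true, noise_sd = 0.8, 0.3
y_obs = forward(k_true, t_obs) + rng.normal(0, noise_sd, t_obs.size)

# Bayesian inverse problem: prior on k, Gaussian likelihood from the observation noise,
# posterior computed on a grid (feasible because there is a single unknown parameter).
k_grid = np.linspace(0.01, 3.0, 2000)
log_prior = -0.5 * ((k_grid - 1.0) / 0.5) ** 2               # N(1, 0.5^2) prior, up to a constant
resid = y_obs[None, :] - forward(k_grid[:, None], t_obs[None, :])
log_lik = -0.5 * (resid ** 2).sum(axis=1) / noise_sd ** 2
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum() * (k_grid[1] - k_grid[0])                 # normalize to a density on the grid

mean_k = np.sum(k_grid * post) * (k_grid[1] - k_grid[0])
print(f"posterior mean of the decay rate: {mean_k:.2f} (true value {k_true})")
```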
Analysis of Microbiome Data
Vol. 11 (2024), pp. 483–504
The microbiome represents a hidden world of tiny organisms populating not only our surroundings but also our own bodies. By enabling comprehensive profiling of these invisible creatures, modern genomic sequencing tools have given us an unprecedented ability to characterize these populations and uncover their outsize impact on our environment and health. Statistical analysis of microbiome data is critical to infer patterns from the observed abundances. The application and development of analytical methods in this area require careful consideration of the unique aspects of microbiome profiles. We begin this review with a brief overview of microbiome data collection and processing and describe the resulting data structure. We then provide an overview of statistical methods for key tasks in microbiome data analysis, including data visualization, comparison of microbial abundance across groups, regression modeling, and network inference. We conclude with a discussion and highlight interesting future directions.
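The sketch below illustrates two routine steps on a toy count table: a pseudocount plus centered log-ratio transform to respect the compositional nature of microbiome data, followed by a simple per-taxon group comparison. The simulated table, pseudocount, and choice of transform are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count table: 40 samples x 30 taxa, with taxon 0 enriched in the second group.
counts = rng.poisson(20, size=(40, 30))
group = np.repeat([0, 1], 20)
counts[group == 1, 0] += 60

# Microbiome counts are compositional (only relative abundances are meaningful) and contain
# zeros, so a common preprocessing step is a pseudocount plus the centered log-ratio (CLR)
# transform before ordinary multivariate methods are applied.
pseudo = counts + 0.5
clr = np.log(pseudo) - np.log(pseudo).mean(axis=1, keepdims=True)

# Simple per-taxon group comparison on the CLR scale (two-sample t statistics).
diff = clr[group == 1].mean(axis=0) - clr[group == 0].mean(axis=0)
se = np.sqrt(clr[group == 1].var(axis=0, ddof=1) / 20 + clr[group == 0].var(axis=0, ddof=1) / 20)
print("largest |t| statistic is for taxon", int(np.argmax(np.abs(diff / se))))
```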
Statistical Brain Network Analysis
Vol. 11 (2024), pp. 505–531
The recent fusion of network science and neuroscience has catalyzed a paradigm shift in how we study the brain and led to the field of brain network analysis. Brain network analyses hold great potential in helping us understand normal and abnormal brain function by providing profound clinical insight into links between system-level properties and health and behavioral outcomes. Nonetheless, methods for statistically analyzing networks at the group and individual levels have lagged behind. We have attempted to address this need by developing three complementary statistical frameworks—a mixed modeling framework, a distance regression framework, and a hidden semi-Markov modeling framework. These tools serve as synergistic fusions of statistical approaches with network science methods, providing needed analytic foundations for whole-brain network data. Here we delineate these approaches, briefly survey related tools, and discuss potential future avenues of research. We hope this review catalyzes further statistical interest and methodological development in the field.
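As a loose illustration of the general mixed modeling idea (not the authors' specific framework), the sketch below regresses a node-level network metric on group membership with a random intercept per subject, using statsmodels on invented data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Toy data: a node-level network metric (e.g., clustering coefficient) for 20 subjects
# (10 patients, 10 controls), 50 nodes each, with subject-specific random shifts.
n_sub, n_node = 20, 50
subject = np.repeat(np.arange(n_sub), n_node)
group = (subject < 10).astype(int)                    # 1 = patient, 0 = control
subj_effect = rng.normal(0, 0.05, n_sub)[subject]
metric = 0.30 - 0.04 * group + subj_effect + rng.normal(0, 0.08, n_sub * n_node)
df = pd.DataFrame({"metric": metric, "group": group, "subject": subject})

# Mixed model: fixed group effect on the network metric, random intercept per subject,
# which respects the fact that nodes within a subject are not independent observations.
fit = smf.mixedlm("metric ~ group", df, groups=df["subject"]).fit()
print(fit.summary())
```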
Distributed Computing and Inference for Big Data
Vol. 11 (2024), pp. 533–551
Data are often distributed across different sites because of computing facility limitations or data privacy considerations, so conventional centralized methods, in which all datasets are stored and processed in a central computing facility, are not applicable in practice. It has therefore become necessary to develop distributed learning approaches that have good inferential or predictive accuracy while keeping individual-level data at their sites or obeying policies and regulations that protect privacy. In this article, we introduce the basic idea of distributed learning and conduct a selective review of various distributed learning methods, which are categorized by their statistical accuracy, computational efficiency, heterogeneity, and privacy. This categorization can help evaluate newly proposed methods from different aspects. Moreover, we provide up-to-date descriptions of the existing theoretical results that cover statistical equivalency and computational efficiency under different statistical learning frameworks. Finally, we provide existing software implementations and benchmark datasets, and we discuss future research opportunities.
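One of the simplest distributed learning schemes is one-shot averaging of local estimates: each site fits its own model and only the fitted coefficients are communicated. The sketch below illustrates this for least squares on simulated sites; it ignores the refinements (heterogeneity, multiple communication rounds, formal privacy mechanisms) covered in the review.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_ols(X, y):
    """Each site computes its own least-squares estimate; only this short vector leaves the site."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulate K sites, each holding its own private data from the same linear model.
K, n_per_site, p = 10, 500, 5
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
sites = []
for _ in range(K):
    X = rng.standard_normal((n_per_site, p))
    y = X @ beta + rng.standard_normal(n_per_site)
    sites.append((X, y))

# One-shot (divide-and-conquer) aggregation: average the local estimates at the central node.
# Individual records never move; only p numbers per site are communicated.
beta_avg = np.mean([local_ols(X, y) for X, y in sites], axis=0)
print("averaged estimate:", beta_avg.round(2))
```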