Annual Review of Statistics and Its Application - Early Publication
Reviews in Advance appear online ahead of the full published volume.
1 - 20 of 22 results
A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models
Namjoon Suh and Guang Cheng. First published online: 21 November 2024.
In this article, we review the literature on statistical theories of neural networks from three perspectives: approximation, training dynamics, and generative models. In the first part, results on excess risks for neural networks are reviewed in the nonparametric framework of regression. These results rely on explicit constructions of neural networks, leading to fast convergence rates of excess risks. Nonetheless, their underlying analysis only applies to the global minimizer in the highly nonconvex landscape of deep neural networks. This motivates us to review the training dynamics of neural networks in the second part. Specifically, we review articles that attempt to answer the question of how a neural network trained via gradient-based methods finds a solution that can generalize well on unseen data. In particular, two well-known paradigms are reviewed: the neural tangent kernel and mean-field paradigms. Last, we review the most recent theoretical advancements in generative models, including generative adversarial networks, diffusion models, and in-context learning in large language models from two of the same perspectives, approximation and training dynamics.
Models and Rating Systems for Head-to-Head Competition
First published online: 20 November 2024.
One of the most important tasks in sports analytics is the development of binary response models for head-to-head game outcomes to estimate team and player strength. We discuss commonly used probability models for game outcomes, including the Bradley–Terry and Thurstone–Mosteller models, as well as extensions to ties as a third outcome and to the inclusion of a home-field advantage. We consider dynamic extensions to these models to account for the evolution of competitor strengths over time. Full likelihood-based analyses of these time-varying models can be simplified into rating systems, such as the Elo and Glicko rating systems. We present other modern rating systems, including popular methods for online gaming, and novel systems that have been implemented for online chess and Go. The discussion of the analytic methods is accompanied by examples of where these approaches have been implemented for various gaming organizations, as well as a detailed application to National Basketball Association game outcomes.
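As a concrete illustration of the models named in this abstract, the sketch below (not drawn from the article) computes a Bradley–Terry win probability and a single Elo rating update; the constants 400, 10, and K = 32 are the conventional chess values, used purely for illustration.

```python
def bradley_terry_prob(strength_i, strength_j):
    """P(i beats j) under the Bradley-Terry model with positive strength parameters."""
    return strength_i / (strength_i + strength_j)


def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo step: move A's rating toward the observed result (1 win, 0.5 tie, 0 loss)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    return rating_a + k * (score_a - expected_a)


print(bradley_terry_prob(2.0, 1.0))         # stronger player wins with probability 2/3
print(elo_update(1500, 1600, score_a=1.0))  # an upset win raises A's rating by about 20 points
```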
A Review of Reinforcement Learning in Financial Applications
Yahui Bai, Yuhe Gao, Runzhe Wan, Sheng Zhang, and Rui Song. First published online: 15 November 2024.
In recent years, there has been a growing trend of applying reinforcement learning (RL) in financial applications. This approach has shown great potential for decision-making tasks in finance. In this review, we present a comprehensive study of the applications of RL in finance and conduct a series of meta-analyses to investigate the common themes in the literature, such as the factors that most significantly affect RL's performance compared with traditional methods. Moreover, we identify challenges, including explainability, Markov decision process modeling, and robustness, that hinder the broader utilization of RL in the financial industry and discuss recent advancements in overcoming these challenges. Finally, we propose future research directions, such as benchmarking, contextual RL, multi-agent RL, and model-based RL to address these challenges and to further enhance the implementation of RL in finance.
Joint Modeling of Longitudinal and Survival Data
First published online: 14 November 2024.
In medical studies, time-to-event outcomes such as time to death or relapse of a disease are routinely recorded along with longitudinal data that are observed intermittently during the follow-up period. For various reasons, marginal approaches that model the event time and the longitudinal data separately tend to induce bias and lose efficiency. Instead, a joint modeling approach that brings the two types of data together can reduce or eliminate the bias and yield a more efficient estimation procedure. A well-established avenue for joint modeling is the joint likelihood approach that often produces semiparametric efficient estimators for the finite-dimensional parameter vectors in both models. Through a transformation survival model with an unspecified baseline hazard function, this review introduces joint modeling that accommodates both baseline covariates and time-varying covariates. The focus is on the major challenges faced by joint modeling and how they can be overcome. A review of available software implementations and a brief discussion of future directions of the field are also included.
Neural Methods for Amortized Inference
First published online: 12 November 2024.
Simulation-based methods for statistical inference have evolved dramatically over the past 50 years, keeping pace with technological advancements. The field is undergoing a new revolution as it embraces the representational capacity of neural networks, optimization libraries, and graphics processing units for learning complex mappings between data and inferential targets. The resulting tools are amortized, in the sense that, after an initial setup cost, they allow rapid inference through fast feed-forward operations. In this article we review recent progress in the context of point estimation, approximate Bayesian inference, summary-statistic construction, and likelihood approximation. We also cover software and include a simple illustration to showcase the wide array of tools available for amortized inference and the benefits they offer over Markov chain Monte Carlo methods. The article concludes with an overview of relevant topics and an outlook on future research directions.
Empirical Likelihood in Functional Data Analysis
First published online: 12 November 2024.
Functional data analysis (FDA) studies data that include infinite-dimensional functions or objects, generalizing traditional univariate or multivariate observations from each study unit. Among inferential approaches without parametric assumptions, empirical likelihood (EL) offers a principled method in that it extends the framework of parametric likelihood ratio–based inference via the nonparametric likelihood. There has been increasing use of EL in FDA due to its many favorable properties, including self-normalization and the data-driven shape of confidence regions. This article presents a review of EL approaches in FDA, starting with finite-dimensional features, then covering infinite-dimensional features. We contrast smooth and nonsmooth frameworks in FDA and show how EL has been incorporated into both of them. The article concludes with a discussion of some future research directions, including the possibility of applying EL to conformal inference.
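As a minimal illustration of the finite-dimensional starting point described above, the sketch below (not from the article) computes the empirical-likelihood ratio statistic for a population mean by solving for the Lagrange multiplier with a few Newton steps; the toy data and the absence of safeguarding are purely illustrative.

```python
import numpy as np


def el_log_ratio(x, mu0, iters=50):
    """-2 log EL ratio for the mean mu0; asymptotically chi-squared(1) under H0."""
    z = x - mu0
    lam = 0.0
    for _ in range(iters):                 # Newton steps for the Lagrange multiplier
        denom = 1.0 + lam * z
        grad = np.sum(z / denom)
        hess = -np.sum(z**2 / denom**2)
        lam -= grad / hess
    return 2.0 * np.sum(np.log1p(lam * z))


rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=200)   # toy data with true mean 2
print(el_log_ratio(x, mu0=2.0))            # compare with chi-squared(1) quantiles
```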
Excess Mortality Estimation
First published online: 12 November 2024.
Estimating the mortality associated with a specific mortality crisis event (for example, a pandemic, natural disaster, or conflict) is clearly an important public health undertaking. In many situations, deaths may be directly or indirectly attributable to the mortality crisis event, and both contributions may be of interest. The totality of the mortality impact on the population (direct and indirect deaths) includes the knock-on effects of the event, such as a breakdown of the health care system, or increased mortality due to shortages of resources. Unfortunately, estimating the deaths directly attributable to the event is frequently problematic. Hence, the excess mortality, defined as the difference between the observed mortality and that which would have occurred in the absence of the crisis event, is an estimation target. If the region of interest contains a functioning vital registration system, so that the mortality is fully observed and reliable, then the only modeling required is to produce the expected death counts, but this is a nontrivial exercise. In low- and middle-income countries it is common for there to be incomplete (or nonexistent) mortality data, and one must then use additional data and/or modeling, including predicting mortality using auxiliary variables. We describe and review each of these aspects, give examples of excess mortality studies, and provide a case study on excess mortality across states of the United States during the COVID-19 pandemic.
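The defining identity above — excess mortality equals observed deaths minus the expected deaths under a no-crisis baseline — can be made concrete with a toy computation; the counts and the naive five-year-average baseline below are made up for illustration and are not drawn from the case study, which would instead use a proper regression or time-series forecast with uncertainty quantification.

```python
import numpy as np

# Hypothetical monthly death counts for the same calendar month in five pre-crisis years.
pre_crisis_march = np.array([4110, 4230, 4050, 4190, 4160])
observed_march = 5320                       # hypothetical count in the crisis year

expected = pre_crisis_march.mean()          # crude baseline: pre-crisis average
excess = observed_march - expected          # excess mortality = observed - expected
relative_excess = excess / expected

print(f"expected={expected:.0f}, excess={excess:.0f}, relative excess={relative_excess:.1%}")
```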
Infectious Disease Modeling
First published online: 12 November 2024.
Infectious diseases pose a persistent challenge to public health worldwide. Recent global health crises, such as the COVID-19 pandemic and Ebola outbreaks, have underscored the vital role of infectious disease modeling in guiding public health policy and response. Infectious disease modeling is a critical tool for society, informing risk mitigation measures, prompting timely interventions, and aiding preparedness for healthcare delivery systems. This article synthesizes the current landscape of infectious disease modeling, emphasizing the integration of statistical methods in understanding and predicting the spread of infectious diseases. We begin by examining the historical context and the foundational models that have shaped the field, such as the SIR (susceptible, infectious, recovered) and SEIR (susceptible, exposed, infectious, recovered) models. Subsequently, we delve into the methodological innovations that have arisen, including stochastic modeling, network-based approaches, and the use of big data analytics. We also explore the integration of machine learning techniques in enhancing model accuracy and responsiveness. The review identifies the challenges of parameter estimation, model validation, and the incorporation of real-time data streams. Moreover, we discuss the ethical implications of modeling, such as privacy concerns and the communication of risk. The article concludes by discussing future directions for research, highlighting the need for data integration and interdisciplinary collaboration for advancing infectious disease modeling.
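As a pointer to the foundational SIR model mentioned in the abstract, here is a minimal deterministic simulation with a simple Euler time step; the transmission rate, recovery rate, population size, and horizon are illustrative values only, not estimates from any study.

```python
import numpy as np


def simulate_sir(beta=0.3, gamma=0.1, n=1_000_000, i0=10, days=200, dt=1.0):
    """Deterministic SIR trajectory (susceptible, infectious, recovered) via Euler steps."""
    s, i, r = n - i0, i0, 0.0
    history = []
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / n * dt   # mass-action transmission
        new_recoveries = gamma * i * dt          # exponential recovery
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return np.array(history)


traj = simulate_sir()
peak_day = traj[:, 1].argmax()
print(f"epidemic peaks on day {peak_day} with about {traj[peak_day, 1]:.0f} infectious")
```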
Tensors in High-Dimensional Data Analysis: Methodological Opportunities and Theoretical Challenges
Arnab Auddy, Dong Xia, and Ming Yuan. First published online: 12 November 2024.
Large amounts of multidimensional data represented by multiway arrays or tensors are prevalent in modern applications across various fields such as chemometrics, genomics, physics, psychology, and signal processing. The structural complexity of such data provides vast new opportunities for modeling and analysis, but efficiently extracting information content from them, both statistically and computationally, presents unique and fundamental challenges. Addressing these challenges requires an interdisciplinary approach that brings together tools and insights from statistics, optimization, and numerical linear algebra, among other fields. Despite these hurdles, significant progress has been made in the past decade. This review seeks to examine some of the key advancements and identify common threads among them, under a number of different statistical settings.
Designs for Vaccine Studies
First published online: 30 October 2024.
Due to dependent happenings, vaccines can have different effects in populations. In addition to direct protective effects in the vaccinated, vaccination in a population can have indirect effects in the unvaccinated individuals. Vaccination can also reduce person-to-person transmission to vaccinated individuals or from vaccinated individuals compared with unvaccinated individuals. Design of vaccine studies has a history extending back over a century. Emerging infectious diseases, such as the SARS-CoV-2 pandemic and the Ebola outbreak in West Africa, have stimulated new interest in vaccine studies. We focus on some recent developments, such as target trial emulation, test-negative design, and regression discontinuity design. Methods for evaluating durability of vaccine effects were developed in the context of both blinded and unblinded placebo crossover studies. The case-ascertained design is used to assess the transmission effects of vaccines. The novel ring vaccination trial design was first used in the Ebola outbreak in West Africa.
Causal Mediation Analysis for Integrating Exposure, Genomic, and Phenotype Data
First published online: 30 October 2024.
Causal mediation analysis provides an attractive framework for integrating diverse types of exposure, genomic, and phenotype data. Recently, this field has seen a surge of interest, largely driven by the increasing need for causal mediation analyses in health and social sciences. This article aims to provide a review of recent developments in mediation analysis, encompassing mediation analysis of a single mediator and a large number of mediators, as well as mediation analysis with multiple exposures and mediators. Our review focuses on the recent advancements in statistical inference for causal mediation analysis, especially in the context of high-dimensional mediation analysis. We delve into the complexities of testing mediation effects, especially addressing the challenge of testing a large number of composite null hypotheses. Through extensive simulation studies, we compare the existing methods across a range of scenarios. We also include an analysis of data from the Normative Aging Study, which examines DNA methylation CpG sites as potential mediators of the effect of smoking status on lung function. We discuss the pros and cons of these methods and future research directions.
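For readers new to the area, the single-mediator baseline that the reviewed high-dimensional methods build on can be summarized by the classical product-of-coefficients decomposition under linear models with no exposure–mediator interaction; the toy simulation below, with made-up coefficients, is a sketch of that baseline rather than of the article's methods.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
a_true, b_true, c_true = 0.6, 0.8, 0.3                 # made-up exposure->mediator, mediator->outcome, direct effects

exposure = rng.normal(size=n)
mediator = a_true * exposure + rng.normal(size=n)
outcome = c_true * exposure + b_true * mediator + rng.normal(size=n)

# Fit the two working regressions by least squares.
a_hat = np.polyfit(exposure, mediator, 1)[0]            # exposure -> mediator slope
X = np.column_stack([exposure, mediator, np.ones(n)])
c_hat, b_hat, _ = np.linalg.lstsq(X, outcome, rcond=None)[0]

print(f"indirect effect a*b ~ {a_hat * b_hat:.2f} (true {a_true * b_true:.2f}), "
      f"direct effect ~ {c_hat:.2f} (true {c_true:.2f})")
```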
A Statistical Viewpoint on Differential Privacy: Hypothesis Testing, Representation, and Blackwell's Theorem
First published online: 18 October 2024.
Differential privacy is widely considered the formal privacy framework for privacy-preserving data analysis due to its robust and rigorous guarantees, with increasingly broad adoption in public services, academia, and industry. Although differential privacy originated in the cryptographic context, in this review we argue that, fundamentally, it can be considered a pure statistical concept. We leverage Blackwell's informativeness theorem and focus on demonstrating that the definition of differential privacy can be formally motivated from a hypothesis testing perspective, thereby showing that hypothesis testing is not merely convenient but also the right language for reasoning about differential privacy. This insight leads to the definition of f-differential privacy, which extends other differential privacy definitions through a representation theorem. We review techniques that render f-differential privacy a unified framework for analyzing privacy bounds in data analysis and machine learning. Applications of this differential privacy definition to private deep learning, private convex optimization, shuffled mechanisms, and US Census data are discussed to highlight the benefits of analyzing privacy bounds under this framework compared with existing alternatives.
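As a minimal, concrete anchor for the privacy definitions discussed above, the sketch below implements the standard Laplace mechanism for a counting query (sensitivity 1), which satisfies ε-differential privacy; it is offered as background, not as the f-differential privacy machinery developed in the article.

```python
import numpy as np


def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with Laplace(sensitivity/epsilon) noise; epsilon-DP for sensitivity-1 queries."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)


rng = np.random.default_rng(0)
print(laplace_count(412, epsilon=1.0, rng=rng))   # noisy release of a true count of 412
```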
Reproducibility in the Classroom
First published online: 09 October 2024.
Difficulties in reproducing results from scientific studies have lately been referred to as a reproducibility crisis. Scientific practice depends heavily on scientific training. What gets taught in the classroom is often practiced in labs, fields, and data analysis. The importance of reproducibility in the classroom has gained momentum in statistics education in recent years. In this article, we review the existing literature on reproducibility education. We delve into the relationship between computing tools and reproducibility through visiting historical developments in this area. We share examples for teaching reproducibility and reproducible teaching while discussing the pedagogical opportunities created by these examples as well as challenges that the instructors should be aware of. We detail the use of teaching reproducibility and reproducible teaching practices in an introductory data science course. Lastly, we provide recommendations on reproducibility education for instructors, administrators, and other members of the scientific community.
Generalized Additive Models
First published online: 07 October 2024.
Generalized additive models are generalized linear models in which the linear predictor includes a sum of smooth functions of covariates, where the shape of the functions is to be estimated. They have also been generalized beyond the original generalized linear model setting to distributions outside the exponential family and to situations in which multiple parameters of the response distribution may depend on sums of smooth functions of covariates. The widely used computational and inferential framework in which the smooth terms are represented as latent Gaussian processes, splines, or Gaussian random effects is reviewed, paying particular attention to the case in which computational and theoretical tractability is obtained by prior rank reduction of the model terms. An empirical Bayes approach is taken, and its relatively good frequentist performance discussed, along with some more overtly frequentist approaches to model selection. Estimation of the degree of smoothness of component functions via cross validation or marginal likelihood is covered, alongside the computational strategies required in practice, including when data and models are reasonably large. It is briefly shown how the framework extends easily to location-scale modeling, and, with more effort, to techniques such as quantile regression. Also covered are the main classes of smooths of multiple covariates that may be included in models: isotropic splines and tensor product smooth interaction terms.
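A small penalized-spline sketch conveys the core idea of representing one smooth term with a basis plus a roughness penalty; the truncated-power basis, knot count, and fixed smoothing parameter below are illustrative choices of my own, whereas the article covers principled smoothness selection via cross validation or marginal likelihood.

```python
import numpy as np

# Toy data: a single smooth signal plus noise.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Truncated-power cubic basis with 20 interior knots.
knots = np.linspace(0.05, 0.95, 20)
basis = np.column_stack(
    [np.ones_like(x), x, x**2, x**3] + [np.clip(x - k, 0, None) ** 3 for k in knots]
)
penalty = np.diag([0.0] * 4 + [1.0] * knots.size)   # penalize only the knot coefficients

lam = 1e-3                                          # smoothing parameter, fixed for illustration
coef = np.linalg.solve(basis.T @ basis + lam * penalty, basis.T @ y)
fitted = basis @ coef
print("residual SD:", np.std(y - fitted).round(3))
```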
Hawkes Models and Their Applications
First published online: 01 October 2024.
The Hawkes process is a model for counting the number of arrivals to a system that exhibits the self-exciting property—that one arrival creates a heightened chance of further arrivals in the near future. The model and its generalizations have been applied in a plethora of disparate domains, though two particularly developed applications are in seismology and in finance. As the original model is elegantly simple, generalizations have been proposed that track marks for each arrival, are multivariate, have a spatial component, are driven by renewal processes, treat time as discrete, and so on. This article creates a cohesive review of the traditional Hawkes model and the modern generalizations, providing details on their construction and simulation algorithms, and giving key references to the appropriate literature for a detailed treatment.
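To make the self-exciting property concrete, the sketch below simulates a univariate Hawkes process with exponential kernel by Ogata-style thinning; the parameter values are illustrative, the intensity bound is deliberately conservative, and the code is not taken from the article.

```python
import numpy as np


def intensity(t, events, mu, alpha, beta):
    """Conditional intensity: mu + alpha * sum over past arrivals of exp(-beta * (t - t_i))."""
    return mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in events if ti < t)


def simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=100.0, seed=0):
    """Ogata-style thinning; alpha/beta < 1 keeps the process subcritical."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while True:
        # Conservative (but valid) upper bound on the intensity until the next candidate.
        lam_bar = intensity(t, events, mu, alpha, beta) + alpha
        t += rng.exponential(1.0 / lam_bar)
        if t > horizon:
            break
        if rng.uniform() <= intensity(t, events, mu, alpha, beta) / lam_bar:
            events.append(t)                # accept the candidate as a real arrival
    return np.array(events)


arrivals = simulate_hawkes()
print(f"{arrivals.size} arrivals on [0, 100]; stationary rate mu/(1 - alpha/beta) = 1.5")
```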
Statistics in Phonetics
First published online: 01 October 2024.
Phonetics is the scientific field concerned with the study of how speech is produced, heard, and perceived. It abounds with data, such as acoustic speech recordings, neuroimaging data, or articulatory data. In this article, we provide an introduction to different areas of phonetics (acoustic phonetics, sociophonetics, speech perception, articulatory phonetics, speech inversion, sound change, and speech technology), an overview of the statistical methods for analyzing their data, and an introduction to the signal processing methods commonly applied to speech recordings. A major transition in the statistical modeling of phonetic data has been the shift from fixed effects to random effects regression models, the modeling of curve data (for instance, via generalized additive mixed models or functional data analysis methods), and the use of Bayesian methods. This shift has been driven in part by the increased focus on large speech corpora in phonetics, which has arisen from machine learning methods such as forced alignment. We conclude by identifying opportunities for future research.
Identification and Inference with Invalid Instruments
First published online: 25 September 2024.
Instrumental variables (IVs) are widely used to study the causal effect of an exposure on an outcome in the presence of unmeasured confounding. IVs require an instrument, a variable that (a) is associated with the exposure, (b) has no direct effect on the outcome except through the exposure, and (c) is not related to unmeasured confounders. Unfortunately, finding variables that satisfy conditions b or c can be challenging in practice. This article reviews works where instruments may not satisfy conditions b or c, which we refer to as invalid instruments. We review identification and inference under different violations of b or c, specifically under linear models, nonlinear models, and heteroskedastic models. We conclude with an empirical comparison of various methods by reanalyzing the effect of body mass index on systolic blood pressure from the UK Biobank.
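As a baseline for the invalid-instrument settings reviewed, the toy simulation below (with made-up data-generating values) contrasts a simple two-stage least squares estimate under a valid instrument with the confounded ordinary least squares estimate; it is a sketch of the classical starting point, not of the article's methods.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
u = rng.normal(size=n)                       # unmeasured confounder
z = rng.normal(size=n)                       # instrument satisfying (a)-(c)
x = 0.8 * z + u + rng.normal(size=n)         # exposure
y = 0.5 * x + u + rng.normal(size=n)         # outcome; true causal effect 0.5

x_hat = z * (z @ x) / (z @ z)                # first stage: project exposure on instrument
beta_2sls = (x_hat @ y) / (x_hat @ x_hat)    # second stage (equals the Wald ratio here)
beta_ols = (x @ y) / (x @ x)                 # confounded benchmark

print(f"2SLS ~ {beta_2sls:.2f} (true 0.5), OLS ~ {beta_ols:.2f} (biased upward)")
```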
Measuring the Functioning Human Brain
First published online: 11 September 2024.
The emergence of functional magnetic resonance imaging (fMRI) marked a significant technological breakthrough in the real-time measurement of the functioning human brain in vivo. In part because of their 4D nature (three spatial dimensions and time), fMRI data have inspired a great deal of statistical development in the past couple of decades to address their unique spatiotemporal properties. This article provides an overview of the current landscape in functional brain measurement, with a particular focus on fMRI, highlighting key developments in the past decade. Furthermore, it looks ahead, discussing unresolved research questions in the community and outlining potential topics for future research.
High-Dimensional Gene–Environment Interaction Analysis
Mengyun Wu, Yingmeng Li, and Shuangge Ma. First published online: 11 September 2024.
Beyond the main genetic and environmental effects, gene–environment (G–E) interactions have been demonstrated to significantly contribute to the development and progression of complex diseases. Published analyses of G–E interactions have primarily used a supervised framework to model both low-dimensional environmental factors and high-dimensional genetic factors in relation to disease outcomes. In this article, we aim to provide a selective review of methodological developments in G–E interaction analysis from a statistical perspective. The three main families of techniques are hypothesis testing, variable selection, and dimension reduction, which lead to three general frameworks: testing-based, estimation-based, and prediction-based. Linear- and nonlinear-effects analysis, fixed- and random-effects analysis, marginal and joint analysis, and Bayesian and frequentist analysis are reviewed to facilitate the conduct of interaction analysis in a wide range of situations with various assumptions and objectives. Statistical properties, computations, applications, and future directions are also discussed.
A Theoretical Review of Modern Robust Statistics
First published online: 21 August 2024.
Robust statistics is a fairly mature field that dates back to the early 1960s, with many foundational concepts having been developed in the ensuing decades. However, the field has drawn a new surge of attention in the past decade, largely due to a desire to recast robust statistical principles in the context of high-dimensional statistics. In this article, we begin by reviewing some of the central ideas in classical robust statistics. We then discuss the need for new theory in high dimensions, using recent work in high-dimensional M-estimation as an illustrative example. Next, we highlight a variety of interesting recent topics that have drawn a flurry of research activity from both statisticians and theoretical computer scientists, demonstrating the need for further research in robust estimation that embraces new estimation and contamination settings, as well as a greater emphasis on computational tractability in high dimensions.
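As a reminder of the classical starting point mentioned above, the sketch below computes the Huber M-estimator of location by iteratively reweighted least squares with a MAD-based scale estimate; the contaminated toy sample and the tuning constant c = 1.345 are standard illustrative choices, not taken from the article.

```python
import numpy as np


def huber_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted least squares."""
    mu = np.median(x)                                # robust starting value
    scale = np.median(np.abs(x - mu)) / 0.6745       # MAD-based scale estimate
    for _ in range(max_iter):
        r = (x - mu) / scale
        w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))   # Huber weights
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu


rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 95), rng.normal(20, 1, 5)])  # 5% gross outliers
print(f"sample mean = {data.mean():.2f}, Huber estimate = {huber_location(data):.2f}")
```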