Annual Review of Statistics and Its Application - Volume 8, 2021
Modeling Player and Team Performance in Basketball
Vol. 8 (2021), pp. 1–23
In recent years, analytics has started to revolutionize the game of basketball: Quantitative analyses of the game inform team strategy; management of player health and fitness; and how teams draft, sign, and trade players. In this review, we focus on methods for quantifying and characterizing basketball gameplay. At the team level, we discuss methods for characterizing team strategy and performance, while at the player level, we take a deep look into a myriad of tools for player evaluation. This includes metrics for overall player value, defensive ability, and shot modeling, and methods for understanding performance over multiple seasons via player production curves. We conclude with a discussion on the future of basketball analytics and, in particular, highlight the need for causal inference in sports.
Graduate Education in Statistics and Data Science: The Why, When, Where, Who, and What
Vol. 8 (2021), pp. 25–39
Organizing a graduate program in statistics and data science raises many questions, offering a variety of opportunities while presenting a multitude of choices. The call for graduate programs in statistics and data science is overwhelming. How does it align with other (future) study programs at the secondary and postsecondary levels? What could or should be the natural home for data science in academia? Who meets the entry criteria, and who does not? Which strategic choices inevitably play a prominent role when developing a curriculum? We share our views on the why, when, where, who, and what.
Statistical Evaluation of Medical Tests
Vol. 8 (2021), pp. 41–67
In this review, we present an overview of the main aspects related to the statistical evaluation of medical tests for diagnosis and prognosis. Measures of diagnostic performance for binary tests, such as sensitivity, specificity, and predictive values, are introduced, and extensions to the case of continuous-outcome tests are detailed. Special focus is placed on the receiver operating characteristic (ROC) curve and its estimation, with emphasis on the topic of covariate adjustment. The extension to the case of time-dependent ROC curves for evaluating prognostic accuracy is also touched upon. We apply several of the approaches described to a data set derived from a study aimed at evaluating the ability of homeostasis model assessment of insulin resistance (HOMA-IR) levels to identify individuals at high cardio-metabolic risk and how such discriminatory ability might be influenced by age and gender. We also outline software available for the implementation of the methods.
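As a minimal, self-contained illustration of the binary-test measures named above (not drawn from the article itself), the following Python sketch computes sensitivity, specificity, and an empirical ROC curve with its area for a simulated continuous marker; all data and cutoff choices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated continuous marker: diseased subjects tend to have higher values.
y = np.concatenate([np.zeros(200), np.ones(100)])            # 0 = healthy, 1 = diseased
marker = np.concatenate([rng.normal(0, 1, 200), rng.normal(1, 1, 100)])

def sens_spec(marker, y, cutoff):
    """Sensitivity and specificity for the rule 'positive if marker >= cutoff'."""
    pred = marker >= cutoff
    sens = np.mean(pred[y == 1])           # true positive rate
    spec = np.mean(~pred[y == 0])          # true negative rate
    return sens, spec

# Empirical ROC curve: sweep the cutoff from +inf down through all observed values.
cutoffs = np.concatenate(([np.inf], np.sort(np.unique(marker))[::-1]))
roc = np.array([sens_spec(marker, y, c) for c in cutoffs])    # columns: (sens, spec)
fpr, tpr = 1 - roc[:, 1], roc[:, 0]

# Area under the ROC curve by the trapezoidal rule.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC ≈ {auc:.3f}")
```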
Simulation and Analysis Methods for Stochastic Compartmental Epidemic Models
Vol. 8 (2021), pp. 69–88
This article considers simulation and analysis of incidence data using stochastic compartmental models in well-mixed populations. Several simulation approaches are described and compared. Thereafter, we provide an overview of likelihood estimation for stochastic models. We apply one such method to a real-life outbreak data set and compare models assuming different kinds of stochasticity. We also give references for other publications where detailed information on this topic can be found.
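For readers unfamiliar with stochastic simulation of compartmental models, here is a minimal sketch of one standard approach, the Gillespie direct method for an SIR model in a well-mixed population; the parameter values are illustrative and the code is not taken from the article.

```python
import numpy as np

def gillespie_sir(beta, gamma, S0, I0, R0, t_max, seed=0):
    """Exact (Gillespie direct method) simulation of a stochastic SIR model
    in a well-mixed population of size N = S0 + I0 + R0."""
    rng = np.random.default_rng(seed)
    N = S0 + I0 + R0
    t, S, I, R = 0.0, S0, I0, R0
    path = [(t, S, I, R)]
    while t < t_max and I > 0:
        rate_inf = beta * S * I / N          # infection rate
        rate_rec = gamma * I                 # recovery rate
        total = rate_inf + rate_rec
        t += rng.exponential(1.0 / total)    # waiting time to the next event
        if rng.random() < rate_inf / total:  # which event occurs
            S, I = S - 1, I + 1
        else:
            I, R = I - 1, R + 1
        path.append((t, S, I, R))
    return np.array(path)

path = gillespie_sir(beta=0.3, gamma=0.1, S0=990, I0=10, R0=0, t_max=160)
# N - S_final = everyone ever infected, including the initial cases.
print("total ever infected:", int(path[-1, 2] + path[-1, 3]))
```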
Missing Data Assumptions
Vol. 8 (2021), pp. 89–107
I review assumptions about the missing-data mechanisms that underlie methods for the statistical analysis of data with missing values. I describe Rubin's original definition of missing at random (MAR), its motivation and criticisms, and his sufficient conditions for ignoring the missingness mechanism for likelihood-based, Bayesian, and frequentist inference. Related definitions, including missing completely at random, always MAR, always missing completely at random, and partially MAR, are also covered. I present a formal argument for weakening Rubin's sufficient conditions for frequentist maximum likelihood inference with precision based on the observed information. Some simple examples of MAR are described, together with an example where the missingness mechanism can be ignored even though MAR does not hold. Alternative approaches to statistical inference based on the likelihood function are reviewed, along with non-likelihood frequentist approaches, including weighted generalized estimating equations. Connections with the causal inference literature are also discussed. Finally, alternatives to Rubin's MAR definition are discussed, including informative missingness, informative censoring, and coarsening at random. The intent is to provide a relatively nontechnical discussion, although some of the underlying issues are challenging and touch on fundamental questions of statistical inference.
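For reference, a standard statement of the missing-data taxonomy discussed above, in notation that is ours rather than necessarily the article's: Y = (Y_obs, Y_mis) denotes the complete data, M the missingness indicators, and phi the parameters of the missingness mechanism.

```latex
\begin{align*}
  &\text{MCAR:} && f(M \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi) = f(M \mid \phi)
    \quad \text{for all } Y,\ \phi, \\
  &\text{MAR:}  && f(M \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi)
    = f(M \mid Y_{\mathrm{obs}}, \phi) \quad \text{for all } Y_{\mathrm{mis}},\ \phi.
\end{align*}
% Ignorability of the mechanism for likelihood-based and Bayesian inference
% additionally requires distinctness of the data-model parameters \theta
% and the mechanism parameters \phi.
```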
Consequences of Asking Sensitive Questions in Surveys
Vol. 8 (2021), pp. 109–127
I review selected articles from the survey methodology literature on the consequences of asking sensitive questions in censuses and surveys, using a total survey error (TSE) framework. I start with definitions of sensitive questions and move to examination of the impact of including sensitive questions on various sources of survey error—specifically, survey respondents’ willingness to participate in a survey (unit nonresponse), their willingness to respond to next rounds of interviews (wave nonresponse), their likelihood to provide an answer to sensitive questions after agreeing to participate in the survey (item nonresponse), and the accuracy of respondents’ answers to sensitive questions (measurement error). I also review the simultaneous impact of sensitive questions on multiple sources of error in survey estimates and discuss strategies to mitigate the impact of asking sensitive questions on measurement errors. I conclude with a summary and suggestions for future research.
Synthetic Data
Vol. 8 (2021), pp. 129–140
Demand for access to data, especially data collected using public funds, is ever growing. At the same time, concerns about the disclosure of the identities of and sensitive information about the respondents providing the data are making the data collectors limit the access to data. Synthetic data sets, generated to emulate certain key information found in the actual data and provide the ability to draw valid statistical inferences, are an attractive framework to afford widespread access to data for analysis while mitigating privacy and confidentiality concerns. The goal of this article is to provide a review of various approaches for generating and analyzing synthetic data sets, inferential justification, limitations of the approaches, and directions for future research.
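A deliberately simple sketch of the basic idea, assuming a single multivariate normal synthesizer and one fully synthetic copy; practical synthesizers, multiple released copies, and the associated combining rules for valid inference are more involved, and nothing here is specific to the article.

```python
import numpy as np

rng = np.random.default_rng(7)

# "Confidential" data: a bivariate sample (say, income and age); purely illustrative.
n = 1000
confidential = rng.multivariate_normal(mean=[50.0, 40.0],
                                       cov=[[100.0, 30.0], [30.0, 64.0]], size=n)

# Fit a simple parametric synthesizer to the confidential data, then release
# draws from the fitted model instead of the original records.
mu_hat = confidential.mean(axis=0)
Sigma_hat = np.cov(confidential, rowvar=False)
synthetic = rng.multivariate_normal(mu_hat, Sigma_hat, size=n)

# An analyst working only with the synthetic file recovers similar summaries.
print(confidential.mean(axis=0).round(2), synthetic.mean(axis=0).round(2))
print(np.corrcoef(confidential.T)[0, 1].round(3), np.corrcoef(synthetic.T)[0, 1].round(3))
```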
Algorithmic Fairness: Choices, Assumptions, and Definitions
Vol. 8 (2021), pp. 141–163
A recent wave of research has attempted to define fairness quantitatively. In particular, this work has explored what fairness might mean in the context of decisions based on the predictions of statistical and machine learning models. The rapid growth of this new field has led to wildly inconsistent motivations, terminology, and notation, presenting a serious challenge for cataloging and comparing definitions. This article attempts to bring much-needed order. First, we explicate the various choices and assumptions made—often implicitly—to justify the use of prediction-based decision-making. Next, we show how such choices and assumptions can raise fairness concerns and we present a notationally consistent catalog of fairness definitions from the literature. In doing so, we offer a concise reference for thinking through the choices, assumptions, and fairness considerations of prediction-based decision-making.
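As a concrete illustration of two widely used group-fairness definitions (demographic parity and equalized odds), the following sketch computes the corresponding gaps between two groups on toy data; the data, function name, and decision rule are hypothetical and not taken from the article's catalog.

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """Two common group-fairness criteria, expressed as gaps between groups 0 and 1.

    demographic parity gap : |P(Yhat=1 | A=0) - P(Yhat=1 | A=1)|
    equalized odds gap     : max over y in {0,1} of
                             |P(Yhat=1 | A=0, Y=y) - P(Yhat=1 | A=1, Y=y)|
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rate = lambda mask: y_pred[mask].mean()
    dp_gap = abs(rate(group == 0) - rate(group == 1))
    eo_gap = max(
        abs(rate((group == 0) & (y_true == y)) - rate((group == 1) & (y_true == y)))
        for y in (0, 1)
    )
    return dp_gap, eo_gap

# Toy data: outcomes Y, protected attribute A, and predictions that lean on A.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)
y_pred = ((y_true + rng.random(1000) + 0.1 * group) > 1.0).astype(int)
print(fairness_gaps(y_true, y_pred, group))
```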
Online Learning Algorithms
Vol. 8 (2021), pp. 165–190
Online learning is a framework for the design and analysis of algorithms that build predictive models by processing data one at a time. Besides being computationally efficient, online algorithms enjoy theoretical performance guarantees that do not rely on statistical assumptions on the data source. In this review, we describe some of the most important algorithmic ideas behind online learning and explain the main mathematical tools for their analysis. Our reference framework is online convex optimization, a sequential version of convex optimization within which most online algorithms are formulated. More specifically, we provide an in-depth description of online mirror descent and follow the regularized leader, two of the most fundamental algorithms in online learning. As the tuning of parameters is a typically difficult task in sequential data analysis, in the last part of the review we focus on coin-betting, an information-theoretic approach to the design of parameter-free online algorithms with good theoretical guarantees.
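A minimal sketch of online gradient descent, the Euclidean special case of online mirror descent, run on a toy stream of squared-loss rounds; the step-size schedule and data are illustrative and not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy data stream: (x_t, y_t) pairs with y_t ≈ <w_star, x_t>.
d, T = 5, 2000
w_star = rng.normal(size=d)
X = rng.normal(size=(T, d))
y = X @ w_star + 0.1 * rng.normal(size=T)

# Online gradient descent on the squared loss f_t(w) = 0.5 * (<w, x_t> - y_t)^2.
w = np.zeros(d)
regret_terms = []
for t in range(T):
    x_t, y_t = X[t], y[t]
    pred = w @ x_t
    grad = (pred - y_t) * x_t                   # gradient of f_t at the current w_t
    eta_t = 1.0 / np.sqrt(t + 1)                # decreasing step size
    regret_terms.append(0.5 * (pred - y_t) ** 2 - 0.5 * (w_star @ x_t - y_t) ** 2)
    w -= eta_t * grad                           # update after suffering the loss

print("average regret against w_star:", np.mean(regret_terms))  # shrinks as T grows
```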
Space-Time Covariance Structures and Models
Vol. 8 (2021), pp. 191–215
In recent years, interest has grown in modeling spatio-temporal data generated from monitoring networks, satellite imaging, and climate models. Under Gaussianity, the covariance function is core to spatio-temporal modeling, inference, and prediction. In this article, we review the various space-time covariance structures in which simplified assumptions, such as separability and full symmetry, are made to facilitate computation, and associated tests intended to validate these structures. We also review recent developments on constructing space-time covariance models, which can be separable or nonseparable, fully symmetric or asymmetric, stationary or nonstationary, univariate or multivariate, and in Euclidean spaces or on the sphere. We visualize some of the structures and models with visuanimations. Finally, we discuss inference for fitting space-time covariance models and describe a case study based on a new wind-speed data set.
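To make the notion of separability concrete, the following sketch builds a separable space-time covariance matrix as a Kronecker product of a purely spatial and a purely temporal exponential covariance; the site configuration and range parameters are hypothetical, not from the article's case study.

```python
import numpy as np

# A separable space-time covariance: C(h, u) = sigma2 * exp(-||h||/a_s) * exp(-|u|/a_t).
def separable_cov(space_dists, time_lags, sigma2=1.0, a_s=2.0, a_t=5.0):
    C_s = np.exp(-space_dists / a_s)           # purely spatial exponential covariance
    C_t = np.exp(-np.abs(time_lags) / a_t)     # purely temporal exponential covariance
    return sigma2 * np.kron(C_t, C_s)          # separability <=> Kronecker structure

# Toy configuration: 4 spatial sites on a line observed at 3 time points.
sites = np.array([0.0, 1.0, 2.0, 3.0])
times = np.array([0.0, 1.0, 2.0])
space_dists = np.abs(sites[:, None] - sites[None, :])
time_lags = times[:, None] - times[None, :]

Sigma = separable_cov(space_dists, time_lags)   # 12 x 12 covariance of the space-time field
print(Sigma.shape, bool(np.all(np.linalg.eigvalsh(Sigma) > 0)))   # valid (positive definite)
```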
Extreme Value Analysis for Financial Risk Management
Natalia Nolde and Chen Zhou
Vol. 8 (2021), pp. 217–240
This article reviews methods from extreme value analysis with applications to risk assessment in finance. It covers three main methodological paradigms: the classical framework for independent and identically distributed data with application to risk estimation for market and operational loss data, the multivariate framework for cross-sectional dependent data with application to systemic risk, and the methods for stationary serially dependent data applied to dynamic risk management. The article is addressed to statisticians with interest and possibly experience in financial risk management who are not familiar with extreme value analysis.
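As an illustration of the classical univariate framework, here is a hedged sketch of peaks-over-threshold estimation: a generalized Pareto distribution is fitted to exceedances of simulated losses and used to estimate a high quantile (Value-at-Risk). The threshold, data, and confidence level are illustrative, and the code assumes SciPy's genpareto distribution.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(3)

# Toy "loss" data with a heavy right tail (Student-t draws, illustrative only).
losses = rng.standard_t(df=4, size=5000)

# Peaks-over-threshold: model exceedances over a high threshold u with a GPD.
u = np.quantile(losses, 0.95)
exceedances = losses[losses > u] - u
xi, _, sigma = genpareto.fit(exceedances, floc=0)   # fix the GPD location at 0

# Tail quantile (Value-at-Risk) at level p via the standard POT formula
# VaR_p = u + (sigma/xi) * [((n/N_u) * (1 - p))**(-xi) - 1].
n, n_u, p = len(losses), len(exceedances), 0.99
var_p = u + (sigma / xi) * (((n / n_u) * (1 - p)) ** (-xi) - 1)
print(f"estimated 99% VaR: {var_p:.2f}  (empirical 99% quantile: {np.quantile(losses, p):.2f})")
```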
Sparse Structures for Multivariate Extremes
Vol. 8 (2021), pp. 241–270
Extreme value statistics provides accurate estimates for the small occurrence probabilities of rare events. While theory and statistical tools for univariate extremes are well developed, methods for high-dimensional and complex data sets are still scarce. Appropriate notions of sparsity and connections to other fields such as machine learning, graphical models, and high-dimensional statistics have only recently been established. This article reviews the new domain of research concerned with the detection and modeling of sparse patterns in rare events. We first describe the different forms of extremal dependence that can arise between the largest observations of a multivariate random vector. We then discuss the current research topics, including clustering, principal component analysis, and graphical modeling for extremes. Identification of groups of variables that can be concomitantly extreme is also addressed. The methods are illustrated with an application to flood risk assessment.
Compositional Data Analysis
Vol. 8 (2021), pp. 271–299
Compositional data are nonnegative data carrying relative, rather than absolute, information—these are often data with a constant-sum constraint on the sample values, for example, proportions or percentages summing to 1 or 100%, respectively. Ratios between components of a composition are important since they are unaffected by the particular set of components chosen. Logarithms of ratios (logratios) are the fundamental transformation in the ratio approach to compositional data analysis—all data thus need to be strictly positive, so that zero values present a major problem. Components that group together based on domain knowledge can be amalgamated (i.e., summed) to create new components, and this can alleviate the problem of data zeros. Once compositional data are transformed to logratios, regular univariate and multivariate statistical analysis can be performed, such as dimension reduction and clustering, as well as modeling. Alternative methodologies that come close to the ideals of the logratio approach are also considered, especially those that avoid the problem of data zeros, which is particularly acute in large bioinformatic data sets.
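A minimal sketch of the centered logratio (clr) transform, one member of the logratio family mentioned above, including the scale invariance that makes ratio-based analysis attractive; the pseudocount option is one simple (and imperfect) way to cope with zeros, and the example compositions are made up.

```python
import numpy as np

def clr(X, pseudocount=0.0):
    """Centered logratio (clr) transform of compositions (rows of X).
    All parts must be strictly positive; a small pseudocount is one crude
    zero-replacement strategy."""
    X = np.asarray(X, dtype=float) + pseudocount
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

# Three 4-part compositions (each row sums to 1).
comps = np.array([[0.10, 0.20, 0.30, 0.40],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.05, 0.05, 0.10, 0.80]])

print(clr(comps).round(3))
# Key property: clr values are unchanged if a composition is rescaled,
# because only the ratios between parts matter.
print(np.allclose(clr(comps), clr(100 * comps)))   # True
```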
Distance-Based Statistical Inference
Vol. 8 (2021), pp. 301–327
Statistical distances, divergences, and similar quantities have an extensive history and play an important role in the statistical and related scientific literature. This role shows up in estimation, where we often use estimators based on minimizing a distance. Distances also play a prominent role in hypothesis testing and in model selection. We review statistical distances that are often used in scientific work, present their properties, and show how they compare to each other. We discuss an approximation framework for model-based inference using statistical distances. Emphasis is placed on identifying in what sense and which statistical distances can be interpreted as loss functions and used for model assessment. We review a special class of distances, the class of quadratic distances, connect it with the classical goodness-of-fit paradigm, and demonstrate its use in the problem of assessing model fit. These methods can be used in analyzing very large samples.
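To illustrate estimation by minimizing a distance, the following sketch fits a normal model by minimizing a Cramér-von Mises-type distance between the model CDF and the empirical CDF; the data, starting values, and optimizer choice are assumptions for this example only, not the article's methodology.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(4)
x = np.sort(rng.normal(loc=2.0, scale=1.5, size=500))    # observed sample (sorted)
n = len(x)

def cvm_distance(params):
    """Cramér-von Mises-type distance between the empirical CDF and a
    Normal(mu, sigma) model CDF, evaluated at the order statistics."""
    mu, log_sigma = params
    F = stats.norm.cdf(x, loc=mu, scale=np.exp(log_sigma))
    ecdf_mid = (np.arange(1, n + 1) - 0.5) / n
    return np.sum((F - ecdf_mid) ** 2)

# Minimum distance estimation: choose the parameters minimizing the distance.
res = optimize.minimize(cvm_distance, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(f"minimum-distance estimates: mu ≈ {mu_hat:.2f}, sigma ≈ {sigma_hat:.2f}")
```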
A Review of Empirical Likelihood
Vol. 8 (2021), pp. 329–344
Empirical likelihood is a popular nonparametric analog of the usual parametric likelihood, inheriting many of the large-sample properties of the latter construct. This article presents a review of the empirical likelihood approach from its introduction 30 years ago, up to recent theoretical developments. Aspects of computation and connections between empirical likelihood and other likelihood-type quantities are also explored. The article ends with a discussion of some directions for future research.
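For orientation, here is the canonical empirical likelihood construction for a population mean, in our notation (not necessarily the article's): the profile likelihood ratio for mu based on X_1, ..., X_n, together with Owen's nonparametric analog of Wilks' theorem.

```latex
R(\mu) \;=\; \max\Bigl\{ \prod_{i=1}^{n} n p_i \;:\;
      p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1,\ \sum_{i=1}^{n} p_i X_i = \mu \Bigr\},
\qquad
-2 \log R(\mu_0) \;\xrightarrow{\;d\;}\; \chi^2_1
\quad \text{at the true mean } \mu_0 \text{ (finite variance),}
```

which yields nonparametric tests and confidence regions for the mean without specifying a parametric family.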
Tensors in Statistics
Xuan Bi, Xiwei Tang, Yubai Yuan, Yanqing Zhang, and Annie Qu
Vol. 8 (2021), pp. 345–368
This article provides an overview of tensors, their properties, and their applications in statistics. Tensors, also known as multidimensional arrays, are generalizations of matrices to higher orders and are useful data representation architectures. We first review basic tensor concepts and decompositions, and then we elaborate traditional and recent applications of tensors in the fields of recommender systems and imaging analysis. We also illustrate tensors for network data and explore the relations among interacting units in a complex network system. Some canonical tensor computational algorithms and available software libraries are provided for various tensor decompositions. Future research directions, including tensors in deep learning, are also discussed.
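A small NumPy sketch of the CP (CANDECOMP/PARAFAC) structure, one of the decompositions referred to above: a third-order tensor is assembled from factor matrices, and its mode-1 unfolding is checked against the Khatri-Rao identity. Dimensions and rank are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

# A rank-R CP tensor: T[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r].
I, J, K, R = 6, 5, 4, 2
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))

# Assemble the third-order tensor from its factor matrices.
T = np.einsum("ir,jr,kr->ijk", A, B, C)
print(T.shape)   # (6, 5, 4)

# Mode-1 unfolding satisfies T_(1) = A (C ⊙ B)^T, where ⊙ is the
# Khatri-Rao (columnwise Kronecker) product.
khatri_rao = np.einsum("kr,jr->kjr", C, B).reshape(K * J, R)
T_mode1 = T.transpose(0, 2, 1).reshape(I, K * J)
print(np.allclose(T_mode1, A @ khatri_rao.T))   # True
```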
Flexible Models for Complex Data with Applications
Vol. 8 (2021), pp. 369–391
Probability distributions are the building blocks of statistical modeling and inference. It is therefore of the utmost importance to know which distribution to use in what circumstances, as wrong choices will inevitably entail a biased analysis. In this article, we focus on circumstances involving complex data and describe the most popular flexible models for these settings. We focus on the following complex data: multivariate skew and heavy-tailed data, circular data, toroidal data, and cylindrical data. We illustrate the strength of flexible models on the basis of concrete examples and discuss major applications and challenges.
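As one example of a flexible family for skewed data, here is a sketch of Azzalini's skew-normal density in its standard parameterization; the example values are hypothetical, and the code is an illustration rather than the article's own material.

```python
import numpy as np
from scipy.stats import norm

def skew_normal_pdf(x, loc=0.0, scale=1.0, shape=0.0):
    """Azzalini's skew-normal density: (2/omega) * phi(z) * Phi(alpha * z),
    with z = (x - xi)/omega. shape = 0 recovers the ordinary normal density."""
    z = (x - loc) / scale
    return 2.0 / scale * norm.pdf(z) * norm.cdf(shape * z)

x = np.linspace(-4, 4, 9)
print(skew_normal_pdf(x, shape=0.0).round(4))   # symmetric: matches the N(0,1) density
print(skew_normal_pdf(x, shape=4.0).round(4))   # right-skewed variant
```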
Adaptive Enrichment Designs in Clinical Trials
Vol. 8 (2021), pp. 393–411
Adaptive enrichment designs for clinical trials may include rules that use interim data to identify treatment-sensitive patient subgroups, select or compare treatments, or change entry criteria. A common setting is a trial to compare a new biologically targeted agent to standard therapy. An enrichment design's structure depends on its goals, how it accounts for patient heterogeneity and treatment effects, and practical constraints. This article first covers basic concepts, including treatment-biomarker interaction, precision medicine, selection bias, and sequentially adaptive decision making, and briefly describes some different types of enrichment. Numerical illustrations are provided for qualitatively different cases involving treatment-biomarker interactions. Reviews are given of adaptive signature designs; a Bayesian design that uses a random partition to identify treatment-sensitive biomarker subgroups and assign treatments; and designs that enrich superior treatment sample sizes overall or within subgroups, make subgroup-specific decisions, or include outcome-adaptive randomization.
Quantile Regression for Survival Data
Vol. 8 (2021), pp. 413–437
Quantile regression offers a useful alternative strategy for analyzing survival data. Compared with traditional survival analysis methods, quantile regression allows for comprehensive and flexible evaluations of covariate effects on a survival outcome of interest while providing simple physical interpretations on the time scale. Moreover, many quantile regression methods enjoy easy and stable computation. These appealing features make quantile regression a valuable practical tool for delivering in-depth analyses of survival data. This article provides a review of a comprehensive set of statistical methods for performing quantile regression with different types of survival data. The review covers various survival scenarios, including randomly censored data, data subject to left truncation or censoring, competing risks and semicompeting risks data, and recurrent events data. Two real-world examples are presented to illustrate the utility of quantile regression for practical survival data analyses.
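As a reminder of the uncensored building block behind the methods reviewed above, the following sketch estimates a conditional median by minimizing the Koenker-Bassett check (pinball) loss on toy log survival times; it does not handle censoring, and all data and names are illustrative rather than taken from the article.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)

# Toy (uncensored) survival times on the log scale: log T = b0 + b1 * x + error.
n = 400
x = rng.uniform(0, 2, n)
logT = 1.0 + 0.5 * x + rng.gumbel(0, 0.4, n)
X = np.column_stack([np.ones(n), x])

def check_loss(beta, tau):
    """Koenker-Bassett check (pinball) loss; its minimizer estimates the
    conditional tau-th quantile of log T given x."""
    r = logT - X @ beta
    return np.sum(r * (tau - (r < 0)))

tau = 0.5   # median regression
beta_hat = minimize(check_loss, x0=np.zeros(2), args=(tau,), method="Nelder-Mead").x
print("median-regression coefficients:", beta_hat.round(3))
```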
Statistical Applications in Educational Measurement
Hua-Hua Chang, Chun Wang, and Susu Zhang
Vol. 8 (2021), pp. 439–461
Educational measurement assigns numbers to individuals based on observed data to represent individuals’ educational properties such as abilities, aptitudes, achievements, progress, and performance. The current review introduces a selection of statistical applications to educational measurement, ranging from classical statistical theory (e.g., Pearson correlation and the Mantel–Haenszel test) to more sophisticated models (e.g., latent variable, survival, and mixture modeling) and statistical and machine learning (e.g., high-dimensional modeling, deep and reinforcement learning). Three main subjects are discussed: evaluations for test validity, computer-based assessments, and psychometrics informing learning. Specific topics include item bias detection, high-dimensional latent variable modeling, computerized adaptive testing, response time and log data analysis, cognitive diagnostic models, and individualized learning.
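A small sketch of the two-parameter logistic (2PL) item response function, one of the latent variable models alluded to above, together with the likelihood of a single response pattern; the item parameters and ability value are made up for illustration.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic (2PL) item response function:
    P(correct | theta) = 1 / (1 + exp(-a * (theta - b))),
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

thetas = np.linspace(-3, 3, 7)                   # latent ability grid
print(irf_2pl(thetas, a=1.2, b=0.0).round(3))    # raising b makes the item harder

# Likelihood of one examinee's response pattern across three items.
a_items = np.array([0.8, 1.0, 1.5])
b_items = np.array([-1.0, 0.0, 1.0])
responses = np.array([1, 1, 0])                  # 1 = correct, 0 = incorrect
p = irf_2pl(0.3, a_items, b_items)
likelihood = np.prod(p**responses * (1 - p)**(1 - responses))
print(f"likelihood at theta = 0.3: {likelihood:.3f}")
```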