Annual Review of Statistics and Its Application - Volume 6, 2019
Stephen Elliott Fienberg 1942–2016, Founding Editor of the Annual Review of Statistics and Its Application
Vol. 6 (2019), pp. 1–18
Stephen Elliott Fienberg was the founding editor of the Annual Review of Statistics and Its Application. Steve had an outsized personality and a passion for statistical science that was quite unique, and he combined these with his legendary energy to provide a remarkable level of leadership for the statistical science community, and a sweeping vision of the importance of statistical arguments for science, health and policy. The editorial team of the Annual Review of Statistics and Its Application is working hard to carry on his legacy for the journal. In this article we highlight some of his contributions through the voices of his students and collaborators. It is by no means a comprehensive assessment of his scholarship, but we hope it provides a window into his impact and influence on several generations of scholars. As Reid & Stigler (2017) wrote in Volume 4, “his lasting imprint on the science of statistics and its application defies simple categorization.”
Historical Perspectives and Current Directions in Hockey Analytics
Vol. 6 (2019), pp. 19–36
We review recent advances in hockey analytics research, most of which have occurred from the early 2000s to the present day. We discuss these advances in the context of earlier attempts to evaluate player performance in hockey. We survey the unique challenges of quantitatively summarizing the game of hockey, and how deficiencies in existing methods of evaluation shaped major avenues of research and the creation of new metrics. We present an extended analysis of the National Hockey League entry draft in terms of both retrospective evaluation and prospective strategy. We conclude with recommendations for future research in hockey analytics.
Experiments in Criminology: Improving Our Understanding of Crime and the Criminal Justice System
Vol. 6 (2019), pp. 37–61
Crime is costly, yet we understand little about it. The United States justice system costs $280 billion per year, but compared to other areas, such as medicine and agriculture, we have few answers for the field's fundamental questions, like what causes crime and how we can best use our justice system to respond to it. In addition, the success or failure of the justice system impacts our safety, freedoms, and trust in government. Criminologists are working to bridge this gap in knowledge using methods that are fundamentally statistical, including randomized designs, case-control studies, instrumental variables, and natural experiments. This review discusses how criminologists explore the police, courts, sentencing, and communities and their effect on crime using daylight saving time, natural disasters, coding errors, quirks in funding formulas, and other phenomena to simulate randomization. I include analyses of racial bias, police shootings, public defense, parolees, graffiti, vacant lots, and abandoned buildings. This review should encourage statisticians to bring their methods and expertise to bear on criminological questions, as the field needs broader and deeper scientific examination.
Using Statistics to Assess Lethal Violence in Civil and Inter-State War
Patrick Ball and Megan Price
Vol. 6 (2019), pp. 63–84
What role can statistics play in assessing the patterns of lethal violence in conflict? This article highlights the evolution of statistical applications in assessing lethal violence, from the presentation of data in the Nuremberg trials to current questions around machine learning and training data. We present examples from work conducted by our organization, the Human Rights Data Analysis Group, and others, primarily researching killings in the context of civil wars and international conflict. The primary challenge we encounter in this work is the question of whether observed patterns of violence represent the true underlying pattern or are a reflection of reports of violence, which are subject to many sources of bias. This is where we find the foundations of twentieth-century statistics to be most important: Is this sample representative? What methods are best suited to reduce the bias in nonprobability samples? These questions lead us to the approaches presented here: multiple systems estimation, surveys, complete data, and the question of bias within training data for machine learning models. We close with memories of Steve Fienberg's influence on these questions and on us personally. “It's all inference,” he told us, and that insight informs our concerns about bias in data used to create historical memory and advance justice in the wake of mass violence.
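As a toy illustration of the simplest multiple systems estimate mentioned in the abstract, the sketch below applies the two-list (Lincoln-Petersen) capture-recapture estimator to hypothetical counts. Real analyses of conflict mortality typically use several lists and must model dependence between them, which this sketch does not attempt.

```python
# Two-list capture-recapture with hypothetical counts (not data from the article).
n1 = 250   # victims documented on list 1
n2 = 180   # victims documented on list 2
m = 60     # victims appearing on both lists

n_hat = n1 * n2 / m                        # Lincoln-Petersen estimate of the total
documented = n1 + n2 - m                   # victims on at least one list
print(round(n_hat), round(n_hat - documented))   # ~750 total, ~380 never documented
```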
Differential Privacy and Federal Data Releases
Vol. 6 (2019), pp. 85–101
Federal statistics agencies strive to release data products that are informative for many purposes, yet also protect the privacy and confidentiality of data subjects’ identities and sensitive attributes. This article reviews the role that differential privacy, a disclosure risk criterion developed in the cryptography community, can and does play in federal data releases. The article describes potential benefits and limitations of using differential privacy for federal data, reviews current federal data products that satisfy differential privacy, and outlines research needed for adoption of differential privacy to become widespread among federal agencies.
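To make the criterion concrete, here is a minimal sketch of the Laplace mechanism, the canonical way to satisfy ε-differential privacy for a counting query; the count and ε values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count, epsilon):
    """Release a count under epsilon-differential privacy via the Laplace mechanism.
    A counting query has sensitivity 1 (one person changes it by at most 1),
    so Laplace noise with scale 1/epsilon suffices."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(dp_count(1832, epsilon=0.1))   # strong privacy, noisier release
print(dp_count(1832, epsilon=2.0))   # weaker privacy, closer to the true count
```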
Evaluation of Causal Effects and Local Structure Learning of Causal Networks
Zhi Geng, Yue Liu, Chunchen Liu, and Wang Miao
Vol. 6 (2019), pp. 103–124
Causal effect evaluation and causal network learning are two main research areas in causal inference. For causal effect evaluation, we review the two problems of confounders and surrogates. The Yule-Simpson paradox refers to the phenomenon that the association between two variables may change dramatically when confounders are ignored. We review criteria for confounders and methods of adjustment for observed and unobserved confounders. The surrogate paradox occurs when a treatment has a positive causal effect on a surrogate endpoint, which, in turn, has a positive causal effect on a true endpoint, but the treatment may nonetheless have a negative causal effect on the true endpoint. Some of the existing criteria for surrogates are subject to the surrogate paradox, and we review criteria for consistent surrogates that avoid it. Causal networks are used to depict the causal relationships among multiple variables. Rather than discovering a global causal network, researchers are often interested in discovering the causes and effects of a given variable. We review some algorithms for local structure learning of causal networks centered on a given variable.
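The reversal the Yule-Simpson paradox describes is easiest to see with numbers. The snippet below uses the well-known kidney-stone treatment data (success counts by stone size), in which treatment A is better within each stratum yet worse in the aggregate; the data are quoted only as a standard illustration, not as an example from the article.

```python
# Success counts / totals for two treatments, stratified by stone size.
small = {"A": (81, 87),   "B": (234, 270)}   # A: 93%, B: 87%  -> A better
large = {"A": (192, 263), "B": (55, 80)}     # A: 73%, B: 69%  -> A better

for trt in ("A", "B"):
    s_small, n_small = small[trt]
    s_large, n_large = large[trt]
    print(trt, round((s_small + s_large) / (n_small + n_large), 3))
# A 0.78, B 0.826: the ordering reverses because stone size (a confounder)
# is associated with both the treatment assigned and the success rate.
```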
Handling Missing Data in Instrumental Variable Methods for Causal Inference
Vol. 6 (2019), pp. 125–148
In instrumental variable studies, missing instrument data are very common. For example, in the Wisconsin Longitudinal Study, one can use genotype data as a Mendelian randomization–style instrument, but this information is often missing when subjects do not contribute saliva samples or when the genotyping platform output is ambiguous. Here we review missing-at-random assumptions one can use to identify instrumental variable causal effects, and discuss various approaches for estimation and inference. We consider likelihood-based methods, regression and weighting estimators, and doubly robust estimators. The likelihood-based methods yield the most precise inference and are optimal under the model assumptions, while the doubly robust estimators can attain the nonparametric efficiency bound while allowing flexible nonparametric estimation of nuisance functions (e.g., instrument propensity scores). The regression and weighting estimators can sometimes be easiest to describe and implement. Our main contribution is an extensive review of this wide array of estimators under varied missing-at-random assumptions, along with discussion of asymptotic properties and inferential tools. We also implement many of the estimators in an analysis of the Wisconsin Longitudinal Study, to study effects of impaired cognitive functioning on depression.
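As a rough illustration of the weighting idea, the sketch below computes a Wald-type instrumental variable estimate on simulated data, reweighting complete cases by the inverse of an estimated probability that the instrument is observed given a covariate. It is a toy example under an assumed missing-at-random mechanism, not one of the estimators studied in the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)                      # fully observed covariate
z = rng.binomial(1, 0.5, size=n)            # instrument (e.g., a genotype)
u = rng.normal(size=n)                      # unmeasured confounder
a = 0.8 * z + 0.5 * x + u + rng.normal(size=n)     # exposure
y = 1.5 * a + 0.5 * x + u + rng.normal(size=n)     # outcome; true effect is 1.5

r = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + x))))  # 1 if z is observed (MAR given x)

# Model P(R = 1 | X) and weight complete cases by its inverse
ps = LogisticRegression().fit(x.reshape(-1, 1), r).predict_proba(x.reshape(-1, 1))[:, 1]
cc, w = r == 1, 1 / ps[r == 1]

def wcov(a_, b_, w_):
    """Weighted covariance."""
    ma, mb = np.average(a_, weights=w_), np.average(b_, weights=w_)
    return np.average((a_ - ma) * (b_ - mb), weights=w_)

beta_iv = wcov(z[cc], y[cc], w) / wcov(z[cc], a[cc], w)   # weighted Wald-type estimate
print(round(beta_iv, 2))                                  # close to 1.5
```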
Nonprobability Sampling and Causal Analysis
Vol. 6 (2019), pp. 149–172
The long-standing approach of using probability samples in social science research has come under pressure through eroding survey response rates, advanced methodology, and easier access to large amounts of data. These factors, along with an increased awareness of the pitfalls of the nonequivalent comparison group design for the estimation of causal effects, have moved the attention of applied researchers away from issues of sampling and toward issues of identification. This article discusses the usability of samples with unknown selection probabilities for various research questions. In doing so, we review assumptions necessary for descriptive and causal inference and discuss research strategies developed to overcome sampling limitations.
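One commonly discussed strategy, sketched below under strong assumptions, is propensity-based pseudo-weighting: combine the nonprobability sample with a reference probability sample that shares covariates, model membership in the nonprobability sample, and weight its units by the estimated odds of reference-sample membership. The simulated data and model are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(size=N)                       # population covariate
y = 2.0 + x + rng.normal(size=N)             # outcome depends on x; population mean ~ 2.0

sel = rng.binomial(1, 1 / (1 + np.exp(-(x - 1)))).astype(bool)    # self-selection on x
x_np, y_np = x[sel], y[sel]                  # nonprobability sample (with outcome)
x_ref = x[rng.choice(N, 2000, replace=False)]                     # reference sample (x only)

# Model membership in the nonprobability sample vs. the reference sample
X = np.concatenate([x_np, x_ref]).reshape(-1, 1)
s = np.concatenate([np.ones(len(x_np)), np.zeros(len(x_ref))])
p = LogisticRegression().fit(X, s).predict_proba(x_np.reshape(-1, 1))[:, 1]

w = (1 - p) / p                              # pseudo-weights: odds of reference membership
print(round(y_np.mean(), 2))                 # naive mean, biased upward
print(round(np.average(y_np, weights=w), 2)) # adjusted mean, much closer to 2.0
```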
Agricultural Crop Forecasting for Large Geographical Areas
Vol. 6 (2019), pp. 173–196
Crop forecasting is important to national and international trade and food security. Although sample surveys continue to have a role in many national crop forecasting programs, the increasing challenges of list frame undercoverage, declining response rates, increasing response burden, and increasing costs are leading government agencies to replace some or all survey data with data from other sources. This article reviews the primary approaches currently being used to produce official statistics, including surveys, remote sensing, and the integration of these with meteorological, administrative, or other data. The research opportunities for improving current methods of forecasting crop yield and quantifying the uncertainty associated with the prediction are highlighted.
Statistical Models of Key Components of Wildfire Risk
Vol. 6 (2019), pp. 197–222
Fire danger systems have evolved from qualitative indices, to process-driven deterministic models of fire behavior and growth, to data-driven stochastic models of fire occurrence and simulation systems. However, there has often been little overlap or connectivity in these frameworks, and validation has not been common in deterministic models. Yet, marked increases in annual fire costs, losses, and fatality costs over the past decade draw attention to the need for better understanding of fire risk to support fire management decision making through the use of science-backed, data-driven tools. Contemporary risk modeling systems provide a useful integrative framework. This article discusses a variety of important contributions for modeling fire risk components over recent decades, certain key fire characteristics that have been overlooked, and areas of recent research that may enhance risk models.
An Overview of Joint Modeling of Time-to-Event and Longitudinal Outcomes
Vol. 6 (2019), pp. 223–240
In this review, we present an overview of joint models for longitudinal and time-to-event data. We introduce a generalized formulation for the joint model that incorporates multiple longitudinal outcomes of varying types. We focus on extensions for the parametrization of the association structure that links the longitudinal and time-to-event outcomes, estimation techniques, and dynamic predictions. We also outline the software available for the application of these models.
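For readers unfamiliar with the basic setup, a common shared-random-effects formulation (a linear mixed model linked to a proportional hazards model) looks as follows; this is the standard single-outcome case, not the generalized multivariate formulation introduced in the article.

```latex
% Longitudinal submodel: observed value = subject-specific trajectory + error
y_i(t) = m_i(t) + \varepsilon_i(t), \qquad
m_i(t) = \mathbf{x}_i^\top(t)\,\boldsymbol{\beta} + \mathbf{z}_i^\top(t)\,\mathbf{b}_i, \qquad
\mathbf{b}_i \sim N(\mathbf{0}, \mathbf{D}), \quad \varepsilon_i(t) \sim N(0, \sigma^2)

% Survival submodel: the hazard depends on the current value of the trajectory
h_i(t) = h_0(t) \exp\{\boldsymbol{\gamma}^\top \mathbf{w}_i + \alpha\, m_i(t)\}
% The association parameter \alpha links the two submodels; alternative
% parametrizations replace m_i(t) by, e.g., its slope or the random effects.
```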
Self-Controlled Case Series Methodology
Vol. 6 (2019), pp. 241–261
The self-controlled case series method is an epidemiological study design in which individuals act as their own control. Because comparisons are made within individuals, the method automatically controls for confounders that are constant over time, but it also has several limitations. This article outlines the self-controlled case series method and reviews methodological developments that address some of these limitations.
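As a brief reminder of the core model (standard in the self-controlled case series literature rather than specific to this article): the event rate for individual i in age group j and exposure-risk period k is modeled multiplicatively, and conditioning on each individual's total event count removes the individual-level nuisance parameter.

```latex
% Within-person Poisson model with multiplicative relative incidence
\lambda_{ijk} = \exp(\phi_i + \alpha_j + \beta_k)

% Conditioning on the total number of events n_i observed for individual i over
% the observation period eliminates \phi_i, leaving a multinomial likelihood in
% which the probability for interval (j, k) is proportional to its length
% e_{ijk} \exp(\alpha_j + \beta_k); exposure effects \beta_k are thus estimated
% entirely from within-person comparisons.
```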
Precision Medicine
Vol. 6 (2019), pp. 263–286
Precision medicine seeks to maximize the quality of health care by individualizing the health-care process to the uniquely evolving health status of each patient. This endeavor spans a broad range of scientific areas including drug discovery, genetics/genomics, health communication, and causal inference, all in support of evidence-based, i.e., data-driven, decision making. Precision medicine is formalized as a treatment regime that comprises a sequence of decision rules, one per decision point, which map up-to-date patient information to a recommended action. The potential actions could be the selection of which drug to use, the selection of dose, the timing of administration, the recommendation of a specific diet or exercise, or other aspects of treatment or care. Statistics research in precision medicine is broadly focused on methodological development for estimation of and inference for treatment regimes that maximize some cumulative clinical outcome. In this review, we provide an overview of this vibrant area of research and present important and emerging challenges.
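For a flavor of the regression-based approach at a single decision point, the sketch below fits an outcome model with a treatment-by-covariate interaction and derives the rule that recommends the treatment with the larger predicted outcome (a simple Q-learning-style estimator on simulated data; the data-generating model and variable names are invented for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)                 # patient covariate (e.g., a biomarker)
a = rng.binomial(1, 0.5, size=n)       # randomized treatment, 0 or 1
y = x + a * x + rng.normal(size=n)     # treatment helps when x > 0, harms when x < 0

# Outcome model with a treatment-by-covariate interaction (one-stage Q-function)
q = LinearRegression().fit(np.column_stack([x, a, a * x]), y)

def rule(x_new):
    """Recommend the treatment with the larger predicted outcome."""
    zeros, ones = np.zeros_like(x_new), np.ones_like(x_new)
    y0 = q.predict(np.column_stack([x_new, zeros, zeros]))
    y1 = q.predict(np.column_stack([x_new, ones, x_new]))
    return (y1 > y0).astype(int)

print(rule(np.array([-1.0, 1.0])))     # expect [0, 1]
```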
Sentiment Analysis
Vol. 6 (2019), pp. 287–308
Sentiment analysis labels a body of text as expressing either a positive or negative opinion, as in summarizing the content of an online product review. In this sense, sentiment analysis can be considered the challenge of building a classifier from text. Sentiment analysis can be done by counting the words from a dictionary of emotional terms, by fitting traditional classifiers such as logistic regression to word counts, or, most recently, by employing sophisticated neural networks. These methods progressively improve classification at the cost of increased computation and reduced transparency. The classification of IMDb (Internet Movie Database) movie reviews, a sentiment analysis task that appears frequently in the literature, is used to illustrate these methods.
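As a tiny, hypothetical stand-in for the IMDb task, the snippet below fits the middle option described above (logistic regression on word counts) with scikit-learn; the six "reviews" are invented and far too few for a real benchmark.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["a wonderful, moving film", "terrific acting and a great script",
           "dull plot and terrible pacing", "an awful, boring mess",
           "great fun from start to finish", "boring and badly written"]
labels = [1, 1, 0, 0, 1, 0]            # 1 = positive, 0 = negative

# Bag-of-words counts fed into a logistic regression classifier
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(["a great, moving script", "terrible and dull"]))   # expect [1 0]
```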
Statistical Methods for Naturalistic Driving Studies
Vol. 6 (2019), pp. 309–328
The naturalistic driving study (NDS) is an innovative research method characterized by the continuous recording of driving information using advanced instrumentation under real-world driving conditions. NDSs provide opportunities to assess driving risks that are difficult to evaluate using traditional crash database or experimental methods. NDS findings have profound impacts on driving safety research, safety countermeasure development, and public policy. However, NDSs also pose challenges for statistical analysis due to the sheer volume of data collected, their complex structure, and the high cost of information extraction. This article reviews statistical and analytical methods for working with NDS data. Topics include the characteristics of NDSs; NDS data components; and epidemiological approaches for video-based risk modeling, including case-cohort and case-crossover study designs, logistic models, Poisson models, and recurrent event models. The article also discusses several key issues related to NDS analysis, such as crash surrogates and alternative reference exposure levels.
Model-Based Learning from Preference Data
Vol. 6 (2019), pp. 329–354
Preference data occur when assessors express comparative opinions about a set of items by rating, ranking, pair comparing, liking, or clicking. The purpose of preference learning is to (a) infer the shared consensus preference of a group of users, sometimes called rank aggregation, or (b) estimate each user's individual ranking of the items when the user indicates only incomplete preferences; the latter is an important part of recommender systems. We provide an overview of probabilistic approaches to preference learning, including the Mallows, Plackett–Luce, and Bradley–Terry models and collaborative filtering, and some of their variations. We illustrate, compare, and discuss the use of these methods by means of an experiment in which assessors rank potatoes, and with a simulation. The purpose of this article is not to recommend one best method but to present a palette of different possibilities for different questions and different types of data.
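To illustrate one of the models mentioned above, the sketch below fits a Bradley–Terry model to a handful of made-up pair comparisons by expressing P(i beats j) as logistic(θ_i − θ_j) and estimating the item "worths" θ with a lightly ridge-penalized logistic regression without an intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pair comparisons among 4 items, recorded as (winner, loser)
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 1), (1, 0), (2, 3), (3, 2)]
n_items = 4

X, y = [], []
for winner, loser in pairs:
    row = np.zeros(n_items)
    row[winner], row[loser] = 1.0, -1.0
    X.append(row);  y.append(1)        # winner beat loser
    X.append(-row); y.append(0)        # mirrored copy so both labels are present

# Bradley-Terry: P(i beats j) = sigmoid(theta_i - theta_j)
bt = LogisticRegression(fit_intercept=False, C=10.0).fit(np.array(X), np.array(y))
print(bt.coef_.ravel().round(2))       # estimated worths (identified up to a constant)
```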
Finite Mixture Models
Vol. 6 (2019), pp. 355–378
The important role of finite mixture models in the statistical analysis of data is underscored by the ever-increasing rate at which articles on mixture applications appear in the statistical and general scientific literature. The aim of this article is to provide an up-to-date account of the theory and methodological developments underlying the applications of finite mixture models. Because of their flexibility, mixture models are being increasingly exploited as a convenient, semiparametric way in which to model unknown distributional shapes. This is in addition to their obvious applications where there is group structure in the data or where the aim is to explore the data for such structure, as in a cluster analysis. It has now been three decades since the publication of the monograph by McLachlan & Basford (1988) with an emphasis on the potential usefulness of mixture models for inference and clustering. Since then, mixture models have attracted the interest of many researchers and have found many new and interesting fields of application. Thus, the literature on mixture models has expanded enormously, and as a consequence, the bibliography here can only provide selected coverage.
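A minimal example of fitting a two-component Gaussian mixture by the EM algorithm, using scikit-learn on simulated data (the component parameters are chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
z = rng.binomial(1, 0.6, size=2000)                   # latent component labels
x = np.where(z == 1, rng.normal(3.0, 1.5, 2000),
                     rng.normal(-2.0, 1.0, 2000)).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(x)   # EM algorithm
print(gm.weights_.round(2))         # mixing proportions, roughly 0.4 and 0.6
print(gm.means_.ravel().round(2))   # component means, roughly -2 and 3 (order arbitrary)
print(gm.predict(x[:5]))            # implied clustering of the first few observations
```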
Approximate Bayesian Computation
Vol. 6 (2019), pp. 379–403
Many of the statistical models that could provide an accurate, interesting, and testable explanation for the structure of a data set turn out to have intractable likelihood functions. The method of approximate Bayesian computation (ABC) has become a popular approach for tackling such models. This review gives an overview of the method and the main issues and challenges that are the subject of current research.
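The basic rejection-ABC recipe is easy to state in code: draw parameters from the prior, simulate data, and keep the draws whose simulated summary statistic falls within a tolerance of the observed summary. The normal-mean example below is purely illustrative (the likelihood is of course tractable here).

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true = 2.0
x_obs = rng.normal(theta_true, 1.0, size=100)
s_obs = x_obs.mean()                               # observed summary statistic

def simulate_summary(theta, n=100):
    """Simulate a data set of size n and return its summary statistic."""
    return rng.normal(theta, 1.0, size=n).mean()

prior_draws = rng.normal(0.0, 5.0, size=50_000)    # N(0, 25) prior on theta
sims = np.array([simulate_summary(t) for t in prior_draws])
accepted = prior_draws[np.abs(sims - s_obs) < 0.05]   # tolerance epsilon = 0.05

print(len(accepted), accepted.mean().round(2))     # approximate posterior draws and mean
```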
Statistical Aspects of Wasserstein Distances
Vol. 6 (2019), pp. 405–431
Wasserstein distances are metrics on probability distributions inspired by the problem of optimal mass transportation. Roughly speaking, they measure the minimal effort required to reconfigure the probability mass of one distribution in order to recover the other distribution. They are ubiquitous in mathematics, with a long history that has seen them catalyze core developments in analysis, optimization, and probability. Beyond their intrinsic mathematical richness, they possess attractive features that make them a versatile tool for the statistician: They can be used to derive weak convergence and convergence of moments, and can be easily bounded; they are well-adapted to quantify a natural notion of perturbation of a probability distribution; and they seamlessly incorporate the geometry of the domain of the distributions in question, thus being useful for contrasting complex objects. Consequently, they frequently appear in the development of statistical theory and inferential methodology, and they have recently become an object of inference in themselves. In this review, we provide a snapshot of the main concepts involved in Wasserstein distances and optimal transportation, and a succinct overview of some of their many statistical aspects.
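For reference, the optimal-transport definition alluded to above can be written as follows, where Γ(μ, ν) denotes the set of couplings (joint distributions with marginals μ and ν) on a metric space (X, d):

```latex
W_p(\mu, \nu) \;=\; \left( \inf_{\gamma \in \Gamma(\mu, \nu)}
  \int_{X \times X} d(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p},
  \qquad p \ge 1
```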