Annual Review of Statistics and Its Application - Volume 4, 2017
p-Values: The Insight to Modern Statistical Inference
Vol. 4 (2017), pp. 1–14. I introduce a p-value function that derives from the continuity inherent in a wide range of regular statistical models. This provides confidence bounds and confidence sets, tests, and estimates that all reflect model continuity. The development starts with the scalar-variable, scalar-parameter exponential model, extends to the vector-parameter model with a scalar interest parameter, and then to general regular models; references are provided for testing vector interest parameters. The procedure does not use sufficiency but applies directly to general models, although it reproduces sufficiency-based results when sufficiency is present. The emphasis is on the coherence of the full procedure rather than on technical details.
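As a rough numerical illustration of a p-value (significance) function, not taken from the article itself: for an exponential sample the sum is Gamma distributed, so p(θ) can be evaluated directly and inverted to obtain confidence bounds. The sample and sample size below are simulated.

```python
# A minimal sketch (illustrative only): the p-value function p(theta) for the
# mean of an exponential sample, and a 95% confidence interval obtained by
# inverting it. The data are simulated; nothing here comes from the article.
import numpy as np
from scipy import stats
from scipy.optimize import brentq

rng = np.random.default_rng(1)
n, theta_true = 20, 2.0
y = rng.exponential(theta_true, n)
t_obs = y.sum()                      # sufficient statistic, Gamma(n, scale=theta)

def p_value(theta):
    """Significance function: P(T <= t_obs) under mean lifetime theta."""
    return stats.gamma.cdf(t_obs, a=n, scale=theta)

# Invert the p-value function at 0.975 and 0.025 for a central 95% interval.
lower = brentq(lambda th: p_value(th) - 0.975, 1e-6, 100.0)
upper = brentq(lambda th: p_value(th) - 0.025, 1e-6, 100.0)
print(f"95% CI for theta: ({lower:.2f}, {upper:.2f}); MLE {y.mean():.2f}")
```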
Curriculum Guidelines for Undergraduate Programs in Data Science
Richard D. De Veaux, Mahesh Agarwal, Maia Averett, Benjamin S. Baumer, Andrew Bray, Thomas C. Bressoud, Lance Bryant, Lei Z. Cheng, Amanda Francis, Robert Gould, Albert Y. Kim, Matt Kretchmar, Qin Lu, Ann Moskol, Deborah Nolan, Roberto Pelayo, Sean Raleigh, Ricky J. Sethi, Mutiara Sondjaja, Neelesh Tiruviluamala, Paul X. Uhlig, Talitha M. Washington, Curtis L. Wesley, David White, and Ping Ye. Vol. 4 (2017), pp. 15–30. The Park City Math Institute 2016 Summer Undergraduate Faculty Program met to compose guidelines for undergraduate programs in data science. The group consisted of 25 undergraduate faculty members from a variety of institutions in the United States, primarily from the disciplines of mathematics, statistics, and computer science. These guidelines are meant to provide some structure for institutions planning or revising a major in data science.
Risk and Uncertainty Communication
Vol. 4 (2017), pp. 31–60. This review briefly examines the vast range of techniques used to communicate risk assessments arising from statistical analysis. After discussing essential psychological and sociological issues, I focus on individual health risks and relevant research on communicating numbers, verbal expressions, and graphics, and on conveying deeper uncertainty. I then consider practice in a selection of diverse case studies, including gambling, the benefits and risks of pharmaceuticals, weather forecasting, natural hazards, climate change, environmental exposures, security and intelligence, industrial reliability, and catastrophic national and global risks. There are some tentative final conclusions, but the primary message is to acknowledge expert guidance, be clear about objectives, and work closely with intended audiences.
Exposed! A Survey of Attacks on Private Data
Vol. 4 (2017), pp. 61–84. Privacy-preserving statistical data analysis addresses the general question of protecting privacy when publicly releasing information about a sensitive dataset. A privacy attack takes seemingly innocuous released information and uses it to discern the private details of individuals, thus demonstrating that such information compromises privacy. For example, re-identification attacks have shown that it is easy to link supposedly de-identified records to the identity of the individual concerned. This survey focuses on attacking aggregate data, such as statistics about how many individuals have a certain disease, genetic trait, or combination thereof. We consider two types of attacks: reconstruction attacks, which approximately determine a sensitive feature of all the individuals covered by the dataset, and tracing attacks, which determine whether or not a target individual's data are included in the dataset. We also discuss techniques from the differential privacy literature for releasing approximate aggregate statistics while provably thwarting any privacy attack.
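As a generic illustration of the last point, not drawn from the survey: the standard Laplace mechanism from the differential privacy literature releases a noisy count whose noise is scaled to the query's sensitivity. The count and privacy parameters below are hypothetical.

```python
# A minimal sketch of the Laplace mechanism for a counting query; the data and
# epsilon values are hypothetical, and this is not code from the survey.
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one individual
    changes it by at most 1), so Laplace(0, 1/epsilon) noise suffices.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 124              # e.g., number of individuals with a given trait
for eps in (0.1, 1.0, 10.0):  # smaller epsilon = stronger privacy, noisier release
    print(eps, round(laplace_count(true_count, eps), 1))
```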
The Evolution of Data Quality: Understanding the Transdisciplinary Origins of Data Quality Concepts and Approaches
Vol. 4 (2017), pp. 85–108. Data, and hence data quality, transcend all boundaries of science, commerce, engineering, medicine, public health, and policy. Data quality has historically been addressed by controlling the measurement processes, controlling the data collection processes, and through data ownership. For many of the data sources now being leveraged in data science, this approach to data quality may be challenged. To understand that challenge, this article presents a historical and disciplinary perspective on data quality, highlighting the evolution and convergence of data concepts and applications.
Is Most Published Research Really False?
Vol. 4 (2017), pp. 109–122. There has been increasing concern in both the scientific and lay communities that most published medical findings are false. But what does it mean to be false? Here we describe the range of definitions of false discoveries in the scientific literature. We summarize the philosophical, statistical, and experimental evidence for each type of false discovery. We discuss common underlying problems with scientific and data-analytic practices and point to tools and behaviors that can be implemented to reduce the problems with published scientific results.
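For context, a standard back-of-the-envelope calculation (not taken from this article) relates the proportion of false findings to the prior probability that a tested hypothesis is true, the power, and the significance level; the numbers below are hypothetical.

```python
# A minimal worked example of the usual positive-predictive-value argument in
# this debate; the prior, power, and alpha values are hypothetical.
def ppv(prior, power, alpha):
    """P(finding is true | result declared statistically significant)."""
    true_pos = power * prior
    false_pos = alpha * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

# With 1 in 10 tested hypotheses true, 80% power, and alpha = 0.05, about 36%
# of "significant" findings would be false.
print(round(1 - ppv(prior=0.10, power=0.80, alpha=0.05), 2))
```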
Understanding and Assessing Nutrition
Vol. 4 (2017), pp. 123–146. Most countries collect short-term food consumption information of individuals on a regular basis. These data, after much analysis and interpretation, are used to assess the nutritional status of population subgroups, design food assistance programs, guide nutritional and food policy, and, in epidemiological applications, uncover associations between diet and health. In this review, we focus on surveillance, a broad term that includes, for example, estimation of nutritional status and evaluation of the adequacy of the diet. From a statistical viewpoint, dietary intake and evaluation questions pose tremendous methodological challenges. Nutrient and food adequacy are defined in terms of long-term intakes, yet we can only practically observe short-term consumption, perhaps over one or two days. Food consumption measurements are noisy and subject to both systematic and random error, and in addition, there are very large day-to-day differences in a person's food consumption. Observed distributions of food and nutrient intake tend to be skewed, with long tails to the right and (in the case of episodically consumed items) with a mass at zero that can represent a large proportion of the distribution. We review the literature on this topic and describe some of the newest questions and proposed methodological solutions. The focus is on the use of large national food consumption surveys to address public policy and research questions, but much of what we discuss is applicable in a broader context.
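As a rough illustration, not taken from the article, of why short-term measurements complicate surveillance: large day-to-day variation makes the distribution of single-day intakes much wider than the usual-intake distribution of interest. All distributions and parameters below are invented.

```python
# A minimal sketch of the within-/between-person variation problem for dietary
# intake; the lognormal distributions and parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
usual = rng.lognormal(mean=np.log(60), sigma=0.3, size=n)       # long-term intake
day_error = rng.lognormal(mean=-0.6**2 / 2, sigma=0.6, size=n)  # mean-1 daily error
one_day = usual * day_error                                     # one 24-hour recall

# The one-day distribution is far more spread out, so naive use of one-day
# percentiles misstates the proportion with low or high long-term intake.
for name, x in [("usual intake ", usual), ("one-day intake", one_day)]:
    print(name, round(np.percentile(x, 5), 1), round(np.percentile(x, 95), 1))
```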
Hazard Rate Modeling of Step-Stress Experiments
Maria Kateri and Udo Kamps. Vol. 4 (2017), pp. 147–168. Step-stress models form an essential part of accelerated life testing procedures. Under a step-stress model, the test units are exposed to stress levels that increase at intermediate time points of the experiment. The goal is to develop statistical inference for, e.g., the mean lifetime under each stress level, with the aim of extrapolating to normal operating conditions. This is achieved through an appropriate link function that connects each stress level to the associated mean lifetime. The assumptions made about the time points of stress level change, the termination point of the experiment, the underlying lifetime distributions, the type of censoring (if present), and the way of monitoring lead to alternative models. Step-stress models can be designed for single or multiple samples. We discuss recent developments in designing and analyzing step-stress models based on hazard rates. The inference approach adopted is mainly maximum likelihood, but Bayesian approaches are briefly discussed.
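As a bare-bones sketch, not taken from the article, of the link-function idea for a simple step-stress experiment: exponential lifetimes under a cumulative-exposure assumption, with a log-linear link between stress level and mean lifetime, fitted by maximum likelihood. All parameter values are hypothetical and censoring is ignored.

```python
# A minimal simple step-stress sketch: two stress levels, exponential lifetimes,
# cumulative exposure, log-linear link log(theta_i) = a + b * s_i. Hypothetical
# parameters; no censoring, unlike most designs discussed in the article.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
s = np.array([1.0, 2.0])             # the two stress levels (arbitrary units)
tau = 5.0                            # time at which the stress is increased
a_true, b_true = 3.0, -1.0           # hypothetical link parameters
theta = np.exp(a_true + b_true * s)  # mean lifetimes under the two stress levels

# Simulate lifetimes under the cumulative-exposure assumption.
u = rng.exponential(1.0, 500)        # standardized unit-mean exponential lifetimes
t = np.where(u <= tau / theta[0],
             u * theta[0],
             tau + (u - tau / theta[0]) * theta[1])

def neg_loglik(params):
    a, b = params
    th = np.exp(a + b * s)
    # cumulative exposure accrued by time t under the step-stress pattern
    e = np.where(t <= tau, t / th[0], tau / th[0] + (t - tau) / th[1])
    rate = np.where(t <= tau, 1.0 / th[0], 1.0 / th[1])
    return -np.sum(np.log(rate) - e)

fit = minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
print("MLE of (a, b):", fit.x.round(2), "true:", (a_true, b_true))
```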
Online Analysis of Medical Time Series
Vol. 4 (2017), pp. 169–188. Complex, often high-dimensional time series are observed in medical applications such as intensive care. We review statistical tools for intelligent alarm systems, which are helpful for guiding medical decision-making in time-critical situations. The procedures described can also be applied for decision support or in closed-loop controllers. Robust time series filters allow one to extract a signal in the form of a time-varying trend with little or no delay. Additional rules—based, for instance, on suitably designed statistical tests—can be incorporated to preserve or detect interesting patterns such as level shifts or trend changes. Statistical pattern detection is a useful preprocessing step for decision-support systems. Dimension reduction techniques allow the compression of the often high-dimensional time series into a few variables containing most of the information inherent in the observed data. Combining such techniques with tools for analyzing the relationships among the variables in the form of large partial correlations or similar trend behavior improves the interpretability of the extracted variables and provides information that is thus meaningful to physicians.
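As a simple generic illustration of robust signal extraction, not a procedure taken from the article: a trailing running median tracks a level shift in a simulated vital-sign series while resisting spiky measurement artifacts. The series and window width are invented.

```python
# A minimal sketch of robust online signal extraction with a trailing running
# median; the simulated "heart rate" series and window width are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
n = 300
signal = np.concatenate([np.full(150, 80.0), np.full(150, 95.0)])  # level shift
series = signal + rng.normal(0, 2, n)
series[rng.choice(n, 10, replace=False)] += 40       # spiky measurement artifacts

def running_median(x, width=15):
    """Median over a trailing window, usable in an online setting."""
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = np.median(x[max(0, i - width + 1): i + 1])
    return out

trend = running_median(series)
print(round(trend[140], 1), round(trend[200], 1))     # ~80 before, ~95 after shift
```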
Statistical Methods for Large Ensembles of Super-Resolution Stochastic Single Particle Trajectories in Cell Biology
Vol. 4 (2017), pp. 189–223. Following the progress in super-resolution microscopy over the past decade, massive amounts of redundant single stochastic trajectories are now available for statistical analysis. Flows of trajectories of molecules or proteins sample the cell membrane or its interior at very high time and space resolution. Several statistical methods have been developed to extract the information contained in these data, such as the biophysical parameters of the underlying stochastic motion, in order to reveal cellular organization. These trajectories can further reveal hidden subcellular structures. We review here the statistical analysis of these trajectories based on the classical Langevin equation, which serves as a model of trajectories. Parametric and nonparametric estimators are constructed by discretizing the stochastic equations, and they allow the recovery of tethering forces, diffusion tensors, or membrane organization from measured trajectories that differ from physical ones by localization noise. Modeling, data analysis, and automatic detection algorithms serve to extract novel biophysical features such as potential wells and other substructures, for example rings, at an unprecedented spatiotemporal resolution. It is also possible to reconstruct the surface membrane of a biological cell from the statistics of projected random trajectories.
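As a generic illustration of the discretization idea, not taken from the review: for a one-dimensional overdamped Langevin model, increments of a sampled path give simple estimators of the drift and the diffusion coefficient. The harmonic (tethering) force and all parameters below are hypothetical, and localization noise is ignored.

```python
# A minimal sketch: estimate the drift and diffusion coefficient of the 1D
# overdamped Langevin model dX = -k*X dt + sqrt(2D) dW from a sampled path.
# The trajectory is simulated; localization noise is not modeled here.
import numpy as np

rng = np.random.default_rng(4)
k_true, D_true, dt, n = 2.0, 0.05, 1e-3, 200_000

x = np.empty(n)
x[0] = 0.0
noise = rng.normal(0.0, np.sqrt(2 * D_true * dt), n - 1)
for i in range(n - 1):                       # Euler-Maruyama simulation
    x[i + 1] = x[i] - k_true * x[i] * dt + noise[i]

dx = np.diff(x)
# Drift: regress increments on position (least-squares slope estimates -k*dt).
k_hat = -np.sum(x[:-1] * dx) / (np.sum(x[:-1] ** 2) * dt)
# Diffusion: mean squared increment ~ 2*D*dt (the drift term is O(dt^2)).
D_hat = np.mean(dx ** 2) / (2 * dt)
print(round(k_hat, 2), round(D_hat, 3))      # close to k_true = 2.0, D_true = 0.05
```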
Statistical Issues in Forensic Science
Vol. 4 (2017), pp. 225–244. Forensic science refers to the use of scientific methods in a legal context. Several recent events, especially the release in 2009 of the National Research Council (NRC) report Strengthening Forensic Science in the United States: A Path Forward, have raised concerns about the methods used to analyze forensic evidence and the ways in which forensic evidence is interpreted and reported on in court. The NRC report identified challenges including the lack of resources in many jurisdictions compared with the amount of evidence requiring processing, the lack of standardization across laboratories and practitioners, and questions about the analysis, interpretation, and presentation of evidence. With respect to the last, the NRC report questions the underlying scientific foundation for forensic examinations of some evidence types. Statistics has emerged as a key discipline for helping the forensic science community address these challenges. The standard elements of statistical analysis (study design, data collection, data analysis, statistical inference, and summarizing and reporting inferences) are all relevant. This article reviews the role of forensic evidence, the heterogeneity of forensic domains, current practices and their limitations, and the potential contributions of more rigorous statistical methods, especially Bayesian approaches and the likelihood ratio, in the analysis, interpretation, and reporting of forensic evidence.
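As a toy numerical illustration of the likelihood ratio framework mentioned above, not an example from the article: the evidence is weighed as LR = P(E | same source) / P(E | different source). The measurement values and both densities below are entirely hypothetical.

```python
# A minimal sketch of a forensic likelihood ratio for a continuous measurement
# (e.g., glass refractive index); every number and distribution is hypothetical.
from scipy import stats

y_crime, y_suspect = 1.51910, 1.51905     # recovered and control measurements
sigma_within = 4e-5                       # within-source / measurement spread
mu_pop, sigma_pop = 1.5182, 4e-3          # background population of glass sources

# Numerator: density of the crime-scene value if it shares the suspect's source.
num = stats.norm.pdf(y_crime, loc=y_suspect, scale=sigma_within)
# Denominator: density of the crime-scene value for a random alternative source.
den = stats.norm.pdf(y_crime, loc=mu_pop, scale=sigma_pop)
print(f"LR = {num / den:.1f}")            # LR > 1 supports the same-source proposition
```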
Bayesian Modeling and Analysis of Geostatistical Data
Vol. 4 (2017), pp. 245–266. The most prevalent spatial data setting is, arguably, that of so-called geostatistical data, data that arise as random variables observed at fixed spatial locations. Collection of such data in space and in time has grown enormously in the past two decades. With it has grown a substantial array of methods to analyze such data. Here, we attempt a review of a fully model-based perspective for such data analysis, the approach of hierarchical modeling fitted within a Bayesian framework. The benefit, as with hierarchical Bayesian modeling in general, is full and exact inference, with proper assessment of uncertainty. Geostatistical modeling includes univariate and multivariate data collection at sites, continuous and categorical data at sites, static and dynamic data at sites, and datasets over very large numbers of sites and long periods of time. Within the hierarchical modeling framework, we offer a review of the current state of the art in these settings.
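As a stripped-down illustration, not code from the review, of the Gaussian process layer inside such hierarchical models: with the covariance parameters held fixed rather than given priors, the kriging (conditional) mean and variance at a new site follow from standard multivariate normal formulas. Locations and parameter values below are arbitrary.

```python
# A minimal kriging sketch with an exponential covariance; the locations, sill,
# range, and nugget are hypothetical, and a fully Bayesian hierarchical model
# would place priors on these parameters instead of fixing them.
import numpy as np

rng = np.random.default_rng(5)
s = rng.uniform(0, 10, size=(30, 2))            # observed spatial locations
sigma2, phi, tau2 = 1.0, 2.0, 0.1               # partial sill, range, nugget

def exp_cov(a, b):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / phi)

# Simulate data from the model, then predict at a new location s0.
K = exp_cov(s, s) + tau2 * np.eye(len(s))
y = rng.multivariate_normal(np.zeros(len(s)), K)

s0 = np.array([[5.0, 5.0]])
k0 = exp_cov(s0, s)                             # cross-covariance, shape (1, 30)
pred_mean = (k0 @ np.linalg.solve(K, y))[0]
pred_var = (sigma2 + tau2 - k0 @ np.linalg.solve(K, k0.T))[0, 0]
print(round(pred_mean, 2), round(pred_var, 2))
```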
Modeling Through Latent Variables
Vol. 4 (2017), pp. 267–282. In this review, we give a general overview of latent variable models. We introduce the general model and discuss various inferential approaches. Afterward, we present several commonly applied special cases, including mixture or latent class models, as well as mixed models. We apply many of these models to a single data set with simple structure, allowing for easy comparison of the results. This allows us to discuss advantages and disadvantages of the various approaches, but also to illustrate several problems inherently linked to models incorporating latent structures. Finally, we touch on model extensions and applications and highlight several issues often ignored when applying latent variable models.
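As a generic illustration of one special case named here, not taken from the review: a two-component Gaussian mixture fitted by EM, where the latent variable is the unobserved component label. The data and starting values are simulated.

```python
# A minimal EM sketch for a two-component Gaussian mixture (the latent class is
# which component generated each point); data and starting values are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 200)])

pi, mu, sd = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    # E step: posterior probability that each observation came from component 2
    d1 = (1 - pi) * stats.norm.pdf(y, mu[0], sd[0])
    d2 = pi * stats.norm.pdf(y, mu[1], sd[1])
    r = d2 / (d1 + d2)
    # M step: update the mixing weight, means, and standard deviations
    pi = r.mean()
    mu = np.array([np.average(y, weights=1 - r), np.average(y, weights=r)])
    sd = np.sqrt([np.average((y - mu[0]) ** 2, weights=1 - r),
                  np.average((y - mu[1]) ** 2, weights=r)])

print(round(pi, 2), mu.round(2), sd.round(2))   # roughly 0.4, [0, 4], [1, 1]
```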
Two-Part and Related Regression Models for Longitudinal Data
V.T. Farewell, D.L. Long, B.D.M. Tom, S. Yiu, and L. Su. Vol. 4 (2017), pp. 283–315. Statistical models that involve a two-part mixture distribution are applicable in a variety of situations. Frequently, the two parts are a model for the binary response variable and a model for the outcome variable that is conditioned on the binary response. Two common examples are zero-inflated or hurdle models for count data and two-part models for semicontinuous data. Recently, there has been particular interest in the use of these models for the analysis of repeated measures of an outcome variable over time. The aim of this review is to consider motivations for the use of such models in this context and to highlight the central issues that arise with their use. We examine two-part models for semicontinuous and zero-heavy count data, and we also consider models for count data with a two-part random effects distribution.
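As a bare-bones illustration of the two-part idea for semicontinuous data, not taken from the article and ignoring the longitudinal and random-effects structure that is its focus: a Bernoulli part for whether the outcome is positive and a lognormal part for its magnitude when it is. The data and distributional choice are hypothetical.

```python
# A minimal cross-sectional two-part model for semicontinuous data: a Bernoulli
# part for P(Y > 0) and a lognormal part for Y | Y > 0. Simulated data only;
# the repeated-measures extensions discussed in the review are omitted.
import numpy as np

rng = np.random.default_rng(7)
n = 2000
positive = rng.random(n) < 0.3                       # 30% have a nonzero outcome
y = np.where(positive, rng.lognormal(mean=2.0, sigma=0.5, size=n), 0.0)

# Part 1: probability of a positive outcome (MLE is the sample proportion).
p_hat = (y > 0).mean()
# Part 2: lognormal parameters fitted to the positive observations only.
logs = np.log(y[y > 0])
mu_hat, sigma_hat = logs.mean(), logs.std(ddof=1)

# The overall mean combines both parts: E[Y] = P(Y > 0) * E[Y | Y > 0].
mean_hat = p_hat * np.exp(mu_hat + sigma_hat ** 2 / 2)
print(round(p_hat, 2), round(mu_hat, 2), round(sigma_hat, 2), round(mean_hat, 2))
```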
Some Recent Developments in Statistics for Spatial Point Patterns
Vol. 4 (2017), pp. 317–342. This article reviews developments in statistics for spatial point processes obtained within roughly the past decade. These developments include new classes of spatial point process models such as determinantal point processes, models incorporating both regularity and aggregation, and models where points are randomly distributed around latent geometric structures. Regarding parametric inference, the main focus is on various types of estimating functions derived from so-called innovation measures. Optimality of such estimating functions is discussed, as well as computational issues. Maximum likelihood inference for determinantal point processes and Bayesian inference are also briefly considered. Concerning nonparametric inference, we consider extensions of functional summary statistics to the case of inhomogeneous point processes as well as new approaches to simulation-based inference.
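As a small generic illustration related to the inhomogeneous processes mentioned here, not code from the article: an inhomogeneous Poisson point pattern on the unit square can be simulated by thinning a homogeneous one. The intensity function below is made up.

```python
# A minimal sketch: simulate an inhomogeneous Poisson point process on [0, 1]^2
# by thinning. The intensity function lambda(x, y) is hypothetical.
import numpy as np

rng = np.random.default_rng(8)

def intensity(x, y):
    return 200.0 * np.exp(-3.0 * x)      # more points near the left edge

lam_max = 200.0                          # upper bound on the intensity
n = rng.poisson(lam_max)                 # homogeneous proposal on the unit square
xy = rng.random((n, 2))
keep = rng.random(n) < intensity(xy[:, 0], xy[:, 1]) / lam_max
points = xy[keep]
print(len(points), "points retained; expected about",
      round(200.0 * (1 - np.exp(-3.0)) / 3.0))
```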
Stochastic Actor-Oriented Models for Network Dynamics
Vol. 4 (2017), pp. 343–363. This article discusses the stochastic actor-oriented model for analyzing panel data of networks. The model is defined as a continuous-time Markov chain, observed at two or more discrete time moments. It can be regarded as a generalized linear model with a large amount of missing data. Several estimation methods are discussed. After presenting the model for evolution of networks, attention is given to coevolution models. These use the same approach of a continuous-time Markov chain observed at a small number of time points, but now with an extended state space. The state space can be, for example, the combination of a network and nodal variables, or a combination of several networks. This leads to models for the dynamics of multivariate networks. The article emphasizes the approach to modeling and algorithmic issues for estimation; some attention is given to comparison with other models.
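As a heavily simplified sketch of the model's micro-step, not the estimation machinery discussed in the article: when an actor gets an opportunity for change, it toggles at most one outgoing tie, chosen by a multinomial logit on an objective function. Only outdegree and reciprocity effects are included, and all parameter values are hypothetical.

```python
# A heavily simplified stochastic actor-oriented micro-step with hypothetical
# outdegree and reciprocity parameters; no estimation, no covariates.
import numpy as np

rng = np.random.default_rng(9)
n_actors, rate = 10, 5.0
beta_out, beta_rec = -1.5, 2.0                 # hypothetical effect parameters
net = (rng.random((n_actors, n_actors)) < 0.1).astype(int)
np.fill_diagonal(net, 0)

def objective(state, i):
    """Actor i's evaluation function: outdegree and reciprocity effects only."""
    return beta_out * state[i].sum() + beta_rec * (state[i] * state[:, i]).sum()

def micro_step(state):
    """One change opportunity: actor i toggles at most one outgoing tie."""
    i = rng.integers(n_actors)
    scores = np.empty(n_actors)
    for j in range(n_actors):
        cand = state.copy()
        if j != i:
            cand[i, j] = 1 - cand[i, j]        # choosing j == i keeps the state
        scores[j] = objective(cand, i)
    p = np.exp(scores - scores.max())
    p /= p.sum()                               # multinomial-logit choice
    j = rng.choice(n_actors, p=p)
    if j != i:
        state[i, j] = 1 - state[i, j]
    return state

# Change opportunities between two panel waves arrive at rate `rate` per actor.
for _ in range(rng.poisson(rate * n_actors)):
    net = micro_step(net)
print("ties after simulation:", int(net.sum()))
```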
Structure Learning in Graphical Modeling
Vol. 4 (2017), pp. 365–393. A graphical model is a statistical model that is associated with a graph whose nodes correspond to variables of interest. The edges of the graph reflect allowed conditional dependencies among the variables. Graphical models have computationally convenient factorization properties and have long been a valuable tool for tractable modeling of multivariate distributions. More recently, applications such as reconstructing gene regulatory networks from gene expression data have driven major advances in structure learning, that is, estimating the graph underlying a model. We review some of these advances and discuss methods such as the graphical lasso and neighborhood selection for undirected graphical models (or Markov random fields) and the PC algorithm and score-based search methods for directed graphical models (or Bayesian networks). We further review extensions that account for effects of latent variables and heterogeneous data sources.
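As a short generic illustration of undirected structure learning, not taken from the review: the graphical lasso applied to simulated Gaussian data with a sparse precision matrix. The true structure and the penalty level are arbitrary choices.

```python
# A minimal graphical lasso sketch: recover a sparse precision (inverse
# covariance) matrix from simulated Gaussian data; the planted structure and
# the penalty alpha are arbitrary.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(10)
p = 5
prec = np.eye(p)
prec[0, 1] = prec[1, 0] = 0.4               # one conditional-dependence edge
prec[2, 3] = prec[3, 2] = -0.3              # and another
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov, size=2000)

model = GraphicalLasso(alpha=0.05).fit(X)
est_edges = (np.abs(model.precision_) > 1e-3) & ~np.eye(p, dtype=bool)
print(np.argwhere(np.triu(est_edges)))      # ideally recovers edges (0,1), (2,3)
```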
Bayesian Computing with INLA: A Review
Vol. 4 (2017), pp. 395–421. The key operation in Bayesian inference is to compute high-dimensional integrals. An old approximate technique is the Laplace method or approximation, which dates back to Pierre-Simon Laplace (1774). This simple idea approximates the integrand with a second-order Taylor expansion around the mode and computes the integral analytically. By developing a nested version of this classical idea, combined with modern numerical techniques for sparse matrices, we obtain the approach of integrated nested Laplace approximations (INLA) to do approximate Bayesian inference for latent Gaussian models (LGMs). LGMs represent an important model abstraction for Bayesian inference and include a large proportion of the statistical models used today. In this review, we discuss the reasons for the success of the INLA approach, the R-INLA package, why it is so accurate, why the approximations are very quick to compute, and why LGMs make such a useful concept for Bayesian computing.
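As a one-dimensional reminder of the Laplace step underlying this approach (a textbook exercise, not code from the review): approximate an integral of exp(g(x)) by exp(g(x̂))·sqrt(2π / (−g''(x̂))) at the mode x̂ and compare with numerical integration. The Gamma kernel below is an arbitrary example.

```python
# A minimal sketch of the Laplace approximation to a one-dimensional integral,
# here the normalizing constant of a Gamma(shape=6, rate=2) density kernel.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

a, b = 6.0, 2.0
g = lambda x: (a - 1) * np.log(x) - b * x      # log of the unnormalized density

# Mode and curvature of g (available in closed form here, but found numerically
# to mimic the general recipe).
x_hat = minimize_scalar(lambda x: -g(x), bounds=(1e-6, 50), method="bounded").x
h = 1e-5
g2 = (g(x_hat + h) - 2 * g(x_hat) + g(x_hat - h)) / h**2   # g''(x_hat) < 0

laplace = np.exp(g(x_hat)) * np.sqrt(2 * np.pi / -g2)
exact, _ = quad(lambda x: np.exp(g(x)), 0, np.inf)
print(round(laplace, 4), round(exact, 4))      # exact = Gamma(6) / 2**6 = 1.875
```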
Global Testing and Large-Scale Multiple Testing for High-Dimensional Covariance Structures
Vol. 4 (2017), pp. 423–446. Driven by a wide range of contemporary applications, statistical inference for covariance structures has been an active area of research in high-dimensional statistics. This review provides a selective survey of some recent developments in hypothesis testing for high-dimensional covariance structures, including global testing for the overall pattern of the covariance structures and simultaneous testing of a large collection of hypotheses on the local covariance structures with false discovery proportion and false discovery rate control. Both one-sample and two-sample settings are considered. The specific testing problems discussed include global testing for the covariance, correlation, and precision matrices, and multiple testing for the correlations, Gaussian graphical models, and differential networks.
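As a small generic illustration of simultaneous testing on a covariance structure (a standard recipe rather than the specific procedures surveyed here): test every pairwise correlation with a Fisher z statistic and control the false discovery rate with Benjamini-Hochberg. The simulated setting below is arbitrary.

```python
# A minimal sketch: simultaneous tests of all pairwise correlations via Fisher's
# z transform with Benjamini-Hochberg FDR control; the data setup is arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, p = 200, 20
X = rng.normal(size=(n, p))
X[:, 1] = 0.6 * X[:, 0] + 0.8 * X[:, 1]      # plant one correlated pair (0, 1)

R = np.corrcoef(X, rowvar=False)
iu = np.triu_indices(p, k=1)
z = np.arctanh(R[iu]) * np.sqrt(n - 3)       # approximately N(0, 1) under rho = 0
pvals = 2 * stats.norm.sf(np.abs(z))

# Benjamini-Hochberg at FDR level q = 0.05.
q, m = 0.05, len(pvals)
order = np.argsort(pvals)
passed = pvals[order] <= q * np.arange(1, m + 1) / m
k = np.max(np.where(passed)[0]) + 1 if passed.any() else 0
rejected = order[:k]
print([(int(iu[0][r]), int(iu[1][r])) for r in rejected])  # ideally just (0, 1)
```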
The Energy of Data
Vol. 4 (2017), pp. 447–479. The energy of data is the value of a real function of distances between data in metric spaces. The name energy derives from Newton's gravitational potential energy, which is also a function of distances between physical objects. One of the advantages of working with energy functions (energy statistics) is that even if the data are complex objects, such as functions or graphs, we can use their real-valued distances for inference. Other advantages are illustrated and discussed in this review. Concrete examples include energy testing for normality, energy clustering, and distance correlation. Applications include genome studies, brain studies, and astrophysics. The direct connection between energy and mind/observations/data in this review is a counterpart of the equivalence of energy and matter/mass in Einstein's E = mc^2.
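As a compact generic implementation of one concrete example named here, not the authors' code: distance correlation computed from double-centered Euclidean distance matrices, which detects a simulated nonlinear dependence that the Pearson correlation misses.

```python
# A minimal sketch of the (V-statistic) distance correlation between two
# one-dimensional samples; the simulated nonlinear dependence is arbitrary.
import numpy as np

def _centered_dist(v):
    d = np.abs(v[:, None] - v[None, :])                  # pairwise distances
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def dist_corr(x, y):
    A = _centered_dist(np.asarray(x, float))
    B = _centered_dist(np.asarray(y, float))
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(12)
x = rng.normal(size=500)
y = x ** 2 + 0.1 * rng.normal(size=500)                  # nonlinear dependence
print(round(float(np.corrcoef(x, y)[0, 1]), 2),          # Pearson is near 0
      round(float(dist_corr(x, y)), 2))                  # distance correlation is not
```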