Annual Review of Statistics and Its Application - Volume 2, 2015
Preface
Reproducing Statistical Results
Vol. 2 (2015), pp. 1–19. The reproducibility of statistical findings has become a concern not only for statisticians, but for all researchers engaged in empirical discovery. Section 2 of this article identifies key reasons statistical findings may not replicate, including power and sampling issues; misapplication of statistical tests; the instability of findings under reasonable perturbations of data or models; lack of access to methods, data, or equipment; and cultural barriers such as researcher incentives and rewards. Section 3 discusses five proposed remedies for these replication failures: improved prepublication and postpublication validation of findings; the complete disclosure of research steps; assessment of the stability of statistical findings; providing access to digital research objects, in particular data and software; and ensuring these objects are legally reusable.
How to See More in Observational Studies: Some New Quasi-Experimental Devices
Vol. 2 (2015), pp. 21–48. In a well-conducted, slightly idealized, randomized experiment, the only explanation of an association between treatment and outcome is an effect caused by the treatment. However, this is not true in observational studies of treatment effects, in which treatment and outcomes may be associated because of some bias in the assignment of treatments to individuals. When added to the design of an observational study, quasi-experimental devices investigate empirically a particular rival explanation or counterclaim, often attempting to preempt anticipated counterclaims. This review has three parts: a discussion of the often misunderstood logic of quasi-experimental devices; a brief overview of the important work of Donald T. Campbell and his colleagues (excellent expositions of this work have been published elsewhere); and its main topic, descriptions and empirical examples of newer devices, including evidence factors, differential effects, and the computerized construction of quasi-experiments.
Incorporating Both Randomized and Observational Data into a Single Analysis
Vol. 2 (2015), pp. 49–72. Although both randomized and nonrandomized study data relevant to a question of treatment efficacy are often available and separately analyzed, these data are rarely formally combined in a single analysis. One possible reason for this is the apparent or feared disagreement of effect estimates across designs, which can be attributed both to differences in estimand definition and to analyses that may produce biased estimators. This article reviews specific models and general frameworks that aim to harmonize analyses from the two designs and combine them via a single analysis that ideally exploits the relative strengths of each design. The development of such methods is still in its infancy, and examples of applications with joint analyses are rare. This area would greatly benefit from more attention from researchers in statistical methods and applications.
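As a hedged illustration of the basic tension the article addresses, the sketch below combines a randomized and an observational effect estimate by inverse-variance weighting, with a hypothetical bias_sd term that inflates the observational variance to allow for design bias. The function name and parameters are illustrative inventions, not a method from the article, whose reviewed models are considerably richer.

```python
# Illustrative sketch (not a method from the article): combine a randomized
# and an observational effect estimate by inverse-variance weighting, after
# inflating the observational variance to allow for possible design bias.
import numpy as np

def combine_estimates(est_rct, se_rct, est_obs, se_obs, bias_sd=0.0):
    """Precision-weighted average of two effect estimates.

    bias_sd is a hypothetical standard deviation for unmeasured bias in the
    observational estimate; setting it > 0 down-weights that estimate.
    """
    var_rct = se_rct ** 2
    var_obs = se_obs ** 2 + bias_sd ** 2   # discount the observational design
    w_rct, w_obs = 1.0 / var_rct, 1.0 / var_obs
    combined = (w_rct * est_rct + w_obs * est_obs) / (w_rct + w_obs)
    combined_se = np.sqrt(1.0 / (w_rct + w_obs))
    return combined, combined_se

# Toy numbers: the observational study has the smaller standard error but
# receives less weight once a bias allowance is added.
print(combine_estimates(est_rct=0.30, se_rct=0.10,
                        est_obs=0.45, se_obs=0.05, bias_sd=0.10))
```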
Microbiome, Metagenomics, and High-Dimensional Compositional Data Analysis
Vol. 2 (2015), pp. 73–94. The human microbiome is the totality of all microbes in and on the human body, and its importance in health and disease has been increasingly recognized. High-throughput sequencing technologies have recently enabled scientists to obtain an unbiased quantification of all microbes constituting the microbiome. Often, a single sample can produce hundreds of millions of short sequencing reads. However, unique characteristics of the data produced by the new technologies, as well as the sheer magnitude of these data, make drawing valid biological inferences from microbiome studies difficult. Analysis of these big data poses great statistical and computational challenges. Important issues include normalization and quantification of relative taxa, bacterial genes, and metabolic abundances; incorporation of phylogenetic information into analysis of metagenomics data; and multivariate analysis of high-dimensional compositional data. We review existing methods, point out their limitations, and outline future research directions.
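One standard building block for compositional analysis of relative abundances is the centered log-ratio (CLR) transform; the sketch below applies it to a toy count table, using a pseudocount for zeros. It is a generic baseline rather than a specific method from the review, and the function name and pseudocount value are illustrative choices.

```python
# A standard compositional-data step: convert raw taxon counts to relative
# abundances and apply the centered log-ratio (CLR) transform, adding a
# pseudocount so that zero counts can be log-transformed.
import numpy as np

def clr_transform(counts, pseudocount=0.5):
    """counts: (n_samples, n_taxa) array of nonnegative read counts."""
    counts = np.asarray(counts, dtype=float) + pseudocount
    proportions = counts / counts.sum(axis=1, keepdims=True)
    log_p = np.log(proportions)
    # Subtract each sample's mean log-proportion so rows sum to zero.
    return log_p - log_p.mean(axis=1, keepdims=True)

# Toy example: three samples, four taxa, very different sequencing depths.
counts = np.array([[120, 30, 0, 50],
                   [1200, 310, 10, 480],
                   [5, 80, 40, 75]])
print(np.round(clr_transform(counts), 3))
```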
Multiset Statistics for Gene Set Analysis
Vol. 2 (2015), pp. 95–111. An important data analysis task in statistical genomics involves the integration of genome-wide gene-level measurements with preexisting data on the same genes. A wide variety of statistical methodologies and computational tools have been developed for this general task. We emphasize one particular distinction among methodologies, namely whether they process gene sets one at a time (uniset) or simultaneously via some multiset technique. Owing to the complexity of collections of gene sets, the multiset approach offers some advantages, as it naturally accommodates set-size variations and among-set overlaps. However, this approach presents both computational and inferential challenges. After reviewing some statistical issues that arise in uniset analysis, we examine two model-based multiset methods for gene list data.
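For context, the simplest uniset analysis is a hypergeometric over-representation test of a single gene set against a gene list; a minimal sketch follows. This is only a baseline, not one of the model-based multiset methods the article examines, and the toy numbers are invented.

```python
# Minimal uniset baseline: hypergeometric over-representation test asking
# whether a gene list overlaps one gene set more than chance would predict.
from scipy.stats import hypergeom

def enrichment_pvalue(n_genome, n_set, n_list, n_overlap):
    """P(overlap >= n_overlap) when n_list genes are drawn at random from a
    genome of n_genome genes that contains a set of n_set genes."""
    return hypergeom.sf(n_overlap - 1, n_genome, n_set, n_list)

# Toy numbers: 20,000 genes, a set of 150, a list of 400 hits, 12 in common.
print(enrichment_pvalue(n_genome=20000, n_set=150, n_list=400, n_overlap=12))
```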
Probabilistic Record Linkage in Astronomy: Directional Cross-Identification and Beyond
Vol. 2 (2015), pp. 113–139. Modern astronomy increasingly relies upon systematic surveys, whose dedicated telescopes continuously observe the sky across varied wavelength ranges of the electromagnetic spectrum; some surveys also observe nonelectromagnetic messengers, such as high-energy particles or gravitational waves. Stars and galaxies look different through the eyes of different instruments, and their independent measurements have to be carefully combined to provide a complete, sound picture of the multicolor and eventful universe. The association of an object's independent detections is, however, a difficult problem scientifically, computationally, and statistically, raising varied challenges across diverse astronomical applications. The fundamental problem is finding records in survey databases with directions that match to within the direction uncertainties. Such astronomical versions of the record linkage problem are known by various terms in astronomy: cross-matching; cross-identification; and directional, positional, or spatiotemporal coincidence assessment. Astronomers have developed several statistical approaches for such problems, largely independent of related developments in other disciplines. Here, we review emerging approaches that compute (Bayesian) probabilities for the hypotheses of interest: possible associations or demographic properties of a cosmic population that depend on identifying associations. Many cross-identification tasks can be formulated within a hierarchical Bayesian partition model framework, with components that explicitly account for astrophysical effects (e.g., source brightness versus wavelength, source motion, or source extent), selection effects, and measurement error. We survey recent developments and highlight important open areas for future research.
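A hedged sketch of the core two-catalog decision: under simplifying assumptions (flat-sky geometry, circular Gaussian positional errors, a single unrelated source distributed uniformly over the field), the ratio of the "same source" and "chance alignment" likelihoods for a pair of detections reduces to the expression coded below. The function and its parameters are illustrative; the hierarchical Bayesian partition models reviewed in the article handle selection effects, source motion, and population demographics that this toy calculation ignores.

```python
# Illustrative likelihood ratio for cross-matching a pair of detections,
# assuming circular Gaussian positional errors on a flat sky and a uniform
# spatial distribution for an unrelated source.
import numpy as np

def match_likelihood_ratio(sep_arcsec, sigma1_arcsec, sigma2_arcsec, field_area_arcsec2):
    """sep_arcsec: angular separation of the two detections.
    sigma*_arcsec: 1-sigma positional uncertainty of each catalog entry.
    field_area_arcsec2: area over which an unrelated source could fall."""
    s2 = sigma1_arcsec ** 2 + sigma2_arcsec ** 2
    same_source_density = np.exp(-sep_arcsec ** 2 / (2.0 * s2)) / (2.0 * np.pi * s2)
    chance_density = 1.0 / field_area_arcsec2
    return same_source_density / chance_density

# Two detections 0.4" apart, each with 0.3" errors, in a 1 deg^2 field.
deg2_in_arcsec2 = 3600.0 ** 2
print(match_likelihood_ratio(0.4, 0.3, 0.3, deg2_in_arcsec2))
```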
A Framework for Statistical Inference in Astrophysics
Vol. 2 (2015), pp. 141–162. The rapid growth of astronomical data sets, coupled with the complexity of the questions scientists seek to answer with these data, creates an increasing need for the utilization of advanced statistical inference methods in astrophysics. Here, focus is placed on situations in which the underlying objective is the estimation of cosmological parameters, the key physical constants that characterize the Universe. Owing to the complex relationship between these parameters and the observable data, this broad inference goal is best divided into three stages. The primary objective of this article is to describe these stages and thus place into a coherent framework the class of inference problems commonly encountered by those working in this field. Examples of such inference challenges are presented.
Modern Statistical Challenges in High-Resolution Fluorescence Microscopy
Vol. 2 (2015), pp. 163–202. Conventional light microscopes have been used for centuries for the study of small length scales down to approximately 250 nm. Images from such a microscope are typically blurred and noisy, and the measurement error in such images can often be well approximated by Gaussian or Poisson noise. In the past, this approximation has been the focus of a multitude of deconvolution techniques in imaging. However, conventional microscopes have an intrinsic physical limit of resolution. Although this limit remained unchallenged for a century, it was broken for the first time in the 1990s with the advent of modern superresolution fluorescence microscopy techniques. Since then, superresolution fluorescence microscopy has become an indispensable tool for studying the structure and dynamics of living organisms. Current experimental advances go to the physical limits of imaging, where discrete quantum effects are predominant. Consequently, this technique is inherently of a non-Gaussian statistical nature, and we argue that recent technological progress also challenges the long-standing Poisson assumption. Thus, analysis and exploitation of the discrete physical mechanisms of fluorescent molecules and light, as well as their distributions in time and space, have become necessary to achieve the highest resolution possible. This article presents an overview of some physical principles underlying modern fluorescence microscopy techniques from a statistical modeling and analysis perspective. To this end, we develop a prototypical model for fluorophore dynamics and use it to discuss statistical methods for image deconvolution and more complicated image reconstruction and enhancement techniques. Several examples are discussed in more detail, including variational multiscale methods for confocal and stimulated emission depletion (STED) microscopy, drift correction for single marker switching (SMS) microscopy, and sparse estimation and background removal for superresolution by polarization angle demodulation (SPoD). We illustrate that such methods benefit from advances in large-scale computing, for example, from recent tools from convex optimization. We argue that in the future, even higher resolutions will require more detailed models that delve into sub-Poissonian statistics.
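As a point of reference for the deconvolution discussion, the sketch below implements classical Richardson–Lucy iteration, the textbook algorithm for deblurring under Poisson noise. It is a baseline only; the variational multiscale and sparse reconstruction methods described in the article go well beyond it, and the function signature and toy point-source example are illustrative.

```python
# Classical Richardson-Lucy deconvolution for Poisson-noise imaging,
# shown only as a baseline for comparison with modern methods.
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(image, psf, n_iter=50, eps=1e-12):
    """image: observed (blurred, noisy) 2D array; psf: point spread function."""
    estimate = np.full_like(image, image.mean(), dtype=float)
    psf_flipped = psf[::-1, ::-1]
    for _ in range(n_iter):
        blurred = fftconvolve(estimate, psf, mode="same")
        ratio = image / (blurred + eps)
        estimate *= fftconvolve(ratio, psf_flipped, mode="same")
    return estimate

# Toy usage: blur two point sources with a Gaussian PSF, add Poisson noise.
rng = np.random.default_rng(0)
x = np.zeros((64, 64)); x[20, 20] = 500.0; x[40, 44] = 300.0
g = np.exp(-(np.arange(-7, 8)[:, None] ** 2 + np.arange(-7, 8)[None, :] ** 2) / 8.0)
psf = g / g.sum()
y = rng.poisson(fftconvolve(x, psf, mode="same").clip(min=0)).astype(float)
recovered = richardson_lucy(y, psf, n_iter=100)
```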
Statistics of Extremes
A.C. Davison and R. Huser. Vol. 2 (2015), pp. 203–235. Statistics of extremes concerns inference for rare events. Often the events have never yet been observed, and their probabilities must therefore be estimated by extrapolation of tail models fitted to available data. Because data concerning the event of interest may be very limited, efficient methods of inference play an important role. This article reviews this domain, emphasizing current research topics. We first sketch the classical theory of extremes for maxima and threshold exceedances of stationary series. We then review multivariate theory, distinguishing asymptotic independence and dependence models, followed by a description of models for spatial and spatiotemporal extreme events. Finally, we discuss inference and describe two applications. Animations illustrate some of the main ideas.
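A minimal peaks-over-threshold sketch, assuming SciPy's generalized Pareto distribution and a toy heavy-tailed sample: fit the exceedances above a high threshold and estimate a small tail probability. Threshold selection, dependence, and uncertainty quantification, central topics of the review, are deliberately omitted, and all numbers are illustrative.

```python
# Peaks-over-threshold sketch: fit a generalized Pareto distribution to
# exceedances above a high threshold and estimate a tail probability.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
data = rng.standard_t(df=4, size=5000)          # heavy-tailed toy data
u = np.quantile(data, 0.95)                     # high threshold
exceedances = data[data > u] - u

shape, loc, scale = genpareto.fit(exceedances, floc=0)
zeta_u = np.mean(data > u)                      # empirical P(X > u)

# Estimated probability of exceeding a level x well above the threshold.
x = u + 3.0
p_exceed = zeta_u * genpareto.sf(x - u, shape, loc=0, scale=scale)
print(round(float(p_exceed), 5))
```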
Multivariate Order Statistics: Theory and Application
Vol. 2 (2015), pp. 237–257. This work revisits several proposals for the ordering of multivariate data via a prescribed depth function. We argue that one of these deserves special consideration, namely, Tukey's halfspace depth, which constructs nested convex sets via intersections of halfspaces. These sets provide a natural generalization of univariate order statistics to higher dimensions and exhibit consistency and asymptotic normality as estimators of corresponding population quantities. For absolutely continuous probability measures in ℝ^d, we present a connection between halfspace depth and the Radon transform of the density function, which is employed to formalize both the finite-sample and asymptotic probability distributions of the random nested sets. We review multivariate goodness-of-fit statistics based on halfspace depths, which were originally proposed in the projection pursuit literature. Finally, we demonstrate the utility of halfspace ordering as an exploratory tool by studying spatial data on maximum and minimum temperatures produced by a climate simulation model.
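For intuition, halfspace depth in two dimensions can be approximated by brute force: for each of many directions, count the sample points in the closed halfplane through the query point, and take the minimum fraction. The sketch below does exactly that; the function name and number of directions are illustrative choices, and exact algorithms exist for low dimensions.

```python
# Brute-force approximation of Tukey's halfspace depth in two dimensions.
import numpy as np

def halfspace_depth_2d(point, sample, n_directions=360):
    angles = np.linspace(0.0, np.pi, n_directions, endpoint=False)
    directions = np.column_stack([np.cos(angles), np.sin(angles)])
    proj = (sample - point) @ directions.T          # (n_points, n_directions)
    # For each direction, take the smaller count of points on either side.
    counts = np.minimum((proj >= 0).sum(axis=0), (proj <= 0).sum(axis=0))
    return counts.min() / len(sample)

rng = np.random.default_rng(2)
cloud = rng.normal(size=(500, 2))
print(halfspace_depth_2d(np.array([0.0, 0.0]), cloud))   # deep central point
print(halfspace_depth_2d(np.array([3.0, 3.0]), cloud))   # shallow outlying point
```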
Agent-Based Models and Microsimulation
Vol. 2 (2015), pp. 259–272. Agent-based models (ABMs) are computational models used to simulate the actions and interactions of agents within a system. Usually, each agent has a relatively simple set of rules for how he or she responds to his or her environment and to other agents. These models are used to gain insight into the emergent behavior of complex systems with many agents, in which the emergent behavior depends upon the micro-level behavior of the individuals. ABMs are widely used in many fields, and this article reviews some of those applications. However, relatively little work has been done on statistical inference for such models, so this article also points out gaps in that literature and recent strategies to address them.
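A deliberately minimal ABM sketch, unrelated to any particular application in the review: agents follow a random walk on a grid, infection spreads between co-located agents, and an epidemic curve emerges from these micro-level rules. All parameter names and values are invented for illustration.

```python
# Minimal agent-based model: susceptible/infectious/recovered agents moving
# randomly on a grid, with infection passing between co-located agents.
import numpy as np

def run_abm(n_agents=400, grid=25, p_infect=0.6, p_recover=0.05, steps=150, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, grid, size=(n_agents, 2))
    state = np.zeros(n_agents, dtype=int)          # 0 = S, 1 = I, 2 = R
    state[rng.choice(n_agents, size=5, replace=False)] = 1
    history = []
    for _ in range(steps):
        pos = (pos + rng.integers(-1, 2, size=pos.shape)) % grid   # random walk
        cell = pos[:, 0] * grid + pos[:, 1]
        infected_cells = np.unique(cell[state == 1])
        exposed = (state == 0) & np.isin(cell, infected_cells)
        state[exposed & (rng.random(n_agents) < p_infect)] = 1
        state[(state == 1) & (rng.random(n_agents) < p_recover)] = 2
        history.append(int((state == 1).sum()))
    return history

print(max(run_abm()))   # peak number of infectious agents in the run
```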
Statistical Causality from a Decision-Theoretic Perspective
Vol. 2 (2015), pp. 273–303. We present an overview of the decision-theoretic framework of statistical causality, which is well suited for formulating and solving problems of determining the effects of applied causes. The approach is described in detail, and it is related to and contrasted with other current formulations, such as structural equation models and potential responses. Topics and applications covered include confounding, the effect of treatment on the treated, instrumental variables, and dynamic treatment strategies.
Using Longitudinal Complex Survey Data
Vol. 2 (2015), pp. 305–320. Common features of longitudinal surveys are complex sampling designs, which must be maintained and extended over time; measurement errors, including memory errors; panel conditioning or time-in-sample effects; and dropout or attrition. In the analysis of longitudinal survey data, both the theory of complex samples and the theory of longitudinal data analysis must be combined. This article reviews the purposes of longitudinal surveys and the kinds of analyses that are commonly used to address the questions these surveys are designed to answer. In it, I discuss approaches to incorporating the complex designs in inference, as well as the complications introduced by time-in-sample effects and by nonignorable attrition. I also outline the use and limitations of longitudinal survey data in supporting causal inference and conclude with some summary remarks.
Functional Regression
Vol. 2 (2015), pp. 321–359. Functional data analysis (FDA) involves the analysis of data whose ideal units of observation are functions defined on some continuous domain, and the observed data consist of a sample of functions taken from some population, sampled on a discrete grid. Ramsay & Silverman's (1997) textbook sparked the development of this field, which has accelerated in the past 10 years to become one of the fastest growing areas of statistics, fueled by the growing number of applications yielding this type of data. One unique characteristic of FDA is the need to combine information both across and within functions, which Ramsay and Silverman called replication and regularization, respectively. This article focuses on functional regression, the area of FDA that has received the most attention in applications and methodological development. First, there is an introduction to basis functions, key building blocks for regularization in functional regression methods, followed by an overview of functional regression methods, split into three types: (a) functional predictor regression (scalar-on-function), (b) functional response regression (function-on-scalar), and (c) function-on-function regression. For each, the role of replication and regularization is discussed and the methodological development described in a roughly chronological manner, at times deviating from the historical timeline to group together similar methods. The primary focus is on modeling and methodology, highlighting the modeling structures that have been developed and the various regularization approaches employed. The review concludes with a brief discussion describing potential areas of future development in this field.
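A minimal scalar-on-function sketch on simulated data: each curve is expanded in an orthonormal Fourier basis, the scalar response is regressed on the basis coefficients with a small ridge penalty, and the coefficient function is reconstructed from the fitted weights. Basis choice, penalty structure, and smoothing-parameter selection, all treated at length in the review, are fixed arbitrarily here; every name and value is illustrative.

```python
# Scalar-on-function regression sketch: Fourier basis expansion of each curve,
# followed by ridge regression of the response on the basis coefficients.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 101)                      # common sampling grid
n, k = 200, 7                                   # number of curves, basis size

# Orthonormal Fourier basis on [0, 1]: 1, sqrt(2)*sin(2*pi*j*t), sqrt(2)*cos(2*pi*j*t), ...
basis = [np.ones_like(t)]
for j in range(1, (k + 1) // 2 + 1):
    basis += [np.sqrt(2.0) * np.sin(2 * np.pi * j * t),
              np.sqrt(2.0) * np.cos(2 * np.pi * j * t)]
B = np.column_stack(basis[:k])                  # (len(t), k)

# Simulated curves and responses y_i = integral of beta(t) * X_i(t) dt + noise.
X = rng.normal(size=(n, k)) @ B.T + 0.1 * rng.normal(size=(n, len(t)))
beta_true = np.sin(2 * np.pi * t)
y = np.trapz(X * beta_true, t, axis=1) + 0.1 * rng.normal(size=n)

# Basis coefficients of each curve, then ridge regression on the coefficients.
C = np.linalg.lstsq(B, X.T, rcond=None)[0].T    # (n, k)
lam = 1e-3
coef = np.linalg.solve(C.T @ C + lam * np.eye(k), C.T @ y)
beta_hat = B @ coef                             # estimated coefficient function
print(float(np.trapz((beta_hat - beta_true) ** 2, t)))  # integrated squared error
```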
Learning Deep Generative Models
Vol. 2 (2015), pp. 361–385. Building intelligent systems that are capable of extracting high-level representations from high-dimensional sensory data lies at the core of solving many artificial intelligence–related tasks, including object recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires models with deep architectures that involve many layers of nonlinear processing. In this article, we review several popular deep learning models, including deep belief networks and deep Boltzmann machines. We show that (a) these deep generative models, which contain many layers of latent variables and millions of parameters, can be learned efficiently, and (b) the learned high-level feature representations can be successfully applied in many application domains, including visual object recognition, information retrieval, classification, and regression tasks.
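A minimal sketch of the building block behind deep belief networks and deep Boltzmann machines: a binary restricted Boltzmann machine trained with one-step contrastive divergence (CD-1) on toy data. Hyperparameters and the toy patterns are invented for illustration; practical deep generative models stack many such layers and train on far larger data.

```python
# Binary restricted Boltzmann machine trained with one-step contrastive
# divergence (CD-1), the layer-wise building block of deep generative models.
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden=16, lr=0.05, epochs=200):
    n, n_visible = data.shape
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b_v = np.zeros(n_visible)          # visible biases
    b_h = np.zeros(n_hidden)           # hidden biases
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b_v)          # one Gibbs step back down
        p_h1 = sigmoid(p_v1 @ W + b_h)
        # CD-1 gradient approximation: data statistics minus model statistics.
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        b_v += lr * (v0 - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h

# Toy binary data: two prototype patterns corrupted by occasional bit flips.
protos = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = protos[rng.integers(0, 2, size=500)]
data = np.abs(data - (rng.random(data.shape) < 0.05))
W, b_v, b_h = train_rbm(data, n_hidden=4)
print(np.round(W, 2))
```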