
Abstract

Data sets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data, and diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset preserve the features of the original data? We review a line of work grounded in theoretical computer science and numerical linear algebra that finds that an algorithmically desirable sketch, which is a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as subspace embedding. Building on this work, we study how prediction and inference can be affected by data sketching within a linear regression setup. We use statistical arguments to provide “inference-conscious” guides to the sketch size and show that an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.
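To make the setup concrete, below is a minimal illustration, not the article's own algorithm, of sketched least squares: a sketch is formed by uniformly sampling rows, ordinary least squares is run on the sketch, and a pooled estimator averages coefficients across several independent sketches. The simulated design, the uniform sampling scheme, and the sizes n, m, and K are all illustrative assumptions; the article studies other sketching schemes and gives an inference-conscious guide to choosing the sketch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: i.i.d. Gaussian design, n = 1,000,000 rows,
# d = 5 regressors, and uniform row sampling as the sketching scheme.
n, d = 1_000_000, 5
beta = np.arange(1.0, d + 1.0)
X = rng.standard_normal((n, d))
y = X @ beta + rng.standard_normal(n)

def ols(X, y):
    """Least-squares coefficients via a numerically stable solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def uniform_sketch(X, y, m, rng):
    """One random sketch: m rows sampled uniformly without replacement."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return X[idx], y[idx]

# Full-sample benchmark (infeasible when the data do not fit in memory).
beta_full = ols(X, y)

# A single sketch of size m, and a pooled estimator that averages the
# coefficient estimates from K independent sketches.
m, K = 10_000, 10
beta_one = ols(*uniform_sketch(X, y, m, rng))
beta_pool = np.mean(
    [ols(*uniform_sketch(X, y, m, rng)) for _ in range(K)], axis=0
)

print("full sample  :", np.round(beta_full, 3))
print("one sketch   :", np.round(beta_one, 3))
print("pooled, K=10 :", np.round(beta_pool, 3))
```

Under these assumptions, the pooled estimator uses only a fraction of the rows yet typically tracks the full-sample coefficients closely, which is the intuition behind pooling over sketches rather than relying on a single draw.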

