
Abstract

Data sets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data, and diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset preserve the features of the original data? We review a line of work grounded in theoretical computer science and numerical linear algebra that finds that an algorithmically desirable sketch, which is a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as subspace embedding. Building on this work, we study how prediction and inference can be affected by data sketching within a linear regression setup. We use statistical arguments to provide “inference-conscious” guides to the sketch size and show that an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.
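To make the setup concrete, below is a minimal illustration, not the article's own algorithm, of sketched least squares: a sketch is formed by uniformly sampling rows, ordinary least squares is run on the sketch, and a pooled estimator averages coefficients across several independent sketches. The simulated design, the uniform sampling scheme, and the sizes n, m, and K are all illustrative assumptions; the article studies other sketching schemes and gives an inference-conscious guide to choosing the sketch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: i.i.d. Gaussian design, n = 1,000,000 rows,
# d = 5 regressors, and uniform row sampling as the sketching scheme.
n, d = 1_000_000, 5
beta = np.arange(1.0, d + 1.0)
X = rng.standard_normal((n, d))
y = X @ beta + rng.standard_normal(n)

def ols(X, y):
    """Least-squares coefficients via a numerically stable solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def uniform_sketch(X, y, m, rng):
    """One random sketch: m rows sampled uniformly without replacement."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return X[idx], y[idx]

# Full-sample benchmark (infeasible when the data do not fit in memory).
beta_full = ols(X, y)

# A single sketch of size m, and a pooled estimator that averages the
# coefficient estimates from K independent sketches.
m, K = 10_000, 10
beta_one = ols(*uniform_sketch(X, y, m, rng))
beta_pool = np.mean(
    [ols(*uniform_sketch(X, y, m, rng)) for _ in range(K)], axis=0
)

print("full sample  :", np.round(beta_full, 3))
print("one sketch   :", np.round(beta_one, 3))
print("pooled, K=10 :", np.round(beta_pool, 3))
```

Under these assumptions, the pooled estimator uses only a fraction of the rows yet typically tracks the full-sample coefficients closely, which is the intuition behind pooling over sketches rather than relying on a single draw.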

