Regularization is a widely used technique throughout statistics, machine learning, and applied mathematics. Modern applications in science and engineering lead to massive and complex data sets, which motivate the use of more structured types of regularizers. This survey provides an overview of the use of structured regularization in high-dimensional statistics, including regularizers for group-structured and hierarchical sparsity, low-rank matrices, additive and multiplicative matrix decomposition, and high-dimensional nonparametric models. It includes various examples with motivating applications; it also covers key aspects of statistical theory and provides some discussion of efficient algorithms.
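The sparsity-inducing regularizers surveyed here are often used inside proximal gradient algorithms, where each penalty enters only through its proximal operator. As an illustrative sketch (not drawn from the survey itself), the following shows the proximal operators of the lasso penalty λ‖β‖₁ (coordinate-wise soft-thresholding) and of the group-lasso penalty λ∑ₘ‖β_g‖₂ (blockwise soft-thresholding); the variable names and example groups are hypothetical.

```python
import numpy as np

def prox_l1(v, lam):
    """Proximal operator of lam * ||.||_1 (lasso penalty):
    coordinate-wise soft-thresholding toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_group_l2(v, groups, lam):
    """Proximal operator of the group-lasso penalty lam * sum_g ||v_g||_2:
    each group is either shrunk as a unit or set to zero entirely."""
    out = np.zeros_like(v, dtype=float)
    for g in groups:
        norm = np.linalg.norm(v[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * v[g]
    return out

v = np.array([3.0, -0.5, 1.5, 0.2])

# Lasso: coordinates with magnitude below lam are zeroed,
# the rest are shrunk by lam.
print(prox_l1(v, 1.0))

# Group lasso with (hypothetical) groups {0,1} and {2,3}:
# entire groups survive or vanish together.
print(prox_group_l2(v, [[0, 1], [2, 3]], 1.0))
```

Note the structural difference: the ℓ1 prox zeroes individual coordinates, while the group prox zeroes whole blocks, which is exactly the group-structured sparsity pattern discussed in the survey.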




Literature Cited

1. Agarwal A, Negahban S, Wainwright MJ. 2012a. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. 40(5):2452–82
2. Agarwal A, Negahban S, Wainwright MJ. 2012b. Noisy matrix decomposition via convex relaxation: optimal rates in high dimensions. Ann. Stat. 40(2):1171–97
3. Amini AA, Wainwright MJ. 2009. High-dimensional analysis of semidefinite relaxations for sparse principal component analysis. Ann. Stat. 37(5B):2877–921
4. Bach F. 2008. Consistency of trace norm minimization. J. Mach. Learn. Res. 9:1019–48
5. Bach F, Jenatton R, Mairal J, Obozinski G. 2012. Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1):1–106
6. Baraniuk RG, Cevher V, Duarte MF, Hegde C. 2010. Model-based compressive sensing. IEEE Trans. Inf. Theory 56(4):1982–2001
7. Berthet Q, Rigollet P. 2013. Computational lower bounds for sparse PCA. Tech. Rep., Princeton Univ., Princeton, NJ. http://arxiv.org/abs/1304.0828
8. Bertsekas DP. 1995. Nonlinear Programming. Belmont, MA: Athena Sci.
9. Bickel P, Li B. 2006. Regularization in statistics. TEST 15(2):271–344
10. Bickel P, Ritov Y, Tsybakov A. 2009. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4):1705–32
11. Boyd S, Vandenberghe L. 2004. Convex Optimization. Cambridge, UK: Cambridge Univ. Press
12. Bredies K, Lorenz DA. 2008. Linear convergence of iterative soft thresholding. J. Fourier Anal. Appl. 14:813–37
13. Bühlmann P, van de Geer S. 2011. Statistics for High-Dimensional Data. New York: Springer
14. Bunea F, Tsybakov A, Wegkamp M. 2007. Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1:169–94
15. Cai T, Liu W, Luo X. 2011. A constrained ℓ1-minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106:594–607
16. Cai TT, Liu W, Zhou HH. 2012. Estimating sparse precision matrices: optimal rates of convergence and adaptive estimation. Tech. Rep., Wharton Sch., Univ. Pa., Philadelphia, PA. http://arxiv.org/abs/1212.2882
17. Candès EJ, Li X, Ma Y, Wright J. 2011. Robust principal component analysis? J. ACM 58(3):11
18. Candès EJ, Plan Y. 2011. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Trans. Inf. Theory 57(4):2342–59
19. Candès EJ, Recht B. 2009. Exact matrix completion via convex optimization. Found. Comput. Math. 9(6):717–72
20. Candès EJ, Tao T. 2007. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35(6):2313–51
21. Chandrasekaran V, Parrilo PA, Willsky AS. 2012. Latent variable graphical model selection via convex optimization. Ann. Stat. 40(4):1935–67
22. Chandrasekaran V, Sanghavi S, Parrilo PA, Willsky AS. 2011. Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21:572–96
23. Chen S, Donoho DL, Saunders MA. 1998. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1):33–61
24. Cohen A, Dahmen W, DeVore R. 2009. Compressed sensing and best k-term approximation. J. Am. Math. Soc. 22(1):211–31
25. d'Aspremont A, Banerjee O, El Ghaoui L. 2008. First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl. 30(1):55–66
26. Donoho DL. 2006. Compressed sensing. IEEE Trans. Inf. Theory 52(4):1289–306
27. Donoho DL, Tanner JM. 2008. Counting faces of randomly-projected polytopes when the projection radically lowers dimension. J. Am. Math. Soc. 22:1–53
28. Duchi J, Shalev-Shwartz S, Singer Y, Chandra T. 2008. Efficient projections onto the ℓ1-ball for learning in high dimensions. Presented at Int. Conf. Mach. Learn., 25th, Helsinki, Finland
29. Fan J, Li R. 2001. Variable selection via non-concave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456):1348–60
30. Fazel M. 2002. Matrix rank minimization with applications. PhD Thesis, Stanford Univ., Stanford, CA. http://faculty.washington.edu/mfazel/thesis-final.pdf
31. Friedman J, Hastie T, Tibshirani R. 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432–41
32. Fu WJ. 2001. Penalized regression: the bridge versus the Lasso. J. Comput. Graph. Stat. 7(3):397–416
33. Greenshtein E, Ritov Y. 2004. Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli 10:971–88
34. Gross D. 2011. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory 57(3):1548–66
35. Gyorfi L, Kohler M, Krzyzak A, Walk H. 2002. A Distribution-Free Theory of Nonparametric Regression. New York: Springer
36. Hoerl AE, Kennard RW. 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67
37. Hsu D, Kakade SM, Zhang T. 2011. Robust matrix decomposition with sparse corruptions. IEEE Trans. Inf. Theory 57(11):7221–34
38. Huang J, Zhang T. 2010. The benefit of group sparsity. Ann. Stat. 38(4):1978–2004
39. Jacob L, Obozinski G, Vert JP. 2009. Group Lasso with overlap and graph Lasso. Presented at Int. Conf. Mach. Learn., 26th, Montreal, Can.
40. Jalali A, Ravikumar P, Sanghavi SS, Ruan C. 2010. A dirty model for multi-task learning. Adv. Neural Inf. Process. Syst. 23:964–72
41. Javanmard A, Montanari A. 2013. Hypothesis testing in high-dimensional regression under the Gaussian random design model: asymptotic theory. Tech. Rep., Stanford Univ., Stanford, CA. http://arxiv.org/abs/1301.4240
42. Kim Y, Kim J, Kim Y. 2006. Blockwise sparse regression. Stat. Sin. 16(2):375–90
43. Koltchinskii V, Lounici K, Tsybakov AB. 2011. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39:2302–29
44. Koltchinskii V, Yuan M. 2008. Sparse recovery in large ensembles of kernel machines. Presented at Annu. Conf. Learn. Theory, 21st, Helsinki, Finland
45. Koltchinskii V, Yuan M. 2010. Sparsity in multiple kernel learning. Ann. Stat. 38:3660–95
46. Lam C, Fan J. 2009. Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Stat. 37:4254–78
47. Lin Y, Zhang HH. 2006. Component selection and smoothing in multivariate nonparametric regression. Ann. Stat. 34:2272–97
48. Liu H, Lafferty J, Wasserman L. 2009. The nonparanormal: semiparametric estimation of high-dimensional undirected graphs. J. Mach. Learn. Res. 10:1–37
49. Loh P, Wainwright MJ. 2012. High-dimensional regression with noisy and missing data: provable guarantees with non-convexity. Ann. Stat. 40(3):1637–64
50. Lounici K, Pontil M, Tsybakov AB, van de Geer S. 2011. Oracle inequalities and optimal inference under group sparsity. Ann. Stat. 39(4):2164–204
51. Mazumder R, Hastie T, Tibshirani R. 2010. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11:2287–322
52. Meier L, van de Geer S, Bühlmann P. 2009. High-dimensional additive modeling. Ann. Stat. 37:3779–821
53. Meinshausen N, Bühlmann P. 2006. High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34:1436–62
54. Micchelli CA, Morales JM, Pontil M. 2013. Regularizers for structured sparsity. Adv. Comput. Math. 38:455–89
55. Negahban S, Ravikumar P, Wainwright MJ, Yu B. 2012. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci. 27(4):538–57
56. Negahban S, Wainwright MJ. 2011a. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat. 39(2):1069–97
57. Negahban S, Wainwright MJ. 2011b. Simultaneous support recovery in high-dimensional regression: benefits and perils of ℓ1,∞-regularization. IEEE Trans. Inf. Theory 57(6):3841–63
58. Negahban S, Wainwright MJ. 2012. Restricted strong convexity and (weighted) matrix completion: optimal bounds with noise. J. Mach. Learn. Res. 13:1665–97
59. Nesterov Y. 2004. Introductory Lectures on Convex Optimization. New York: Kluwer Acad.
60. Nesterov Y. 2007. Gradient methods for minimizing composite objective function. Tech. Rep. 76, Cent. Oper. Res. Econom., Catholic Univ., Louvain, Belg.
61. Nesterov Y. 2012. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2):341–62
62. Obozinski G, Wainwright MJ, Jordan MI. 2011. Union support recovery in high-dimensional multivariate regression. Ann. Stat. 39(1):1–47
63. Oymak S, Jalali A, Fazel M, Eldar YC, Hassibi B. 2012. Simultaneously structured models with applications to sparse and low-rank matrices. Tech. Rep., Calif. Inst. Technol., Pasadena, CA. http://arxiv.org/abs/1212.3753
64. Raskutti G, Wainwright MJ, Yu B. 2010. Restricted eigenvalue conditions for correlated Gaussian designs. J. Mach. Learn. Res. 11:2241–59
65. Raskutti G, Wainwright MJ, Yu B. 2011. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Trans. Inf. Theory 57(10):6976–94
66. Raskutti G, Wainwright MJ, Yu B. 2012. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res. 13:389–427
67. Ravikumar P, Liu H, Lafferty J, Wasserman L. 2009. SpAM: sparse additive models. J. R. Stat. Soc. Ser. B 71(5):1009–30
68. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. 2011. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 5:935–80
69. Recht B. 2011. A simpler approach to matrix completion. J. Mach. Learn. Res. 12:3413–30
70. Recht B, Fazel M, Parrilo P. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3):471–501
71. Rohde A, Tsybakov A. 2011. Estimation of high-dimensional low-rank matrices. Ann. Stat. 39(2):887–930
72. Rothman AJ, Bickel PJ, Levina E, Zhu J. 2008. Sparse permutation invariant covariance estimation. Electron. J. Stat. 2:494–515
73. Rudelson M, Zhou S. 2012. Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory 59:3434–47
74. Srebro N, Alon N, Jaakkola TS. 2005. Generalization error bounds for collaborative prediction with low-rank matrices. Presented at Neural Inf. Proc. Syst., 17th, Vancouver
75. Srebro N, Rennie J, Jaakkola TS. 2004. Maximum-margin matrix factorization. Presented at Neural Inf. Proc. Syst., 17th, Vancouver
76. Stojnic M, Parvaresh F, Hassibi B. 2009. On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Trans. Signal Process. 57(8):3075–85
77. Stone CJ. 1982. Optimal global rates of convergence for non-parametric regression. Ann. Stat. 10(4):1040–53
78. Stone CJ. 1985. Additive regression and other non-parametric models. Ann. Stat. 13(2):689–705
79. Tibshirani R. 1996. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1):267–88
80. Tikhonov AN. 1943. On the stability of inverse problems. C. R. (Doklady) Acad. Sci. SSSR 39:176–79
81. Tropp JA. 2006. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans. Inf. Theory 52(3):1030–51
82. Tropp JA, Gilbert AC, Strauss MJ. 2006. Algorithms for simultaneous sparse approximation. Part I: greedy pursuit. Signal Process. 86:572–88
83. Tseng P. 2001. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3):474–94
84. Tseng P, Yun S. 2009. A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory Appl. 140:513–35
85. Turlach B, Venables WN, Wright SJ. 2005. Simultaneous variable selection. Technometrics 47:349–63
86. van de Geer S. 2000. Empirical Processes in M-Estimation. Cambridge, UK: Cambridge Univ. Press
87. van de Geer S. 2008. High-dimensional generalized linear models and the Lasso. Ann. Stat. 36:614–45
88. van de Geer S. 2012. Weakly decomposable regularization penalties and structured sparsity. Tech. Rep., ETH Zurich, Switz. http://arxiv.org/abs/1204.4813v2
89. van de Geer S, Bühlmann P. 2009. On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3:1360–92
90. van de Geer S, Bühlmann P, Ritov Y. 2013. On asymptotically optimal confidence regions and tests for high-dimensional models. Tech. Rep., ETH Zurich, Switz. http://arxiv.org/abs/1303.0518
91. Wainwright MJ. 2009. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 55:2183–202
92. Wu TT, Lange K. 2008. Coordinate descent algorithms for Lasso-penalized regression. Ann. Appl. Stat. 2(1):224–44
93. Xu H, Caramanis C, Sanghavi S. 2012. Robust PCA via outlier pursuit. IEEE Trans. Inf. Theory 58(5):3047–64
94. Yuan M. 2010. High-dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11:2261–86
95. Yuan M, Lin Y. 2006. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68:49–67
96. Zhang CH, Huang J. 2008. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Stat. 36(4):1567–94
97. Zhang CH, Zhang SS. 2011. Confidence intervals for low-dimensional parameters with high-dimensional data. Tech. Rep., Rutgers Univ., New Brunswick, NJ. http://arxiv.org/abs/1110.2563
98. Zhang CH, Zhang T. 2012. A general theory of concave regularization for high-dimensional sparse estimation problems. Stat. Sci. 27(4):576–93
99. Zhao P, Rocha G, Yu B. 2009. Grouped and hierarchical model selection through composite absolute penalties. Ann. Stat. 37(6A):3468–97
100. Zhao P, Yu B. 2006. On model selection consistency of Lasso. J. Mach. Learn. Res. 7:2541–67
101. Zhou S, Lafferty J, Wasserman L. 2008. Time-varying undirected graphs. Presented at Annu. Conf. Learn. Theory, 21st, Helsinki, Finland

• Article Type: Review Article