
Abstract

Online learning is a framework for the design and analysis of algorithms that build predictive models by processing data one piece at a time. Besides being computationally efficient, online algorithms enjoy theoretical performance guarantees that do not rely on statistical assumptions about the data source. In this review, we describe some of the most important algorithmic ideas behind online learning and explain the main mathematical tools for their analysis. Our reference framework is online convex optimization, a sequential version of convex optimization within which most online algorithms are formulated. More specifically, we provide an in-depth description of online mirror descent and follow the regularized leader, two of the most fundamental algorithms in online learning. As parameter tuning is typically a difficult task in sequential data analysis, in the last part of the review we focus on coin-betting, an information-theoretic approach to the design of parameter-free online algorithms with good theoretical guarantees.
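The abstract mentions online mirror descent and coin-betting only at a high level; the two sketches below illustrate the ideas in miniature. Both are illustrative reconstructions, not code from the article. The first shows projected online gradient descent, the Euclidean special case of online mirror descent: at each round the learner plays a point, receives a subgradient of that round's loss, takes a gradient step, and projects back onto the feasible set. All function and parameter names (project_l2_ball, eta, radius) are our own choices for the sketch.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_gradient_descent(subgradients, dim, eta=0.1, radius=1.0):
    """Projected online (sub)gradient descent: online mirror descent
    with the squared Euclidean norm as the mirror map.

    subgradients: iterable of callables, g(w) -> subgradient of that
    round's loss at w.  Returns the sequence of points played.
    """
    w = np.zeros(dim)
    played = []
    for g in subgradients:
        played.append(w.copy())                       # play the current point
        w = project_l2_ball(w - eta * g(w), radius)   # gradient step, then project
    return played
```

With a learning rate eta of order 1/sqrt(T), this scheme guarantees regret of order sqrt(T) against any fixed comparator in the ball, with no statistical assumptions on how the losses are generated. The difficulty of choosing eta without knowing the scale of the problem in advance is exactly what motivates the parameter-free methods discussed in the last part of the review. The second sketch shows the coin-betting mechanism behind those methods, instantiated with the standard Krichevsky-Trofimov bettor: the learner repeatedly wagers a fraction of its current wealth equal to the running average of past outcomes, and the signed wager doubles as the prediction of a one-dimensional online learner when the outcome is taken to be the negative subgradient.

```python
def kt_coin_betting(outcomes, initial_wealth=1.0):
    """Krichevsky-Trofimov coin bettor.  outcomes: c_1, c_2, ... in [-1, 1].

    At round t the bettor wagers beta_t * wealth, where beta_t is the
    average of the past outcomes; wealth then grows by the factor
    (1 + beta_t * c_t).  Since |beta_t| <= (t-1)/t < 1, wealth stays positive.
    """
    wealth = initial_wealth
    running_sum = 0.0
    wagers = []
    for t, c in enumerate(outcomes, start=1):
        beta = running_sum / t    # KT betting fraction
        wager = beta * wealth     # signed bet; also the learner's prediction
        wagers.append(wager)
        wealth += wager * c       # win or lose proportionally to the outcome
        running_sum += c
    return wagers, wealth
```

Note that no learning rate appears anywhere in the bettor: a lower bound on its wealth translates, through the online-to-betting reduction, into a regret bound that adapts automatically to the unknown scale of the comparator.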

