Online Learning Algorithms

Nicolò Cesa-Bianchi; Francesco Orabona

doi:10.1146/annurev-statistics-040620-035329

Annual Review of Statistics and Its Application

Volume 8, 2021

Review Article

Nicolò Cesa-Bianchi¹ and Francesco Orabona²

Affiliations:
¹Department of Computer Science and Data Science Research Center, Università degli Studi di Milano, Milano 20133, Italy
²Department of Electrical and Computer Engineering, Boston University, Boston, Massachusetts 02215, USA
Vol. 8:165-190 (Volume publication date March 2021) https://doi.org/10.1146/annurev-statistics-040620-035329
First published as a Review in Advance on November 09, 2020
Copyright © 2021 by Annual Reviews. All rights reserved

Abstract

Online learning is a framework for the design and analysis of algorithms that build predictive models by processing data one point at a time. Besides being computationally efficient, online algorithms enjoy theoretical performance guarantees that do not rely on statistical assumptions on the data source. In this review, we describe some of the most important algorithmic ideas behind online learning and explain the main mathematical tools for their analysis. Our reference framework is online convex optimization, a sequential version of convex optimization within which most online algorithms are formulated. More specifically, we provide an in-depth description of online mirror descent and follow the regularized leader, two of the most fundamental algorithms in online learning. As the tuning of parameters is typically a difficult task in sequential data analysis, in the last part of the review we focus on coin-betting, an information-theoretic approach to the design of parameter-free online algorithms with good theoretical guarantees.

Keywords: classification, convex optimization, regression, regret minimization
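
To make the online convex optimization protocol mentioned in the abstract concrete, the following is a minimal sketch and is not taken from the article. It runs projected online gradient descent, which is online mirror descent with the squared Euclidean regularizer, on one-dimensional absolute losses and measures regret against a fixed comparator. The function names, the choice of loss, the interval domain, and the step size `radius / sqrt(t)` are all assumptions made for illustration.

```python
import math


def project(x, radius):
    """Euclidean projection of a scalar onto the interval [-radius, radius]."""
    return max(-radius, min(radius, x))


def online_gradient_descent(targets, radius=1.0):
    """Projected online gradient descent on losses f_t(x) = |x - y_t|.

    The learner plays in [-radius, radius] and uses the step size
    eta_t = radius / sqrt(t) (an assumed, standard choice).
    Returns the predictions, each made before its target is revealed.
    """
    x = 0.0
    predictions = []
    for t, y in enumerate(targets, start=1):
        predictions.append(x)      # commit to x_t before seeing y_t
        g = (x > y) - (x < y)      # a subgradient of |x - y_t| at x_t
        x = project(x - radius / math.sqrt(t) * g, radius)
    return predictions


def regret(predictions, targets, comparator):
    """Cumulative loss of the learner minus that of a fixed comparator."""
    learner_loss = sum(abs(p - y) for p, y in zip(predictions, targets))
    comparator_loss = sum(abs(comparator - y) for y in targets)
    return learner_loss - comparator_loss


if __name__ == "__main__":
    targets = [1.0, -1.0] * 50
    preds = online_gradient_descent(targets)
    print(regret(preds, targets, comparator=0.0))
```

Under these assumptions (subgradients bounded by 1 and a domain of diameter `2 * radius`), the standard analysis bounds the regret against any comparator in the interval by `3 * radius * sqrt(T)` after `T` rounds, with no statistical assumption on how the targets are generated.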
