Building intelligent systems that are capable of extracting high-level representations from high-dimensional sensory data lies at the core of solving many artificial intelligence–related tasks, including object recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires models with deep architectures that involve many layers of nonlinear processing. In this article, we review several popular deep learning models, including deep belief networks and deep Boltzmann machines. We show that (a) these deep generative models, which contain many layers of latent variables and millions of parameters, can be learned efficiently, and (b) the learned high-level feature representations can be successfully applied in many application domains, including visual object recognition, information retrieval, classification, and regression tasks.






  • Article Type: Review Article