Invariant Recognition Shapes Neural Representations of Visual Input

Andrea Tacchetti; Leyla Isik; Tomaso A. Poggio

doi:10.1146/annurev-vision-091517-034103

Annual Review of Vision Science

Volume 4, 2018

Review Article



Affiliations: Center for Brains, Minds and Machines, MIT, Cambridge, Massachusetts 02139, USA
Vol. 4:403-422 (Volume publication date September 2018) https://doi.org/10.1146/annurev-vision-091517-034103
First published as a Review in Advance on July 27, 2018
Copyright © 2018 by Annual Reviews. All rights reserved

Abstract

Recognizing the people, objects, and actions in the world around us is a crucial aspect of human perception that allows us to plan and act in our environment. Remarkably, our proficiency in recognizing semantic categories from visual input is unhindered by transformations that substantially alter their appearance (e.g., changes in lighting or position). The ability to generalize across these complex transformations is a hallmark of human visual intelligence, which has been the focus of wide-ranging investigation in systems and computational neuroscience. However, while the neural machinery of human visual perception has been thoroughly described, the computational principles dictating its functioning remain unknown. Here, we review recent results in brain imaging, neurophysiology, and computational neuroscience in support of the hypothesis that the ability to support the invariant recognition of semantic entities in the visual world shapes which neural representations of sensory input are computed by human visual cortex.

Keywords: computational neuroscience, invariance, neural decoding, visual representations

