
Abstract

As we navigate and behave in the world, we are constantly deciding, a few times per second, where to look next. The outcomes of these decisions in response to visual input are comparatively easy to measure as trajectories of eye movements, offering insight into many unconscious and conscious visual and cognitive processes. In this article, we review recent advances in predicting where we look. We focus on evaluating and comparing models: How can we consistently measure how well models predict eye movements, and how can we judge the contribution of different mechanisms? Probabilistic models facilitate a unified approach to fixation prediction that allows us to use the share of explainable information gain that a model explains to compare different models across different settings, such as static and video saliency, as well as scanpath prediction. We review how the large variety of saliency maps and scanpath models can be translated into this unifying framework, how much different factors contribute, and how we can select the most informative examples for model comparison. We conclude that the universal scale of information gain offers a powerful tool for the inspection of candidate mechanisms and experimental design that helps us understand the continual decision-making process that determines where we look.
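
The common currency behind these comparisons is information gain: how many bits per fixation a probabilistic model saves over a baseline (typically an image-independent center bias) when assigning likelihoods to observed fixation locations, and how large that saving is relative to the gain of a gold-standard model of the data. The Python sketch below is a minimal illustration of this idea, not the authors' reference implementation; the function names are hypothetical, and it assumes densities given in log base 2 that are normalized to sum to one over the image.

    import numpy as np

    def information_gain(log2_density_model, log2_density_baseline, fixations):
        """Average information gain of a probabilistic fixation model over a
        baseline (e.g., an image-independent center bias), in bits per fixation.

        Both inputs are 2D arrays of log2 fixation densities over image pixels,
        each normalized so that the linear density sums to 1; fixations is an
        iterable of (row, col) pixel locations of observed fixations.
        """
        gains = [log2_density_model[r, c] - log2_density_baseline[r, c]
                 for r, c in fixations]
        return float(np.mean(gains))

    def information_gain_explained(ig_model, ig_gold_standard):
        """Fraction of the explainable information gain (estimated from a
        gold-standard model of the fixation data) captured by a candidate model."""
        return ig_model / ig_gold_standard

On this scale, zero corresponds to the baseline and the gold-standard model defines the upper bound, which is what makes information gain comparable across static image saliency, video saliency, and scanpath prediction.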

