Abstract

Modern deep neural networks achieve impressive performance in engineering applications that require extensive linguistic skills, such as machine translation. This success has sparked interest in probing whether these models are inducing human-like grammatical knowledge from the raw data they are exposed to and, consequently, whether they can shed new light on long-standing debates concerning the innate structure necessary for language acquisition. In this article, we survey representative studies of the syntactic abilities of deep networks and discuss the broader implications that this work has for theoretical linguistics.

DOI: 10.1146/annurev-linguistics-032020-051035
2021-01-04
2024-09-08
  • Article Type: Review Article