Computational Models of Anaphora

Interpreting anaphoric references is a fundamental aspect of our language competence that has long attracted the attention of computational linguists. The appearance of ever-larger anaphorically annotated data sets, covering more and more anaphoric phenomena in ever-greater detail, has spurred the development of increasingly sophisticated computational models; as a result, the most recent state-of-the-art neural models achieve impressive performance by leveraging linguistic, lexical, discourse, and encyclopedic information. This article provides a thorough survey of anaphora resolution (coreference) throughout this development, reviewing the available data sets and covering both the preneural history of the field and, in more detail, current neural models, including research on less-studied aspects of anaphoric interpretation such as bridging reference resolution and discourse deixis interpretation.


INTRODUCTION
Interpreting anaphoric references is an aspect of our linguistic competence that has attracted much interest from theoretical, psycho-, and computational linguists, in part because it straddles sentential and intersentential interpretation; in part because it draws on all types of information, from lexical to syntactic to contextual information to commonsense knowledge; and in part, finally, because human judgments on anaphoric interpretation are much sharper than judgments on aspects of interpretation such as rhetorical structure or even syntax. Evidence from anaphoric reference has played a key role in the development of modern theories of syntax (e.g., binding; Büring 2005), of discourse models and their role in semantics (Karttunen 1976, Webber 1979, Heim 1982, Kamp et al. 2011), and of salience and its role in interpretation (Sidner 1979, Grosz & Sidner 1986, Grosz et al. 1995).
The first author of this review coauthored a book on anaphora resolution fairly recently (Poesio et al. 2016b). However, computational linguistics (CL) moves fast. By the time that book was completed, the field had already changed dramatically with the appearance of the first neural models. In addition, that book did not cover aspects of anaphoric reference that had not been extensively studied in CL until very recently, such as bridging reference, discourse deixis, and the interpretation of plurals (Hou et al. 2013, Marasović et al. 2017, Hou et al. 2018, Roesiger et al. 2018, Yu & Poesio 2020, Yu et al. 2021). This review thus aims to provide a more complete (if more succinct) survey of the area, including those developments not covered in the book by Poesio et al. (2016b).

THE COMPUTATIONAL PERSPECTIVE ON ANAPHORA
In this review, we do not attempt to provide a full introduction to the linguistics and psycholinguistics of anaphora, which are well covered in works by, for instance, Kamp & Reyle (1993), Garnham (2001), Büring (2005), and Gundel & Abbott (2019) as well as Poesio et al. (2016b). We instead concentrate on the aspects of anaphora most studied in CL. Anaphoric expressions are expressions whose interpretation depends in part or entirely on the entities mentioned in the linguistic context, that is, the previous utterances and their content. Such dependency is particularly obvious in the case of pronouns like he, him, or his in the following text, whose interpretation entirely depends on which entity is chosen as the antecedent. But other types of noun phrases (NPs) depend on the linguistic context for their interpretation as well, including nominals such as your father or even proper nouns like the second instance of Maupin in example 1, which could refer either to Armistead Jones Maupin Jr. (the author of Tales of the City) or his father, Armistead Jones Maupin (who served in the US Navy):

(1) [Maupin]_i recalls [his] …

Most computational models of anaphoric reference interpretation (a task we refer to here as anaphora resolution) tend to be based on (some version of) the discourse model approach to the semantics of anaphora pioneered by Bransford and colleagues in psycholinguistics (for a review, see Garnham 2001) and by Karttunen (1976) in theoretical linguistics and Webber (1979) in CL, which led to dynamic semantics theories (Kamp et al. 2011). In such models, interpretation takes place against a context that consists of discourse entities; each new sentence may contain references to these discourse entities and/or result in new entities being added to the context. In CL, discourse entities typically take the form of coreference chains: clusters of mentions all referring to the same entity.

Anaphora and coreference.
Early work focused primarily on pronominal anaphoric reference, but ever since the appearance of the first substantial anaphorically annotated corpora, and in particular since the first classic model of coreference resolution (Soon et al. 2001), most research has been concerned with developing models capable of interpreting all types of reference to discourse entities via nominals. Yet, in much CL/natural language processing literature a distinction is still made between anaphora resolution and coreference resolution, and the term anaphora is used to indicate pronominal anaphora only. In this review, the terms anaphora and anaphoric reference are used in the more general sense of reference to entities in the discourse model used in semantics (see, e.g., Lyons 1977, Kamp & Reyle 1993) and psycholinguistics (see, e.g., Garnham 2001). In Discourse Representation Theory (DRT) (Kamp et al. 2011), for instance, the proper name Maupin in example 1 adds to the discourse model a new discourse entity i, and all subsequent mentions of Maupin, whether using pronouns or proper names, are interpreted as anaphoric references to entity i. There is substantial disagreement in CL on whether all types of NPs or only referring expressions should be annotated in a corpus for anaphora (Poesio et al. 2016a).
Incorporated and zero anaphora.
In languages other than English, anaphoric reference can be expressed implicitly, or the anaphora can be incorporated in a nonnominal constituent such as a verb. A great deal of attention has been paid in CL to the identification and interpretation of zero anaphora, anaphoric references in which a verbal argument is not realized, which occur in languages such as Arabic, Chinese, Italian, Japanese, and Spanish.
Constraints on anaphoric interpretation.
Syntactic (Büring 2005) and semantic (Karttunen 1976, Heim 1982, Kamp & Reyle 1993) constraints on anaphora have played an important role in linguistic theorizing but only a limited one in recent computational models of anaphora. On the other hand, there has been extensive work on the pragmatic effects of discourse structure on anaphoric reference, which is briefly discussed in Section 2.2.

Associative anaphora (bridging).
Most computational models of anaphora focus on identity relations, largely because of the coverage of existing data sets (see Section 3). However, there has been much interest in associative anaphora as well (Clark 1977), where the anaphoric expression is related to its antecedent by a relation other than identity, as in example 2, in which the kitchen and the garden are associated with the flat introduced in the first sentence. This type of anaphoric reference is usually called a bridging reference in CL because a bridging inference is generally required to identify the antecedent (Clark 1977):

(2) We saw [a flat]_i yesterday. [The kitchen]_j is spacious but [the garden]_k is very small.

Other cases of anaphoric reference to antecedents not explicitly introduced with nominals.
Other cases of anaphoric reference to antecedents not introduced via nominals have also been studied in CL (Eschenbach et al. 1989, Webber 1991, Kolhatkar et al. 2018). One is discourse deixis, or anaphora with nonnominal antecedents (Webber 1991, Kolhatkar et al. 2018; see example 3). This is a type of anaphora in which the antecedent is an abstract entity associated with the propositional content of a segment:

(3) The municipal council had to decide [whether to balance the budget by raising revenue or cutting spending]_i. The council had to come to a resolution by the end of the month. [This issue]_i was dividing communities across the country.
Interpreting some cases of anaphoric reference requires updating the context via some explicit construction operation. The simplest among these are the cases of split antecedent anaphora studied by Eschenbach et al. (1989) and Kamp & Reyle (1993) and illustrated in example 4. The antecedent for they is a plural entity that is not explicitly mentioned but somehow constructed out of the explicitly mentioned Michael and Maria:

(4) [Michael]_i was at the cinema with [Maria]_j. [They]_{i+j} had a great time.

Resolving other anaphors requires commonsense inference, as in Winograd's (1972) classic minimal pair, in which they refers to the council under one continuation (feared) and to the demonstrators under the other (advocated):

(5) [The city council]_i refused [the demonstrators]_j a permit because [they]_{i/j} feared/advocated violence.

This minimal pair recently acquired great prominence as the first example of what has become known as the Winograd Schema approach to evaluating anaphora resolution proposed by Levesque et al. (2012) (see Section 3).

Syntactic constraints.
The prohibition against him coreferring with John in *[John]_i likes [him]_i played an important role in linguistic theorizing, as discussed above (Büring 2005). Such constraints also played an important role in early models of pronominal interpretation such as Hobbs's "naive algorithm" (Hobbs 1978) but not in recent models.

Discourse factors.
It has long been known that more recently introduced entities are more likely antecedents; in CL, Hobbs (1978), for instance, reported that in his corpus, 98% of pronoun antecedents were in the current or the previous sentence. A stronger hypothesis is that linguistic focusing mechanisms, attentional mechanisms of the type found in visual interpretation, also affect the interpretation of anaphoric expressions (Grosz 1977, Sidner 1979, Sanford & Garrod 1981). According to the best-known theory of this type in CL, proposed by Grosz & Sidner (1986), two levels of structure exist in discourse: the global focus, which specifies the articulation of discourse segments; and the local focus, which specifies how, utterance by utterance, the relative salience of entities changes. Authors such as Grosz & Sidner (1986), Mann & Thompson (1988), Webber (1991), and Asher & Lascarides (2003) have argued that discourse segments have a hierarchical structure that affects anaphoric interpretation (see, e.g., Fox 1987 for an analysis of some of these claims). Sidner (1979) proposed the first detailed theory of the local focus; Centering (Grosz et al. 1995) eventually evolved into the dominant theory of the local focus in CL and, to some extent, in psycholinguistics (Walker et al. 1998, Poesio et al. 2004b).
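Centering's core bookkeeping can be illustrated with a short sketch. The transition names and the definitions of the backward-looking center Cb and preferred center Cp follow the standard formulation; the flat-list encoding of utterances and the example sentences are our own illustrative simplifications.

```python
# Sketch of Centering's transition computation (Grosz et al. 1995). An
# utterance U_n is represented by its list of forward-looking centers
# Cf(U_n), ranked by salience (subject first). Illustrative simplification.

def backward_center(cf_prev, cf_curr):
    """Cb(U_n): the highest-ranked element of Cf(U_{n-1}) realized in U_n."""
    for entity in cf_prev:
        if entity in cf_curr:
            return entity
    return None

def transition(cf_prev, cb_prev, cf_curr):
    """Classify the transition from the previous utterance to the current one."""
    cb = backward_center(cf_prev, cf_curr)
    cp = cf_curr[0] if cf_curr else None  # Cp(U_n): the preferred center
    if cb is None:
        return cb, "NO-CB"
    if cb_prev is None or cb == cb_prev:
        return cb, "CONTINUE" if cb == cp else "RETAIN"
    return cb, "SMOOTH-SHIFT" if cb == cp else "ROUGH-SHIFT"

# "John went to the store. He bought a book.": John remains both Cb and Cp.
print(transition(["John", "store"], None, ["John", "book"]))  # CONTINUE
```

Transition types ordered by predicted coherence (CONTINUE over RETAIN over SHIFT) are what Centering-based resolvers use to rank candidate interpretations.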

Ambiguity
One property of anaphoric reference that was not extensively studied in either the linguistic or the psycholinguistic literature on anaphora but that has been highlighted by large-scale anaphoric annotation efforts in CL is the fact that many anaphoric expressions do not have a preferred interpretation in context (Poesio & Artstein 2005, Recasens et al. 2011). The prevalence of ambiguous cases in anaphorically annotated corpora ranges from 10-15% in more formal texts (Pradhan et al. 2012, Poesio et al. 2019) to 30-40% in dialogue data and when discourse deixis is also annotated (Poesio & Artstein 2005). This evidence suggests that ambiguity cannot be ignored in a proper empirical treatment of anaphora.

Annotating Nominal Anaphora: The Options

The definition of markable.
The prototypical markable (the item to annotate) in most anaphoric data sets is the NP, generally considered in its entirety; however, some differences exist between data sets, and other sentence constituents are also considered, such as possessive pronouns as well as zeros for languages such as Arabic, Chinese, Italian, and Japanese. Semantic and discourse restrictions on the definition of markable are often also imposed. In particular, very few corpora attempt to annotate all types of NPs discussed in Section 2.1 (Poesio et al. 2016a). So, for instance, in the most used anaphoric data set for Arabic, Chinese, and English, OntoNotes (Pradhan et al. 2012), only some types of referring NPs and some types of predicative NPs are annotated (see Section 2.1); other types of predicative NPs, expletives, and other types of nonreferring NPs are not. In fact, in OntoNotes and other data sets, only NPs that refer to entities mentioned more than once are annotated; so-called singletons are not. As a result, most CL work on anaphora resolution focuses on referring expressions only and does not attempt to resolve ambiguities such as those between the expletive or anaphoric interpretation of it and the predicative or referential interpretation of some indefinite NP. As an additional restriction, some data sets created to study the effect of anaphora on information extraction only annotate markables denoting certain types of entities; for instance, in the ace corpora (Doddington et al. 2000), only NPs that refer to a few types of entities are annotated (e.g., persons, organizations), and others are not (e.g., references to animals, art objects, substances).

Predication.
One of the most discussed properties of the annotation schemes used for the original information-extraction-led data sets such as muc and ace (Chinchor & Sundheim 1995, Doddington et al. 2000) was the inclusion in "coreference resolution" of what linguistically would be considered cases of predication. In these corpora, a good man would be marked as coreferring with Maupin's father in the third sentence of example 1. This approach raised the problems discussed by, among others, van Deemter & Kibble (2000), leading, for instance, to implausible coreference relations when predications change over time, such as net income in example 8 (from the wsj portion of the arrau corpus). Contemporary corpora greatly differ with respect to how they treat predication.

The range of relations.
All anaphoric data sets annotate identity, that is, mentioning again a previously mentioned entity, although as we have seen, some data sets only consider identity relations between a subset of the mentions. For many years only small, dedicated data sets were available to study bridging reference resolution, such as gnome (Poesio et al. 2004a) and isnotes (Markert et al. 2012). However, bridging references are annotated in many if not most of the more recent larger data sets (see Table 1). But it should be noted that there is much less agreement on the annotation schemes for bridging reference than on those for identity reference (Poesio et al. 2016a, Roesiger et al. 2018). Even smaller is the number of annotation projects that cover discourse deixis, but again the number is growing (see Table 1). However, substantial differences exist between the guidelines adopted in these different projects (for details, see Kolhatkar et al. 2018).

Ambiguity.
Only a few corpora provide information about cases of anaphoric ambiguity. In arrau, ambiguity is marked explicitly: Annotators can provide multiple interpretations. In Phrase Detectives, ambiguity is marked implicitly: Annotators can provide only one interpretation, but because a large number of players provide judgments for each markable (20 on average), disagreements in interpretation can emerge. In ancora and the pcc, annotators can use a relation of quasi-identity when coreference is possible but not certain.

Full-Text Corpora for Anaphora
Early corpora.
The earliest anaphoric data set we are aware of is the ibm/ucrel Anaphoric Treebank (McEnery et al. 1997). This resource was annotated according to a linguistically motivated scheme, arguably the most ambitious anaphoric scheme tried so far, covering not only bridging and discourse deixis but also various types of ellipsis. Unfortunately, however, the resource was never made publicly available. So the data sets that really kick-started the data-driven shift in anaphora resolution were the corpora created for the Message Understanding Conference (muc) and Automatic Content Extraction (ace) shared tasks (Chinchor & Sundheim 1995, Doddington et al. 2000). These shared tasks also introduced the coreference task as currently understood, in terms of terminology (e.g., use of the term "mentions" to refer to the items to classify) and of focus on nominal anaphora only. Equally importantly, these shared tasks led to the introduction of the first evaluation metrics designed specifically for anaphora (see Section 5). However, the task definition also raised issues such as the conflation of predication and anaphoric reference discussed earlier, or, in ace, the restriction on the range of entities considered.

Linguistically motivated data sets.
The discussions about the specification of the coreference task in muc and ace (van Deemter & Kibble 2000) eventually led to proposals for the annotation of anaphoric information (Passonneau 1997, Poesio et al. 1999) that were more directly based on the linguistic approach to anaphora discussed in Section 2.1. Most of the corpora developed since have adopted a similar approach (Poesio 2004, Hinrichs et al. 2005, Hendrickx et al. 2008, Poesio & Artstein 2008, Nedoluzhko et al. 2009, Recasens & Martí 2010, Pradhan et al. 2012, Ogrodniczuk et al. 2015, Landragin 2016, Zeldes 2020). In particular, the creation of OntoNotes (Pradhan et al. 2012) and the shared tasks based on OntoNotes and other data sets of this type (Recasens et al. 2010, Pradhan et al. 2012) led to a move away from the modeling of coreference in the sense of muc and ace and toward anaphora resolution as traditionally conceived in linguistics and psychology.

Genres.
Most of the early data sets focused on news articles and broadcasts, but the more recent data sets cover other genres as well. This is important because the news genre provides a skewed picture of the use of anaphoric reference in language; focusing exclusively on such data limits both the generality of the linguistic findings and the usefulness of models trained on the data when applied to other domains (Xia & Durme 2021). Substantial corpora now exist for studying anaphora resolution in the biomedical domain, including, for instance, genia (Yang et al. 2004) and craft (Cohen et al. 2017).

[Fragment of the Table 1 caption: the annotation scheme followed (muc (Chinchor & Sundheim 1995), mate (Poesio et al. 1999), or m/o for the version of the mate guidelines developed for OntoNotes (Pradhan et al. 2012)); whether nonreferring expressions (NR?), bridging references (BR?), and discourse deixis (DD?) are annotated; and whether multiple interpretations for ambiguous markables are included.]

Benchmarks
As discussed above, the first computational models of anaphora were tested against benchmarks containing examples of whichever aspect of anaphoric interpretation a model was developed to handle (Hobbs 1978, Carter 1987). However, this approach was of limited use in assessing the performance of a computational model on real text, so full-text evaluation became the standard approach to evaluation in the field once data sets like muc became available. But in recent years the realization has been growing that full-text evaluation has limitations too: Because of the prevalence of relatively easy cases in test data sets, a high score may not indicate truly good performance (Barbu & Mitkov 2001, Webster et al. 2018), the more so when many of the hard cases are excluded a priori because of insufficient agreement (Poesio & Artstein 2005, Recasens et al. 2011). As a result, we are witnessing a return to benchmarks as a way of evaluating anaphora resolution.

Resolving gender-ambiguous cases.
To test the ability of systems to resolve pronouns without the help of gender cues, the gap data set was launched (Webster et al. 2018).

Remaining Gaps
We are in a much better situation than at the beginning of the data-driven era, and quality data sets of a substantial size are now available for many languages. But significant gaps remain. For one thing, many languages are still not covered or are covered only by relatively small data sets; for instance, the largest available data sets for Arabic just pass the 300,000-token threshold used for Table 1, and the only data sets we are aware of for some of the most spoken languages in the world (Bengali, Hindi, Portuguese, Russian, and Turkish) do not. Also, the focus so far has been mainly on written language. Very few data sets cover spoken language, and we are aware of only one large corpus of spoken language annotated for coreference: the ancor corpus of spoken French (Muzerelle et al. 2014). [A medium-sized dialogue corpus for English was recently created for the codi/crac shared task on anaphora in dialogue (Khosla et al. 2021).] Last but not least, there are still many aspects of anaphoric interpretation (e.g., discourse deixis) for which we lack a solid theoretical foundation. And even our understanding of identity anaphora as reflected in the guidelines used is still partial; Zaenen's admonitions that, for instance, "The problems with the 'coreference' annotation tasks of muc and the like are well documented and not solved" (Zaenen 2006, p. 578) remain pertinent (see also the Universal Anaphora pages). The data sets for biomedical information are surveyed by Cohen et al. (2017), and the benchmarks for the Winograd Schema Challenge by Kocijan et al. (2020). Ide & Pustejovsky (2017) provide more in-depth discussion of some corpora.

THE PRENEURAL PERIOD
The history of anaphora resolution research can be divided broadly into three periods. In the first period, which lasted until the early 1990s, cognitively and linguistically motivated models were tested on a few select examples. In the second, data-driven period, shallower heuristic and then statistical models were developed that could be tested on increasingly larger amounts of full text. Finally, in the current, neural-nets-based period, the extraction of syntactic and semantic information from the text is left almost entirely to the models themselves, lexical and commonsense information is encoded using embeddings, and attentional mechanisms are incorporated in the architecture. The preneural models have been covered in great detail by Poesio et al. (2016b), so in this section we provide only a short summary of that work, and in the next section we dedicate more space to the current state of the art. In Sections 4 and 6 we focus on identity reference; the other types of anaphora resolution are covered in Section 7.

Cognitively and Linguistically Rooted Early Models
The computational models proposed in the early years of research in anaphora resolution were rooted very directly in findings about anaphora from linguistic and psycholinguistic studies such as those discussed in Section 2. They focused on testing the predictions of cognitive and linguistic theories of anaphoric interpretation and therefore generally assumed a perfect syntactic and semantic analysis of the input as a starting point for anaphora resolution and/or assumed that all the needed commonsense knowledge was available.

Syntax-based algorithms.
One of the main strands of research on computational models of anaphora focused on testing syntactic constraints and preferences on pronoun resolution. The most influential work in this area is the so-called Hobbs algorithm (Hobbs 1978), which incorporates the syntactic constraints and preferences discussed in Section 2 and provided a competitive baseline for pronoun resolution well into the data-driven period.

Commonsense knowledge and inference-based approaches.
Much of the early work on anaphora resolution in CL (and psychology) was devoted to providing an account of the effects of commonsense knowledge and inference on the interpretation of anaphoric expressions like the one seen in example 7. The most developed proposal was the Interpretation as Abduction formal account of the inferences involved in interpreting anaphoric reference and other aspects of language interpretation, implemented in the tacitus system (Hobbs et al. 1993). This is possibly the most detailed account of inference in anaphora resolution, together with the less formal account by Carter (1987). But the first muc shared tasks revealed that this approach would not scale, and thus there was a shift toward more heuristic systems in the subsequent editions of muc.

Salience.
The most detailed computational model of the effect of salience on the interpretation of anaphoric expressions was Sidner's (1979) focus model. In this model, focal information is used to generate salience-based preferences using very detailed (and very complex) rules, which are then assessed by commonsense inference. Two lines of research emerged from Sidner's proposals. Carter (1987) proposed a detailed model of the role of salience in interpretation and its integration with commonsense inference. A second line of work pursued simpler models of salience, leading to the development of Grosz & Sidner's (1986) theory of discourse structure as well as Centering theory (Grosz et al. 1995, Walker et al. 1998). Several computational models of Centering were proposed. Tetreault's Left-to-Right Centering algorithm (Tetreault 2001) was tested on a substantial data set and performed slightly better than Hobbs's algorithm. A different approach to modeling attention in anaphora resolution was the development of so-called activation-based models of salience (e.g., Lappin & Leass 1994). Such models do not hypothesize the existence of foci or centers; instead, each discourse entity has an activation level that is affected by a variety of factors.
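The activation-based idea can be sketched in a few lines. The halving of activations at sentence boundaries follows the spirit of Lappin & Leass (1994); the specific weights below are illustrative stand-ins rather than the values of the original system.

```python
# Sketch of an activation-based salience model: every discourse entity
# carries an activation score that is boosted when the entity is mentioned
# and degraded (here: halved) at each sentence boundary.
# Weights are illustrative, not those of Lappin & Leass (1994).

ROLE_WEIGHT = {"subject": 80, "object": 50, "oblique": 40}
RECENCY_WEIGHT = 100  # any mention in the current sentence

class SalienceModel:
    def __init__(self):
        self.activation = {}

    def new_sentence(self):
        """Degrade all activations: older entities become less salient."""
        for entity in self.activation:
            self.activation[entity] /= 2.0

    def mention(self, entity, role):
        """Boost an entity's activation according to its grammatical role."""
        boost = RECENCY_WEIGHT + ROLE_WEIGHT.get(role, 0)
        self.activation[entity] = self.activation.get(entity, 0.0) + boost

    def most_salient(self):
        """The preferred antecedent is the most activated entity."""
        return max(self.activation, key=self.activation.get)

model = SalienceModel()
model.mention("John", "subject")   # John: 180
model.mention("Bill", "object")    # Bill: 150
model.new_sentence()               # John: 90, Bill: 75
model.mention("Bill", "subject")   # Bill: 255
print(model.most_salient())        # Bill: a recent subject outranks an older one
```

Because the factors are additive and the decay is global, no notion of a single focus or center is needed, which is exactly the contrast with Sidner-style models drawn above.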

Formal approaches to discourse model construction.
The development of DRT and other dynamic logics led to computational models of anaphora resolution based on such theories (Alshawi 1992, Poesio 1994, Bos 2004). The best known of these models is sri's Core Language Engine (Alshawi 1992), which was used in a number of domain-restricted practical applications. Bos's (2004) boxer model for DRT-based semantic interpretation has been shown to be usable for large-scale semantic interpretation.

Heuristic and Knowledge-Poor Approaches
The muc shared tasks led to a shift toward models that could be tested on a larger scale. The key characteristic of these models is that they had to perform with much more limited knowledge than the models discussed above. Unlike the early syntax-based algorithms, they could no longer assume perfect, hand-produced syntactic and semantic knowledge about the input. Instead, they had to rely on the partial or imperfect syntactic analysis produced by existing automatic parsers. And unlike knowledge-based systems, they could not expect a knowledge base that contained a complete set of axioms for all the concepts encountered. Instead, they had to rely on approximate lexical knowledge sources such as WordNet. This was the first dramatic change on the path toward fully data-driven computational models. Another change was that whereas some of these models, like most previous systems, focused on a single type of anaphoric expression, such as pronouns (Baldwin 1997, Mitkov 1998) or definite descriptions (Vieira & Poesio 2000), the systems participating in those early shared tasks had to handle all types of nominals (Kameyama 1997, Humphreys et al. 1998).
The most lasting innovation of the heuristic-based systems of this period is the precision-first architecture pioneered by cogniac (Baldwin 1997) (which, until recently, was still used in the Stanford Deterministic Coreference Resolver; Lee et al. 2013) and systems based on this approach. cogniac resolves pronouns by applying a series of rules ordered so that the most reliable apply first. The same strategy was adopted in the hand-coded version of the Vieira/Poesio system (Vieira & Poesio 2000). The precision-first architecture was revived in the Stanford Deterministic Coreference Resolver for the 2011 conll shared task (Lee et al. 2013). The success of the Stanford Deterministic system was by all accounts due to two characteristics. First of all, the system employed a high-recall and high-precision component for detecting mentions. The performance of mention detection is to this day one of the most important factors in anaphora resolution. Secondly, the mentions thus extracted were processed by 10 heuristic rules, or sieves, ordered from the most accurate to the least accurate. The Stanford Sieve approach is still the best way to develop an anaphoric resolver for a language for which there are no annotated data sets.
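The precision-first control flow can be sketched as follows. The three toy sieves and the dictionary representation of mentions are our own illustrative simplifications; the actual Stanford system uses 10 far more sophisticated sieves over parsed text.

```python
# Sketch of a precision-first (sieve) resolver in the spirit of the Stanford
# Deterministic Coreference Resolver (Lee et al. 2013). Sieves run in
# decreasing order of precision; a mention already resolved by an earlier
# (more reliable) sieve is never touched again. Sieves are simplified toys.

def exact_match(j, mentions):
    """Highest precision: identical surface strings corefer."""
    for i in range(j):
        if mentions[i]["text"] == mentions[j]["text"]:
            return i
    return None

def head_match(j, mentions):
    """Medium precision: mentions sharing a head noun corefer."""
    for i in range(j):
        if mentions[i]["head"] == mentions[j]["head"]:
            return i
    return None

def pronoun_agree(j, mentions):
    """Lowest precision: a pronoun takes the nearest gender-compatible mention."""
    if mentions[j]["pos"] != "PRON":
        return None
    for i in reversed(range(j)):
        if mentions[i]["gender"] == mentions[j]["gender"]:
            return i
    return None

SIEVES = [exact_match, head_match, pronoun_agree]  # most precise first

def apply_sieves(mentions):
    antecedent = {}
    for sieve in SIEVES:
        for j in range(len(mentions)):
            if j not in antecedent:
                found = sieve(j, mentions)
                if found is not None:
                    antecedent[j] = found
    return antecedent

mentions = [
    {"text": "Maupin",     "head": "Maupin", "pos": "PROPN", "gender": "masc"},
    {"text": "his father", "head": "father", "pos": "NP",    "gender": "masc"},
    {"text": "Maupin",     "head": "Maupin", "pos": "PROPN", "gender": "masc"},
    {"text": "he",         "head": "he",     "pos": "PRON",  "gender": "masc"},
]
print(apply_sieves(mentions))  # {2: 0, 3: 2}
```

The ordering is the point: the reliable string-match sieve links the two instances of Maupin before the risky pronoun sieve ever runs, so the pronoun sieve operates on a partially resolved document.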

LI09CH28_Poesio ARjats.cls November 4, 2022 14:14

Some of these models were essentially versions of the heuristic systems in which the optimal order among the heuristics was learned from the data, but soon more advanced models appeared, in particular the mention-pair model proposed by Aone & Bennett (1995) and made popular by Soon et al. (2001). The mention-pair model is a simple way to recast anaphora resolution as a classification task: The model is trained to decide whether the two mentions (markables) within a pair corefer. "Resolving" a potential anaphor m_j is thus viewed as the task of finding the mention m_i whose probability of coreferring with m_j is maximal: argmax_{m_i} P(C = 1 | m_i, m_j). A coreference resolver based on this architecture

1. goes through the markables in a text (generally, but not always, in the order specified by the text);
2. for each markable m_j, identifies a set of possible candidate antecedents; and
3. for each (m_j, candidate antecedent m_i) pair, extracts a number of features (see below) and uses them to compute the probability of C = 1.
Once all mentions have been classified, a clustering algorithm is used to build coreference chains out of the anaphoric links identified by the model. The Soon et al. model was the reference architecture for anaphora resolution for many years.
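The pipeline just described can be made concrete in a short sketch. In a real system, `pair_probability` would be a classifier trained on annotated mention pairs; the hand-written scores, the threshold, and the two toy features below are illustrative stand-ins.

```python
# Sketch of the mention-pair architecture (Soon et al. 2001): score pairs,
# link each mention to its argmax antecedent, then cluster the links into
# coreference chains. The scoring function stands in for a trained classifier.

def pair_probability(m_i, m_j):
    """Stand-in for a learned estimate of P(C = 1 | m_i, m_j)."""
    score = 0.0
    if m_i["head"] == m_j["head"]:
        score += 0.6                                   # head-match feature
    if m_i["gender"] == m_j["gender"] and m_j["pos"] == "PRON":
        score += 0.6                                   # pronoun agreement feature
    return min(score, 1.0)

def resolve_pairs(mentions, threshold=0.5):
    """Link each mention to its best-scoring earlier candidate, if any."""
    links = []
    for j in range(1, len(mentions)):
        prob, i = max((pair_probability(mentions[i], mentions[j]), i)
                      for i in range(j))               # argmax over candidates
        if prob >= threshold:
            links.append((i, j))
    return links

def build_chains(links, n):
    """Cluster pairwise links into coreference chains (naive union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, j in links:
        parent[find(j)] = find(i)
    chains = {}
    for m in range(n):
        chains.setdefault(find(m), []).append(m)
    return sorted(chains.values())

mentions = [
    {"head": "Maupin", "pos": "PROPN", "gender": "masc"},
    {"head": "his",    "pos": "PRON",  "gender": "masc"},
    {"head": "Maupin", "pos": "PROPN", "gender": "masc"},
]
print(build_chains(resolve_pairs(mentions), len(mentions)))  # [[0, 1, 2]]
```

Note that the coreference decisions are made pairwise and independently; the chains only emerge in the separate clustering step, which is precisely the design choice the entity-mention models discussed next react against.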

Entity-mention and mention-ranking models.
From a linguistic and cognitive perspective, viewing anaphora resolution as a mention-pairing task is a drastic simplification of discourse model construction that, for instance, would appear unable to handle anaphoric references to entities not introduced via NPs. From a machine learning perspective as well, this approach would appear limited, as it considers only mention and mention-pair features, not features of entities. These shortcomings led to the development of so-called entity-mention models in which mentions are directly linked to entities (clusters), as done in the prestatistical models (e.g., Luo et al. 2004, Culotta et al. 2007). A second respect in which many models have diverged from the Soon et al. (2001) architecture is the use of the "best first" approach to antecedent selection: considering multiple antecedents in parallel and choosing the one that is highest ranked instead of considering one candidate at a time. The cluster-ranking model by Rahman & Ng (2011), which combined the entity-mention architecture with a ranking approach, achieved state-of-the-art results for its time.
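The ranking idea can be sketched as follows. All candidate antecedents of a mention, plus a dummy "no antecedent" option, compete directly, and the top-ranked candidate wins; the scoring function is an illustrative stand-in for a learned ranker.

```python
# Sketch of a mention-ranking resolver: candidates for mention j are ranked
# jointly rather than classified pairwise, and a dummy NO_ANTECEDENT option
# lets "start a new entity" compete with every link decision.
# Scores are illustrative stand-ins for a learned ranking function.

NO_ANTECEDENT = None

def score(m_i, m_j):
    """Ranking score for candidate antecedent m_i of mention m_j."""
    if m_i is NO_ANTECEDENT:
        # Full nominals often start new entities; pronouns rarely do.
        return 0.5 if m_j["pos"] != "PRON" else 0.1
    s = 0.0
    if m_i["head"] == m_j["head"]:
        s += 1.0
    if m_i["gender"] == m_j["gender"] and m_j["pos"] == "PRON":
        s += 0.7
    return s

def rank_antecedents(mentions, j):
    """Return the index of the best antecedent for mention j, or None."""
    candidates = [NO_ANTECEDENT] + list(range(j))
    def key(c):
        m_i = NO_ANTECEDENT if c is NO_ANTECEDENT else mentions[c]
        return score(m_i, mentions[j])
    return max(candidates, key=key)

mentions = [
    {"head": "Maupin", "pos": "PROPN", "gender": "masc"},
    {"head": "his",    "pos": "PRON",  "gender": "masc"},
    {"head": "flat",   "pos": "NOUN",  "gender": "neut"},
]
print(rank_antecedents(mentions, 1))  # 0: the pronoun picks Maupin
print(rank_antecedents(mentions, 2))  # None: "flat" starts a new entity
```

Because all candidates are compared at once, there is no per-pair threshold to tune: the decision to start a new entity simply has to outrank every link, which is the "best first" behavior described above.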

Extended feature sets.
Another active line of research focused on improving on the Soon et al. (2001) model by employing a richer set of features. This work led to the hypothesis that richer feature sets could lead to improvements only with larger data sets than muc. This hypothesis was indirectly confirmed by Bengtson & Roth (2008), who found that when testing on a larger data set, the ace 2004 corpus, a simple mention-pair model using carefully chosen features could outperform the state-of-the-art system by Culotta et al. (2007). Research was also carried out on methods for mining the values of these features (Bergsma 2016).

Lexical and commonsense knowledge.
Features that encode the lexical and commonsense knowledge required during anaphora resolution include selectional restrictions on the interpretation of pronouns (Kehler et al. 2004, Ponzetto & Strube 2006) and synonymy information and encyclopedic knowledge for interpreting nominals (Vieira & Poesio 2000, Ponzetto & Strube 2006). Another line of research focused on leveraging existing knowledge bases, such as WordNet, FrameNet, and Wikipedia (Vieira & Poesio 2000, Ponzetto & Strube 2006). One of the best-known models of this type, by Ponzetto & Strube (2006), used WordNet for lexical synonymy information, used FrameNet for selectional restrictions, and pioneered the use of Wikipedia for encyclopedic knowledge. These resources were further exploited in many models using extended feature sets, such as those of Daume & Marcu (2005), Bengtson & Roth (2008), Rahman & Ng (2011), and Durrett & Klein (2013). In yet another line of research, distributional semantics was used to acquire semantic information from corpora (Versley et al. 2016). Several analyses of the effectiveness of lexical and commonsense knowledge for anaphora resolution (Durrett & Klein 2013, Versley et al. 2016) found the results disappointing, but as discussed below, much better results were obtained later using contextual embeddings to encode such knowledge in neural approaches.

Joint inference.
Several interpretative tasks that affect anaphora resolution are best carried out jointly with it. One example is anaphoricity detection (Poesio & Vieira 1998, Ng & Cardie 2002a, Denis & Baldridge 2007, Uryupina et al. 2016). Another example is mention detection: Daume & Marcu (2005), for instance, showed that this task, too, is best performed jointly with anaphora resolution - a finding that lies at the core of the "end-to-end" neural model currently dominating anaphora resolution and discussed in Section 6 (Lee et al. 2017). The realization that many such tasks are best performed jointly led to numerous models adopting joint inference architectures such as the Integer Linear Programming (ilp) model (Rizzolo & Roth 2016), a form of constraint programming in which constraints can be imposed on variables modeling the outcomes of separate classifiers (e.g., for coreference and anaphoricity detection). The use of ilp for coreference was popularized by Denis & Baldridge (2007), who applied the framework to joint anaphoricity detection and anaphora resolution, but the approach has since been widely applied in anaphora resolution (Iida & Poesio 2011).

Graph- and tree-based architectures.
Many of the most successful statistical models for the conll 2012 data set were based on a formulation of coreference resolution in terms of an underlying graph structure whose nodes are the mentions in a document. Two families of methods can be identified. Nicolae & Nicolae (2006) used a graph structure in which the edges between the nodes encode degrees of semantic compatibility or incompatibility between the mentions. A second line of research involves growing mention trees for a document, where attachment to a branch of a tree indicates coreference. Fernandes et al. (2014), who developed the top-performing system at the conll 2012 shared task, formulated coreference resolution as the problem of recovering a latent coreference tree for a document, encoding the most likely coreference relations. Martschat & Strube (2015) argued that several popular architectures for coreference - the mention-pair model, the mention-ranking model, and the latent coreference trees model - could in fact be viewed as predicting different types of latent structures, and they developed a unified framework for training such models by using the latent structure perceptron algorithm.

Identity Anaphora in Languages Other than English
Research on zero anaphora resolution played a key role in early computational work on anaphora - for instance, in the development of Centering (Kameyama 1985). Zero anaphora resolution has remained an active area of study for Japanese because of the prevalence of zeros in the language and the availability of the naist corpus (Iida et al. 2007, Sasano et al. 2009). But the release of OntoNotes spurred much research on zero pronoun anaphora in Chinese (Chen & Ng 2016) and Arabic (Aloraini & Poesio 2020) as well. A noteworthy characteristic of work on zero anaphora is that many proposals are multilingual (Iida & Poesio 2011, Aloraini & Poesio 2020); this is still sadly rare in the field, notwithstanding the availability of a number of multilingual data sets.

Most topics discussed in this section are covered in greater detail by chapters in Poesio et al. (2016b). For early and heuristic models, readers are referred to Poesio et al. (2016c) (and to Mitkov 2002 for more in-depth coverage). The mention-pair model with its variants is discussed in Hoste (2016). More advanced models, including the entity-mention model, are discussed in Ng (2016).

EVALUATION
One of the fundamental issues in anaphora resolution is that although the field has converged on an "official" metric that has driven progress for the last 10 years, it is far from clear that this metric captures our intuitions about how anaphoric interpretation should be evaluated - or indeed, what these intuitions are. This is not to say that the field is completely divided. For instance, it is universally accepted that evaluation should be entity based instead of mention based, in the sense that a system's interpretation of example 9 should be evaluated on the extent to which it recognizes that mentions 1, 2, and 3 are all mentions of the same entity, as opposed to merely its ability to link 2 to 1 and 3 to 2.

(9) [Mary]_1 woke up late that morning, so [she]_2 rushed out of bed - [she]_3 had an important meeting.
For this reason, ever since the first muc shared task, precision, recall, and F value for anaphora resolution have been used to assess a system's ability to identify the entire set of mentions of an entity (aka the coreference chain). However, agreement on this point still leaves many degrees of freedom. As a result, different ways have been proposed to compare the coreference chains in the gold annotation (in anaphora resolution, generally known as the "key") with those produced by a system (known as the "response"), and no consensus has been reached on which metric is most appropriate. This impasse was broken by Denis & Baldridge (2007), who introduced a measure based on muc, b³, and the Constrained Entity-Aligned F-Measure (ceaf) that was adopted in the conll 2011 and 2012 shared tasks and has since become standard.

A Link-Based Metric: The MUC Score
The muc official scorer (Vilain et al. 1995) introduced a link-based metric. A link-based metric measures the extent to which the links in the response match the links in the key. For example, recall is computed by summing up the correctly recalled links for each coreference chain in the key and then dividing by the total number of correct links in the key. The number of missing links - the links found in the key entities but not in the response entities - is computed by counting the number of partitions of each key entity induced by the response R, as follows:

    Recall = Σ_i (|K_i| − |P(K_i, R)|) / Σ_i (|K_i| − 1),

where P(K_i, R) is the partition function, which returns all the partitions of key entity K_i with respect to a system's response R.1 Precision is computed by summing up the correct links in each coreference chain in the response and dividing by the total number of links in the response - that is, by swapping key and response in the formula above.
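The link-based computation above can be sketched in a few lines of code. This is an illustrative reimplementation, not the official muc scorer; the function and variable names are ours, and chains are represented simply as sets of mention identifiers.

```python
def partition(entity, chains):
    """P(K_i, R): split a key entity into the parts induced by the
    response chains; mentions not found in any chain become singletons."""
    remaining = set(entity)
    parts = []
    for chain in chains:
        overlap = remaining & set(chain)
        if overlap:
            parts.append(overlap)
            remaining -= overlap
    parts.extend({m} for m in remaining)
    return parts

def muc(key, response):
    """MUC recall, precision, and F (Vilain et al. 1995) for two lists
    of coreference chains, each chain a set of mention ids."""
    def side(chains, other):
        # correctly recalled links = (|K_i| - 1) - missing links,
        # and missing links = |P(K_i, other)| - 1
        numer = sum(len(c) - len(partition(c, other)) for c in chains)
        denom = sum(len(c) - 1 for c in chains)
        return numer / denom if denom else 0.0
    r, p = side(key, response), side(response, key)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f
```

For instance, for the key chain {1, 2, 3} and the response chains {1, 2} and {3}, one of the two key links is recovered (recall 0.5) while every response link is correct (precision 1.0).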

A Mention-Based Metric: B³
One problem with the muc score is that, by definition, it only scores a system's ability to identify links between mentions; its ability to recognize that a mention does not belong to any coreference chain - that is, its ability to classify a mention as a singleton - does not get any reward.

1 The original coreference chain gets partitioned into k + 1 subsets when k links are missing: One missing link results in two coreference chains, two missing links in three coreference chains, and so forth. Notice also that only n − 1 links are required to link all the mentions in a chain of size n.

LI09CH28_Poesio ARjats.cls November 4, 2022 14:14

The b³ metric (Bagga & Baldwin 1998) was proposed to correct this problem. It does this by computing recall and precision for each mention m, even if m is a singleton. b³ computes the intersection |K_i ∩ R_j| between every coreference chain K_i in the key and every coreference chain R_j in the response and then sums up recall and precision for each pair i, j and normalizes. In turn, recall and precision for i, j are computed by summing up recall and precision for each mention m in K_i ∩ R_j. For instance, recall for m is the ratio between |K_i ∩ R_j| and |K_i|; precision for m is the ratio between |K_i ∩ R_j| and |R_j|.
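A minimal sketch of this mention-level computation follows; it assumes the same mentions appear on both sides (in practice, singleton chains are added for any unresolved mentions first), and the function name is ours.

```python
def b_cubed(key, response):
    """B-cubed recall and precision (Bagga & Baldwin 1998): each mention
    is scored individually, so singletons get credit too."""
    def score(chains, other):
        total, n = 0.0, 0
        for c in chains:
            for o in other:
                overlap = len(set(c) & set(o))
                # each of the `overlap` mentions contributes overlap / |c|
                total += overlap * overlap / len(c)
            n += len(c)
        return total / n
    return score(key, response), score(response, key)
```

The sketch also reproduces the anomaly discussed in the next subsection: merging all key chains into a single response chain yields a b³ recall of 1.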

An Entity-Based Metric: CEAF
b³ also suffers from a problem - namely, that a single chain in the key or response can be credited several times. This leads to anomalies; for instance, if all coreference chains in the key are merged into one in the response, the b³ recall is one. The ceaf metric was proposed by Luo (2005) to correct this problem. The key idea of ceaf is to align chains (entities) in the key and response using a map g, in such a way that each chain K_i in the key is aligned with only one chain g(K_i) in the response, and to then use the similarity φ(K_i, g(K_i)) to compute recall and precision. Because different maps are possible, the one that achieves optimal total similarity is used.

The CONLL Metric and the CONLL Scorer
After a few years in which different proposals were often difficult to compare because various researchers favored different metrics, Denis & Baldridge (2007) proposed simply using the average of the F values obtained using muc, b³, and ceaf. This was the score used in the conll shared tasks in 2011 and 2012 (Pradhan et al. 2012). Since then, the reference scorer, which computes this metric and takes into account a few issues that emerged later (Pradhan et al. 2014), has become the standard scorer for the field (https://github.com/conll/reference-coreference-scorers).
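The combination itself is straightforward: the conll score is the unweighted mean of the three F values (a sketch, with function names of our choosing).

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def conll_score(muc_pr, b3_pr, ceaf_pr):
    """CoNLL score: the unweighted mean of the MUC, B-cubed, and CEAF
    F values, as used in the conll 2011/2012 shared tasks.
    Each argument is a (precision, recall) pair."""
    return sum(f1(p, r) for p, r in (muc_pr, b3_pr, ceaf_pr)) / 3.0
```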

The Current Practice of Evaluation in Anaphoric Reference
Although the field has now developed a unified approach to evaluation, the current practice of taking the average of three metrics cannot be considered entirely satisfactory, which is why new metrics are still being introduced every few years (for review, see Luo & Pradhan 2016, Yu et al. 2022). This aspect of current practice could certainly benefit from a reanalysis of which, if any, among the current metrics best captures linguistic intuitions or at least is best suited for practical applications (Barbu & Mitkov 2001).
There is also a need to move beyond simple identity anaphora. Because the conll reference scorer only scores identity reference, extended scorers were developed for the crac 2018 shared task (Poesio et al. 2018) and the codi/crac 2021 shared task on anaphora in dialogue (Khosla et al. 2021). The scorer for the latter (Yu et al. 2022) is compatible with the old conll reference scorer for identity coreference but also scores the identification of singletons, nonreferring expressions, split-antecedent anaphora, bridging references, and discourse deixis; it is the official scorer for the Universal Anaphora initiative (https://github.com/juntaoy/universal-anaphora-scorer). However, the discussion on how to evaluate these other types of anaphoric reference has only begun.

For an extensive discussion of the main evaluation metrics in coreference, and several examples explaining their working in more detail, readers are referred to Luo & Pradhan (2016); for a discussion of coreference shared tasks, readers may consult Recasens & Pradhan (2016).

Neural Networks for Coreference and the End2End Model
The paper by Wiseman et al. (2015) marked the start of the most recent shift in computational models of anaphora, from the statistical models discussed in Section 4.3 to models using neural networks to learn nonlinear functions of the input. From that point on, every improvement of the state of the art has been achieved by neural models (Lee et al. 2017, 2018; Joshi et al. 2019, 2020; Kantor & Globerson 2019; Yu et al. 2020c).
6.1.1. Embeddings. One important characteristic common to all neural models of coreference resolution is that they take as input word embeddings (Bengio et al. 2003). As discussed in the previous sections, for many years computational linguists tried to attain better lexical semantic representations by developing distributional semantics methods for learning them from corpora, but with disappointing results (Versley et al. 2016). One reason for the success of neural network models after 2010 was the emergence of a much more effective type of lexical representation, word embeddings: continuous representations learned in an unsupervised way by neural language models (Mikolov et al. 2013).
6.1.2. The End2End model. The paradigmatic neural architecture for anaphora resolution - the deep learning equivalent of the Soon et al. (2001) model - is the End2End (E2E) model proposed by Lee et al. (2017). The E2E model is a mention-pair model, but it has three key characteristics that mark a radical departure from the statistical models discussed in Section 4.3. First, as the name suggests, mention detection and antecedent identification are carried out jointly. The advantages of carrying out these tasks jointly had already been demonstrated by Daume & Marcu (2005); Lee et al. and subsequent researchers such as Yu et al. (2020a) provided conclusive evidence that with OntoNotes-size data sets, carrying out these two tasks jointly is the optimal solution. Like all neural models for anaphora resolution, the E2E model takes as input a sequence of word embeddings x_i instead of a simple bag of words. The model considers all possible spans of these words and computes a span representation - a candidate mention representation - for each. Pairs of these span representations form the mention pairs classified by the model. The second important characteristic of the model is the span representation itself: how it is computed, and the notion of "headedness" it uses. A neural network - a bidirectional lstm - is used to compute a word representation x*_{s,i} for each word x_{s,i} in span s, and an attention layer is then used to assign a relative weight α_{s,i} to each word, from which a weighted representation x̂_s is then computed for the whole span. The span representation g_s is then specified as a quadruple g_s = [x*_{s,START}, x*_{s,END}, x̂_s, φ_s], consisting of the word representations for the first and last words in the span, the weighted representation, and a couple of other features. This means that the model learns, in a task-specific way, (a) what is the best representation for the mention as a whole and (b) a "soft" notion of head assigning a weight to each of the words in the NP, including modifiers, determiners, and so forth. This approach is believed to address many of the difficulties identified in earlier work (e.g., how to define heads in a general way) and is one of the key reasons for the success of the E2E model.
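The attention-weighted span representation can be illustrated with a minimal pure-Python sketch. The contextualized vectors x_star (produced by a bidirectional lstm in the actual model), the attention parameter w_alpha, and the extra feature vector phi_s are stand-ins for the learned components of the real model.

```python
import math

def span_representation(x_star, w_alpha, phi_s):
    """Sketch of the Lee et al. (2017) span representation
    g_s = [x*_START ; x*_END ; x_hat_s ; phi_s].
    x_star: list of contextualized word vectors for the span;
    w_alpha: attention parameter (learned in the real model);
    phi_s: extra feature vector (e.g., a span-width embedding)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scores = [dot(x, w_alpha) for x in x_star]     # head score per word
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]                  # softmax: soft "headedness"
    d = len(x_star[0])
    x_hat = [sum(alpha[i] * x_star[i][j] for i in range(len(x_star)))
             for j in range(d)]                    # attention-weighted vector
    return x_star[0] + x_star[-1] + x_hat + phi_s  # concatenation
```

With a uniform attention parameter, every word in the span receives the same weight; training the parameter is what lets the model learn which words act as soft heads.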

Learned features.
The third crucial feature of Lee et al.'s (2017) model is that it takes only word embeddings as features. This is another clear difference from the statistical models discussed in Section 4.3.

From Static to Context-Sensitive Embeddings
Shortly after the E2E model was proposed, another major technical innovation in deep learning resulted in further substantial improvements: the development of so-called context-sensitive embeddings like elmo (Peters et al. 2018) and bert (Devlin et al. 2019). These are embeddings that, unlike earlier pretrained embeddings such as Word2Vec (Mikolov et al. 2013), assign different interpretations to words depending on the context. For many years CL researchers had tried without success to demonstrate that word-sense disambiguation in context was important (Versley et al. 2016); elmo and bert provided conclusive evidence for this. Adding elmo to the E2E model immediately resulted in an improvement of more than four percentage points over the original version of the model, from 68.6 to 73.0 conll score (Lee et al. 2018). The subsequent development of the bert model (Devlin et al. 2019) resulted in another three-percentage-point improvement (Joshi et al. 2019, Kantor & Globerson 2019). More recently still, the SpanBert approach of pretraining bert with spans of the type used in anaphora resolution resulted in another three-percentage-point improvement on OntoNotes (Joshi et al. 2020). That is, performance on OntoNotes has improved by almost 20 percentage points in the space of 5 years. Crucially, a big part of this improvement is due to the fact that these models supply much of the knowledge required by an anaphoric interpreter.

The State of the Art for Identity Reference in News
The current state of the art for anaphora resolution in news articles from OntoNotes and from arrau, in which nonreferring expressions and singletons are also annotated, is summarized in Table 2. The table reports the results on OntoNotes of (the latest version of) the two highest-performing statistical models in the conll 2012 shared task (Björkelund & Kuhn 2014, Fernandes et al. 2014) as well as the results of two statistical models that further pushed performance on that data set, followed by the best-known neural models prior to the E2E model, and then by the models using increasingly more sophisticated context-sensitive embeddings. We also provide the results on arrau of the only neural model (Yu et al. 2020c) that reported results on that data set.

BEYOND IDENTITY ANAPHORA IN NEWS
Virtually all the research discussed in Section 6 focuses on the resolution of identity reference in news text from the OntoNotes corpus.However, one of the most exciting developments of the last 5 years is that the field is moving beyond this narrow focus to cover anaphora resolution in other genres such as scientific documents or fiction, addressing the challenges raised by benchmark data sets such as the Winograd Schema Challenge, and looking at other types of anaphora, such as bridging reference and discourse deixis.
7.1. Other Genres: Scientific Documents, Dialogue

7.1.1. Biomedical texts. After news, the most researched genre in anaphora resolution is scientific articles - in particular, articles in the biomedical domain. Data sets such as genia (Yang et al. 2004) and craft (Cohen et al. 2017) have supported the development of several models for this genre and a number of shared tasks, the best known of which are those organized in connection with the bionlp workshops (Nguyen et al. 2011, Baumgartner et al. 2019). More recently, this genre has witnessed the deployment of systems using embeddings specially trained for scientific and biomedical texts (Zhang et al. 2019).

Dialogue and conversational agents.
Although some of the initial work on anaphora in CL was motivated by research on question-answering systems and task-oriented dialogue systems (e.g., Webber 1979), most research in the data-driven period has focused on written text, for lack of suitable corpora. The few exceptions typically have involved the researchers creating the necessary data sets themselves (Poesio 1994, Byron 2002, Müller 2008). The one language for which substantial corpora of anaphora in dialogue exist is French: The ancor corpus (Muzerelle et al. 2014) has enabled the development of end-to-end neural models for coreference interpretation such as that of Grobol (2020). It is hoped that the data sets created in the recent shared tasks on Anaphora Resolution in Dialogue (Khosla et al. 2021) will encourage more research in this genre.

The Winograd Schema Challenge
Computational work on the Winograd Schema Challenge can be categorized in a broadly similar way to research on anaphora resolution in news. The first computational models were statistical models such as that of Rahman & Ng (2012). More recently, however, most models for this task have been neural. The top-performing among such systems use pretrained language models such as bert (Devlin et al. 2019) - indeed, such models are often assessed using benchmarks that include the Winograd Schema Challenge as a subtask, such as glue (Wang et al. 2019). An example of a system using bert is that of Kocijan et al. (2019).

Bridging Reference Resolution
Bridging reference resolution is a popular area of research because it involves modeling both inference and salience, two of the most studied preferences in anaphoric interpretation (Sidner 1979). In work following Sidner's, the emphasis shifted to how to acquire the required knowledge, whether from lexical resources (Vieira & Poesio 2000) or from corpora (Poesio et al. 2004a, Markert & Nissim 2005). Also, whereas early work on bridging resolution mostly focused on bridging reference via definite nominals (Sidner 1979, Vieira & Poesio 2000), later systems covered all types of bridging references (Poesio et al. 2004a, Hou et al. 2018, Roesiger et al. 2018, Yu & Poesio 2020). The work of Hou et al. (2018) represents the current state of the art on full bridging resolution, but it was evaluated only on isnotes. The first neural model for bridging reference resolution was proposed by Yu & Poesio (2020).

Discourse deixis.
There has not been a lot of work on discourse deixis resolution. The first implemented anaphora resolution system resolving discourse deixis is Byron's (2002) phora rule-based algorithm for pronoun interpretation in dialogue. The first machine learning-based models for pronoun resolution covering both anaphora and discourse deixis were proposed by Müller (2008). Kolhatkar et al. (2013) concentrated on resolving definite nominals containing what they called shell nouns: nouns like issue, which have a preferential abstract interpretation. A key innovation from Kolhatkar et al. (2013) was the use of large amounts of synthetically created training data to alleviate data sparsity, which is the main problem with using machine learning to develop models for discourse deixis. This approach to data creation, which was further developed by Marasović et al. (2017), is receiving increasing attention for rare aspects of anaphora resolution.

Split-antecedent plurals.
Early research on split-antecedent anaphora (Eschenbach et al. 1989, Kamp & Reyle 1993) mostly focused on the constraints on the construction of complex entities from singular entities. More recent studies, such as that of Vala et al. (2016), focused on a subset of the problem. The first neural system for resolving split-antecedent anaphora expressed by both pronouns and other types of NPs was developed by Yu et al. (2020b), testing on arrau. Yu et al. (2021) proposed the first neural system resolving both single- and split-antecedent anaphora and not requiring gold mention input.

Cohen et al. (2017) provide a useful survey not only of existing data sets for coreference resolution in biomedical texts but also of proposed models for the genre. The literature on tackling the Winograd Schema Challenge is systematically surveyed by Kocijan et al. (2020). The recent article by Kobayashi & Ng (2020) reviews the literature on bridging, whereas the literature on discourse deixis is systematically covered by Kolhatkar et al. (2018).

SUMMARY POINTS
Even if the muc corpora and other data sets that became available at the time were fairly small, they enabled the development of the first anaphora resolution models employing machine learning methods, including the mention-pair model (Aone & Bennett 1995, Vieira & Poesio 2000).

(1) […]_i mother trying to shield [him]_i from [[his]_i father's]_j excesses. "[[Your]_i father]_j doesn't mean it," she would console [him]_i. "[He]_j loves [you]_i, [he]_j's a good man." And for years [he]_i thought she was making excuses. "But she wasn't. [He]_j is a good man." Just a product of [his]_j time. (Karttunen 1976)

2.1.3. The semantic function of noun phrases. Referring NPs introduce new entities in a discourse or link to previously introduced entities; examples include the references to Maupin in example 1. The items annotated in anaphoric corpora tend to be a subset of referring NPs. But other types of NPs also exist. Quantificational NPs such as No one in No one would put the blame on him/herself (Partee 1972) do not refer to an individual or set of individuals but can still participate in anaphoric relations, even though anaphoric reference to quantifiers has distinctive properties (Partee 1972) and is subject to semantic constraints (Karttunen 1976). Predicative NPs express properties of objects: For instance, in the clause He is a good man in example 1, the NP a good man does not introduce a new discourse entity or refer back to an existing discourse entity but instead expresses a property of Maupin's father. Finally, in languages like English, forms like it and there can also be used to express semantically vacuous expletives as well as pronouns, as in It is half

Main anaphoric data sets in use today. Other genres for which substantial data sets have become available include encyclopedic texts, particularly from Wikipedia (covered, e.g., in the Phrase Detectives corpus; Poesio et al. 2019), and fiction and literary texts (covered, e.g., in Phrase Detectives and litbank; Bamman et al. 2020). Table 1 summarizes the anaphoric data sets most widely used today. Only corpora of at least 300,000 tokens are listed in Table 1, with the exception of gum, isnotes (Markert et al. 2012), and litbank, which are widely used. For each data set, Table 1 lists the language; the genre(s); the size in tokens; whether multiple levels of annotation are included (Treebanking); and which definition of coreference is used. An extensive discussion of anaphoric data sets can be found in a book chapter by Poesio et al. (2016a), but it does not cover data sets released since 2015 (for those, see Nedoluzhko et al. 2021).

Like other deep learning models, the E2E model (Lee et al. 2017, Yu et al. 2020c) is able to learn by itself almost all the linguistic generalizations required by anaphora resolution directly from the data, without the kind of feature engineering required even by statistical models. Another direction of research aimed at improving the E2E model focuses on using cluster ranking instead of mention ranking (Lee et al. 2018, Kantor & Globerson 2019, Yu et al. 2020c). For instance, the entity equalization model by Kantor & Globerson (2019) resulted in significant improvements through its method for building cluster representations out of mention representations. The cluster ranking model by Yu et al. (2020c) also achieved significant improvements and is notable as the only model discussed in this section to also carry out nonreferring expression and singleton identification.