The Rational Speech Act Framework

The past decade has seen the rapid development of a new approach to pragmatics that attempts to integrate insights from formal and experimental semantics and pragmatics, psycholinguistics, and computational cognitive science in the study of meaning: probabilistic pragmatics. The most influential probabilistic approach to pragmatics is the Rational Speech Act (RSA) framework. In this review, I demonstrate the basic mechanics and commitments of RSA as well as some of its standard extensions, highlighting the key features that have led to its success in accounting for a wide variety of pragmatic phenomena. Fundamentally, it treats language as probabilistic, informativeness as gradient, alternatives as context-dependent, and subjective prior beliefs (world knowledge) as a crucial facet of interpretation. It also provides an integrated account of the link between production and interpretation. I highlight key challenges for RSA, which include scalability, the treatment of the boundedness of cognition, and the incremental and compositional nature of language.


INTRODUCTION
The past decade has seen the rapid development of a new approach to pragmatics that attempts to integrate insights from formal and experimental semantics and pragmatics, psycholinguistics, and computational cognitive science in the study of meaning: probabilistic pragmatics. The key advance in this area has been to provide a formal framework within which to implement hypotheses about how speakers (more generally, producers of language) contextually choose between utterance alternatives and how listeners (more generally, interpreters of language) contextually arrive at interpretations of observed utterances. Crucial features of the framework include the following: (a) It provides a formalization of long-recognized but elusive-to-formalize general principles of conversation (e.g., that speakers tend to be relevant, brief, and otherwise helpfully informative), which listeners in turn take into account in interpretation (Grice 1975, Sperber & Wilson 1986); (b) it treats language production and interpretation as fundamentally probabilistic processes that are subject to the same principles of boundedly rational information integration as processes in other domains of cognition, perception, and action (Tenenbaum et al. 2011); and (c) it provides a principled way for linguistic knowledge to interact with communicative pressures and subjective beliefs about the world,¹ identified in psycholinguistics as an important factor in modulating both incremental language processing and global utterance interpretation (Chambers et al. 2004, Warren & McConnell 2007, Winograd 1972).
Probabilistic pragmatics is thus engaged in bridging the language-as-product and language-as-action traditions (Clark 1992): It relies on our best guesses about the syntactic and semantic representations of words, phrases, and sentences (language-as-product) and provides a theory of how agents embedded in a social, communicative context with particular goals and background beliefs should make decisions about the use of those linguistic units (language-as-action). Methodologically, probabilistic pragmatics is strongly computational (models are implemented as computer programs) and data-driven (models are tested against and revised in light of empirical data, e.g., experimental or corpus data).
RSA is simultaneously a theory undergoing constant incremental revision and a tool for explicit formalization of competing hypotheses that can be tested against data. It is in this sense that RSA is a framework (or research program; see Lakatos 1970) for investigating pragmatics. In this article, I demonstrate the basic mechanics and commitments of RSA as well as some of its standard extensions, using the well-worn example of scalar implicature.² I end by highlighting current limitations and future directions.

THE CLASSIC VIEW OF MEANING
The classic view of semantics and pragmatics encompasses accounts that differ in the phenomena they seek to explain, but crucially share the feature that meaning is treated as categorical. For example, implicatures are either computed or not; presuppositions do or do not project; the interpretation of gradable adjectives, uncertainty expressions, and generics is taken to make use of categorical thresholds. Gradience is eschewed. Apparent gradience in interpretation is typically accommodated by postulating exceptions or by implicating processing constraints.
To illustrate, consider the example of scalar implicature. The classically assumed Gricean reasoning about a listener taking a sentence like example 1 to implicate the sentence in example 2 makes reference to a stronger alternative utterance (shown in example 3) that the speaker could have produced but chose not to.
(1) Alex ate some of the cookies.
(2) Alex ate some, but not all, of the cookies.
(3) Alex ate all of the cookies.
In particular, the reasoning is typically assumed to go as follows (Grice 1975):
1. Premise: The speaker uttered the weaker sentence with some instead of the stronger alternative with all, which would have been relevant and more informative.
2. Premise: If the speaker knew that the stronger alternative was true, they would have uttered it.
3. Premise: The speaker is well-informed with respect to the truth of the stronger alternative.
4. Conclusion: Thus, the stronger sentence must be false.
The second premise is the result of the pressure to be as informative as possible. The third premise is called the Competence Assumption or Epistemic Step (for discussion, see Breheny et al. 2013, Horn 1972, Russell 2006, Sauerland 2004, van Rooij & Schulz 2004).
Under the classic view, an implicature is a categorical phenomenon: If the premises are true, it arises; if at least one is not, it does not. There is no space for an implicature to be more or less likely to arise or for the listener to be more or less certain about whether Alex ate all of the cookies after observing an utterance of example 1. Similarly, the truth of the premises is typically treated as a categorical matter: Either the stronger alternative is relevant or not, either it is more informative than the observed utterance or not, and either the speaker is well informed with respect to the stronger alternative or not. Moreover, the speaker is categorically expected to produce the stronger alternative if they know it to be true. And finally, informativeness is defined in terms of a categorical ordering relation between alternatives (Fox & Katzir 2011, Gotzner & Romoli 2022, Hirschberg 1985, Horn 1972): A stronger alternative like example 3 is categorically more informative than its weaker counterpart in example 1. There is no contextual modulation of the notion of informativeness, nor is world knowledge (that is, prior beliefs about likely meanings) assumed to play a role in the reasoning process. Each of these pieces is treated differently under the RSA perspective. To see how, let us work through the basic mechanics of RSA and apply it to the case of scalar implicature.

The Main Ingredients
Like other probabilistic pragmatics accounts, RSA treats language use as an instance of a signaling game (Lewis 1969). Speakers and listeners are modeled as reasoning about an explicitly defined set of utterances $U$ and space of possible meanings $M$.³ RSA models contain a semantic foundation in the form of denotation functions (standardly taken to be mutually known between interlocutors) associated with the different utterance choices under consideration, $[\![\cdot]\!]: U \to M$. These return the set of meanings literally compatible with an utterance. Based on this semantic foundation, recursive probabilistic production and interpretation rules are formulated. Recursion may begin with literal production or literal interpretation and may proceed to any depth (for overviews, see Franke & Jäger 2016, Goodman & Frank 2016). In the following, I assume basic familiarity with probability theory (for an introduction to probability theory in semantics and pragmatics, see Erk 2022).
In the basic RSA model, a literal listener forms the basis of the recursive reasoning process, capturing interpretation choices of a listener who interprets utterances according to their literal semantics. A pragmatic speaker reasons about this literal listener, choosing utterances that balance utterance informativeness and utterance cost. Intuitively, an utterance is informative if it increases the chance that the literal listener would correctly infer the intended meaning. Utterance cost is a more abstract notion and (depending on the phenomenon) may capture retrieval difficulty, complexity (phonetic, phonological, morphological, or syntactic), and/or other meaning-independent factors that make the utterance more costly to produce. Finally, a pragmatic listener is treated as reverse-engineering the speaker's most likely intended meaning based on both their prior subjective beliefs about likely intended meanings (independent of the observation of a particular utterance) and their expectations about the pragmatic speaker's likely production choices under different possible meanings they might want to communicate.

Formal Characterization and Application to Scalar Implicature
I begin by characterizing the assumptions of the basic scalar implicature game, before walking through the individual reasoning components that comprise the full RSA model.

The scalar implicature game.
To apply RSA to the cookie scenario in example 1, we must specify a space of meanings and utterances, as well as the literal semantics of utterances (see the sidebar titled Basic Scalar Implicature Game). Let us assume there are 4 cookies in context that Alex may have eaten, resulting in 5 different possible world states or meanings a speaker may want to communicate, corresponding to the cases where Alex ate 0, 1, ..., 4 cookies.⁴ Let us further assume a minimal set of available utterances, corresponding to the classically assumed scalar alternatives in examples 1 and 3, and the alternative Alex ate none of the cookies.⁵ The literal semantics for each $u$ in $U$ is defined extensionally according to the standard truth-conditional semantics of the quantifiers all, some, and none. Finally, we must define prior beliefs about meanings. We will revisit this assumption below, but for now we may assume a uniform distribution over meanings, which means that a priori (before observing language) interlocutors expect Alex to be equally likely to eat 0, 1, 2, etc., cookies (see Figure 1a). These probabilities capture the listener's subjective beliefs about the world before observing language, rather than objective probabilities about the world.

BASIC SCALAR IMPLICATURE GAME
Meaning space: $M = \{m_0, m_1, m_2, m_3, m_4\}$
Utterance space: $U = \{u_{\text{all}}, u_{\text{some}}, u_{\text{none}}\}$
Semantics: $[\![u_{\text{all}}]\!] = \{m_4\}$, $[\![u_{\text{some}}]\!] = \{m_1, m_2, m_3, m_4\}$, $[\![u_{\text{none}}]\!] = \{m_0\}$
Prior beliefs: $P(m_0) = P(m_1) = P(m_2) = P(m_3) = P(m_4) = 0.2$

The literal listener.
The literal listener is characterized by an interpretation rule $P_{L_0}$:

$$P_{L_0}(m \mid u) \propto \delta_{m \in [\![u]\!]} \cdot P(m) \tag{1}$$

Here, $\delta_{m \in [\![u]\!]}$ is the delta function, which returns 1 if $m$ is in the extension of $u$, and 0 otherwise. The literal listener thus returns the result of updating prior beliefs about likely meanings $P(m)$ with $[\![u]\!]$. This provides a way of encoding the Gricean Quality maxim: Utterances that are literally false are simply not considered by the pragmatic speaker.⁶,⁷,⁸ Applied to our example, the literal listener output is obtained by applying Equation 1 to each of the three utterance alternatives, resulting in the distributions shown in Figure 1b. The result is that $u_{\text{all}}$ and $u_{\text{none}}$ each rule out all but one state of the world, whereas $u_{\text{some}}$ rules out only $m_0$.
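The literal listener rule can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the RSA literature; the names `MEANINGS`, `SEMANTICS`, `PRIOR`, and `literal_listener` are hypothetical, and meanings are encoded simply as the number of cookies eaten.

```python
# Basic scalar implicature game with 4 cookies (names are illustrative).
MEANINGS = [0, 1, 2, 3, 4]  # number of cookies Alex ate
SEMANTICS = {"all": {4}, "some": {1, 2, 3, 4}, "none": {0}}
PRIOR = {m: 0.2 for m in MEANINGS}  # uniform prior over meanings


def literal_listener(u, prior=PRIOR, semantics=SEMANTICS):
    """P_L0(m | u): keep prior mass only on meanings in the extension of u, then renormalize."""
    scores = {m: (p if m in semantics[u] else 0.0) for m, p in prior.items()}
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}


for u in SEMANTICS:
    print(u, {m: round(p, 2) for m, p in literal_listener(u).items()})
```

Under the uniform prior, the distribution for "some" should place 0.25 on each of the four non-empty states, while "all" and "none" each concentrate all probability on a single state.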

The pragmatic speaker.
Reasoning about the literal listener informs the pragmatic speaker rule $P_{S_1}$:

$$P_{S_1}(u \mid m) \propto \exp(\alpha \cdot U(u; m)) \tag{2}$$

This rule defines the speaker's production choice $u$ as softmax optimizing $u$'s utility for communicating $m$, $U(u; m)$.⁹ An utterance's utility $U(u; m)$ is defined as a trade-off between the utterance's informativeness as characterized by $P_{L_0}(m \mid u)$ (how likely it is that a literal listener will correctly infer $m$ from $u$'s literal semantics alone) and its cost $c(u)$:

$$U(u; m) = \ln P_{L_0}(m \mid u) - c(u) \tag{3}$$

The informativeness term captures the spirit of the Gricean Quantity maxims (and even Relation; more on this below): The greater the literal listener's belief in the speaker's intended $m$ is after observing $u$, the more informative $u$ is, and hence the higher its utility.¹⁰ The cost term captures the spirit of part of the Gricean Manner maxim: the cheaper (e.g., shorter) the utterance, the better. Thus, given two equally informative utterances, the pragmatic speaker prefers the less costly one. Similarly, given two equally costly utterances, the pragmatic speaker prefers the more informative one.
Equation 2 also includes a utility-scaling parameter α, which governs the extent to which the speaker is a utility-maximizing agent. As α goes to infinity, the agent ceases to choose utterances probabilistically and instead deterministically chooses the utility-maximizing utterance. If α is 0, the speaker randomly produces utterances. If α is 1, the speaker produces utterances in proportion to their exponentiated utility, a probability-matching regime (Luce 1959).
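As a rough worked illustration of these three regimes, the following sketch assumes zero utterance cost, so that the utility reduces to the log literal-listener probability, and restricts attention to utterances with finite utility (i.e., those literally true of $m$). The symbol $u^*$ for the utility-maximizing utterance is introduced here only for illustration.

```latex
\begin{align*}
\alpha = 0:\quad & P_{S_1}(u \mid m) \propto \exp(0) = 1
  && \text{(uniform over utterances with finite utility)} \\
\alpha = 1:\quad & P_{S_1}(u \mid m) \propto \exp\bigl(\ln P_{L_0}(m \mid u)\bigr) = P_{L_0}(m \mid u)
  && \text{(probability matching)} \\
\alpha \to \infty:\quad & P_{S_1}(u^* \mid m) \to 1, \quad u^* = \arg\max_{u} U(u; m)
  && \text{(assuming a unique maximizer)}
\end{align*}
```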
To compute pragmatic speaker probabilities in our cookie scenario, we must set a value for α and define the cost of utterances. For simplicity, and to demonstrate the purely informativeness-driven speaker under the probability-matching regime, we begin by setting cost to 0 and α to 1.
Figure 2: Pragmatic speaker probability of using $u_{\text{some}}$ or $u_{\text{all}}$ to refer to $m_4$ under varying α, derived from the literal listener in Figure 1b.
The resulting production probabilities are shown in Figure 1c. The pragmatic speaker who wants to communicate meaning $m_0$ or $m_1$–$m_3$ has no choice but to produce $u_{\text{none}}$ and $u_{\text{some}}$, respectively. The interesting situation is the one in which the speaker intends to communicate $m_4$. Here, there are two utterances that have a nonzero probability of allowing the literal listener to correctly infer the intended meaning: $u_{\text{some}}$ and $u_{\text{all}}$. However, $u_{\text{all}}$ is four times as likely (probability 1) to lead to the correct inference as $u_{\text{some}}$ (probability 0.25). The pragmatic speaker with α = 1 is hence four times as likely to produce $u_{\text{all}}$ (with a probability of 0.8) as $u_{\text{some}}$ (with a probability of 0.2). The effect of increasing α is for probabilities to become more extreme (i.e., for the speaker to become more likely to produce $u_{\text{all}}$; see Figure 2).
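The speaker computation described here can be sketched as follows. This is a minimal, self-contained illustration under the stated assumptions (uniform prior, zero cost by default); the function and variable names are hypothetical, and the α values in the loop are arbitrary choices for demonstration.

```python
from math import exp, log

MEANINGS = [0, 1, 2, 3, 4]  # number of cookies Alex ate
SEMANTICS = {"all": {4}, "some": {1, 2, 3, 4}, "none": {0}}
PRIOR = {m: 0.2 for m in MEANINGS}
COST = {u: 0.0 for u in SEMANTICS}  # zero cost for every utterance


def literal_listener(u):
    """P_L0(m | u): prior restricted to the extension of u, renormalized."""
    scores = {m: (PRIOR[m] if m in SEMANTICS[u] else 0.0) for m in MEANINGS}
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}


def pragmatic_speaker(m, alpha=1.0):
    """P_S1(u | m), proportional to exp(alpha * (ln P_L0(m | u) - cost(u)))."""
    scores = {}
    for u in SEMANTICS:
        p = literal_listener(u)[m]
        # ln(0) is minus infinity, so literally false utterances get zero weight.
        scores[u] = exp(alpha * (log(p) - COST[u])) if p > 0 else 0.0
    total = sum(scores.values())
    return {u: s / total for u, s in scores.items()}


for alpha in (1, 2, 4):  # illustrative values only
    probs = pragmatic_speaker(4, alpha=alpha)
    print(alpha, {u: round(p, 3) for u, p in probs.items()})
```

With α = 1 the sketch should reproduce the 0.8 versus 0.2 split for $m_4$ discussed above, and larger values of α should shift further probability toward "all".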

The pragmatic listener.
Finally, the pragmatic listener is characterized by interpretation rule $P_{L_1}$:¹¹

$$P_{L_1}(m \mid u) \propto P_{S_1}(u \mid m) \cdot P(m) \tag{4}$$

That is, the pragmatic listener uses Bayes' rule¹² to infer the most likely meaning $m$ by drawing on their generative model of the speaker and their prior beliefs about $m$.¹³

We are now in a position to assess whether and how the classically observed scalar implicature emerges in our cookie scenario: Since both $u_{\text{none}}$ and $u_{\text{all}}$ are literally compatible with only a single meaning, observing these does not result in a change compared with the literal listener. However, observing $u_{\text{some}}$ does. The result of pragmatic reasoning about alternatives within RSA (that is, of computing the interpretation probabilities for the pragmatic listener who observes $u_{\text{some}}$) is shown in Figure 1d. Informally, the contribution of pragmatics is captured in the difference in beliefs between the literal and the pragmatic listener. In this case, compared to the literal listener for whom each $m_i$ with $i > 0$ has an equal probability of 0.25, the pragmatic listener's posterior belief in $m_4$ is lower, and the belief in each of $m_1$–$m_3$ is correspondingly higher.

¹¹ The full equation is
$$P_{L_1}(m \mid u) = \frac{P_{S_1}(u \mid m) \cdot P(m)}{\sum_{m'} P_{S_1}(u \mid m') \cdot P(m')}$$
The formula is often written with proportionality and the denominator dropped, since in the comparison of two meanings $m_1$ and $m_2$, the observed utterance $u$ and hence $P(u)$ remain constant in the comparison.
¹² For an introduction to Bayes' rule in semantics and pragmatics, see Erk (2022).
¹³ I use the term generative model here in its cognitive science guise: as denoting a causal model of observed statistical data (Clark 2013). In this case, the observed statistical data are probabilistic production choices.
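The full reasoning chain, from literal listener to pragmatic speaker to pragmatic listener, can be sketched as below. This is an illustrative implementation assuming α = 1, zero cost, and the uniform prior; all names are hypothetical.

```python
MEANINGS = [0, 1, 2, 3, 4]  # number of cookies Alex ate
SEMANTICS = {"all": {4}, "some": {1, 2, 3, 4}, "none": {0}}
PRIOR = {m: 0.2 for m in MEANINGS}


def normalize(scores):
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(u):
    """P_L0(m | u): prior restricted to the extension of u."""
    return normalize({m: (PRIOR[m] if m in SEMANTICS[u] else 0.0) for m in MEANINGS})


def pragmatic_speaker(m, alpha=1.0):
    """P_S1(u | m) with zero cost: exp(alpha * ln p) equals p ** alpha."""
    return normalize({u: literal_listener(u)[m] ** alpha for u in SEMANTICS})


def pragmatic_listener(u, alpha=1.0):
    """P_L1(m | u): Bayes' rule over the speaker model and the prior."""
    return normalize({m: pragmatic_speaker(m, alpha)[u] * PRIOR[m] for m in MEANINGS})


print({m: round(p, 4) for m, p in pragmatic_listener("some").items()})
```

Under these assumptions the posterior for $m_4$ after "some" should come out at 0.0625, compared with 0.25 for the literal listener, with 0.3125 on each of $m_1$–$m_3$.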

RSA VERSUS THE CLASSIC VIEW
RSA retains some features of the classic view, including reasoning about alternatives. The outcome (for instance, for scalar implicature, that the listener is, ceteris paribus, more likely than not to believe the negation of the stronger alternative) is also qualitatively similar. The main differences include that in RSA, interpretation is treated as probabilistic rather than categorical, listeners' production expectations are probabilistic rather than deterministic, the notion of informativeness is gradient rather than categorical, alternatives may be dynamic and context-dependent rather than static and lexicalized, and prior beliefs about likely meanings play an important and systematic role in interpretation. Table 1 provides an overview of these core differences between accounts.

Language as Fundamentally Probabilistic
One of the crucial ingredients explaining the success that RSA has had in capturing such a wide range of phenomena is its basic assumption that language production and interpretation are inherently probabilistic interlinked processes that give rise to variable choices. This is an important advance over standard semantic and pragmatic theories, which cannot account for the widespread interpretational variability observed across experimental tasks and contexts and do not spell out the relation between production and interpretation.
In the case of scalar implicature, the difference between the classic and RSA views is as follows: Rather than inferring that the speaker categorically intended to convey the negation of the stronger alternative (i.e., assigning probability 0 to $m_4$; classic view), the pragmatic listener is characterized by a posterior probability distribution over meanings, which captures the listener's uncertainty about the speaker's intended meaning.
One of RSA's great strengths is not only that it predicts empirically observed variability in production and interpretation but that, by virtue of its computational and data-oriented nature, specifying explicit linking hypotheses from model quantities (e.g., pragmatic speaker production probabilities, pragmatic listener interpretation probabilities) to empirical measures is part and parcel of the modeling enterprise. For instance, Waldon & Degen (2020) link truth value judgments to RSA speaker probabilities: the greater the RSA speaker probability of producing an utterance observed in the experiment, the more likely participants should be to respond "true" (in a two-alternative forced-choice task). They also provide a generalization of this linking function to more than two alternatives. In contrast, Potts et al. (2016) link truth value judgments to the pragmatic listener. How to best model truth value judgments (and indeed, whether to think of them underlyingly as a measure of interpretation or production) is an open question (see also discussion in Jasbi et al. 2019). This is true for most commonly used measures in experimental semantics and pragmatics. While there are currently few established linking conventions, an important advance supported by RSA is that linking functions have become an active area of research in pragmatics (for discussion of linking functions in pragmatics, see Chemla & Singh 2014, Scontras & Pearl 2021, Waldon & Degen 2020). There is still much work to be done in validating linking functions. This is not an RSA-internal problem but rather a problem for any subfield of linguistics that aims to test theories experimentally. In the absence of clear and reasonable linking functions, the danger of uninterpretable data is high.

Informativeness: Gradient, Not Categorical
The notion of informativeness that the classic view departs from was originally an entailment-based one (Horn 1972, Matsumoto 1995): One alternative is more informative than the other if it asymmetrically entails it. Thus, ⟨all, some⟩ is classically taken to be a lexicalized scale because declarative sentences with unembedded some are entailed by those same sentences if some is replaced by all. However, it was soon recognized that what is necessary for scalar implicatures to be derived is not an entailment relation between scalemates specifically, but any ordering relation that identifies a stronger and a weaker scalemate, including (among many others) ranking of entities, states, and attributes; whole/part relationships; and type/subtype, instance-of, and generalization/specialization relations (Carston 1998, Hirschberg 1985).
RSA uses a related but more general (and, importantly, gradient) notion of informativeness: the amount of information that a literal listener would not yet know about whether $m$ is true after hearing it described by utterance $u$. Put differently, it is the post-utterance surprisal of $m$ (Cover 1999) that the speaker aims to minimize (Goodman & Stuhlmüller 2013). To see that this notion captures the entailment-based asymmetry for ⟨all, some⟩, we can compute the surprisal of $m_4$ given $u_{\text{all}}$ and $u_{\text{some}}$:

$$-\ln P_{L_0}(m_4 \mid u_{\text{all}}) = -\ln(1) = 0 \tag{5}$$

$$-\ln P_{L_0}(m_4 \mid u_{\text{some}}) = -\ln(0.25) \approx 1.39 \tag{6}$$

Thus, the surprisal of $m_4$ is lower after observing $u_{\text{all}}$ rather than $u_{\text{some}}$. This qualitative relation holds for any alternatives that form strength-ordered scales under the classic view. However, there are two ways that this notion of informativeness goes beyond the classic view: First, alternatives need not function on lexicalized scales to stand in an informativeness relationship to one another. This opens the door for modeling alternatives-based reasoning across phenomena in a unified way without requiring special definitions for scalar implicatures in particular. This is a notable strength of the framework. Second, this notion of informativeness is gradient and depends on the size of the contextual space of meanings. To see this, consider two alternative cookie scenarios: one in which there are only 3 cookies, and one in which there are 10. Given the same assumption of a uniform prior over meanings, the literal listener probabilities of inferring $m_3$ and $m_{10}$ upon observing $u_{\text{some}}$ are 0.33 and 0.1, respectively. Thus, while the surprisal of the maximal state upon observing $u_{\text{all}}$ remains the same, observing $u_{\text{some}}$ with 3 cookies yields lower surprisal (1.1) than with 10 cookies (2.3). Thus, $u_{\text{all}}$ is much more informative than $u_{\text{some}}$ with 10 cookies, but only moderately more informative with 3 cookies, making it more strongly preferred in the former case than in the latter. This has downstream effects on the pragmatic listener: The posterior belief in the maximal state after observing $u_{\text{some}}$ should be lower (i.e., the implicature "stronger") when the state space is larger. Under the categorical notion of informativeness assumed by the classic view, no such effect of contextual meaning alternatives is predicted. To our knowledge, this is an empirical prediction that has yet to be tested in detail.

Annu. Rev. Linguist. 2023. 9:519-540
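The dependence of informativeness on the size of the meaning space can be checked numerically. The sketch below assumes a uniform prior, α = 1, zero cost, and the three-utterance game generalized to n cookies; the function names are hypothetical.

```python
from math import log


def surprisal_of_max_state(n):
    """Surprisal (in nats) of the maximal state after 'all' and after 'some', for n cookies."""
    p_after_all = 1.0       # 'all' is true only of the maximal state
    p_after_some = 1.0 / n  # 'some' is true of states 1..n, equally likely a priori
    return log(1.0 / p_after_all), log(1.0 / p_after_some)


def posterior_max_state_after_some(n):
    """P_L1(maximal state | 'some') with alpha = 1, zero cost, uniform prior."""
    # In the maximal state, 'some' (weight 1/n) competes with 'all' (weight 1).
    speaker_some_at_max = (1.0 / n) / (1.0 + 1.0 / n)
    # In states 1..n-1, 'some' is the only true utterance (speaker probability 1).
    # State 0 contributes nothing, and the uniform prior cancels out.
    weights = [1.0] * (n - 1) + [speaker_some_at_max]
    return weights[-1] / sum(weights)


for n in (3, 4, 10):
    s_all, s_some = surprisal_of_max_state(n)
    print(n, round(s_all, 2), round(s_some, 2), round(posterior_max_state_after_some(n), 4))
```

In this sketch the posterior belief in the maximal state after "some" shrinks as the number of cookies grows, in line with the prediction that the implicature is stronger in larger state spaces.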

Alternatives: Contextually Specified, Not (Necessarily) Lexicalized
RSA's treatment of utterance alternatives is very flexible in that it stipulates no inherent restrictions on alternatives for a given phenomenon. It simply is not a theory of alternatives. This is a potential advantage of RSA because it allows the researcher the freedom to plug in domain- or phenomenon-specific sets of alternatives. It can even be used as a hypothesis-testing tool to ask which set of alternatives is most justified by speaker/listener choices in targeted experimental tasks (Franke 2014, Peloquin & Frank 2016).
To see the effect of varying alternatives, we may turn back to our cookie example. Thus far, we have specified the classically assumed alternatives to model scalar implicature, $u_{\text{some}}$ and $u_{\text{all}}$.¹⁵ That is, we have implemented a particular hypothesis about which alternatives are relevant, which is consistent with both the classic view of lexicalized scales (Horn 1972) and with structural theories of alternatives assumed by grammatical approaches to scalar implicature (Katzir 2007, Fox & Katzir 2011).¹⁶ However, recent research has shown experimentally that inclusion of additional utterance alternatives can modulate the computation of scalar implicatures (Degen & Tanenhaus 2016, Sun & Breheny 2020). For instance, naturalness ratings for some applied to small set sizes decrease if number terms like two and three are contextually available (Degen & Tanenhaus 2015). Does RSA predict this result? To answer this question, we must first decide how to map naturalness ratings (a behavioral measure) onto a model quantity. That is, we must specify a linking hypothesis. Informally, we assume that naturalness ratings are a measure of the speaker's production probability. The greater the probability, the higher the naturalness rating, and vice versa. Thus, we can ask, does including number terms lead to a decrease in the RSA speaker production probability of $u_{\text{some}}$? In the basic scalar implicature example, for the intermediate states $m_1$–$m_3$, the only utterance available to the speaker is $u_{\text{some}}$. That is, the probability of $u_{\text{some}}$ for communicating any of these states is 1. It is now easy to see that introducing number alternatives reduces this probability: Assuming an exact semantics for numbers (though that is not necessary), the number term is a more informative alternative to $u_{\text{some}}$ and is hence preferred by the speaker.
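The effect of adding number terms can be illustrated with a small sketch. It assumes an exact semantics for the number terms, a uniform prior, α = 1, and zero cost; the utterance labels and function names are hypothetical choices made for this example.

```python
MEANINGS = [0, 1, 2, 3, 4]  # number of cookies Alex ate

BASIC = {"all": {4}, "some": {1, 2, 3, 4}, "none": {0}}
# Assumption: number terms receive an exact ("exactly n") semantics.
WITH_NUMBERS = {**BASIC, "one": {1}, "two": {2}, "three": {3}, "four": {4}}


def literal_listener(u, semantics):
    """P_L0(m | u) under a uniform prior: uniform over the states where u is true."""
    true_states = semantics[u]
    return {m: (1.0 / len(true_states) if m in true_states else 0.0) for m in MEANINGS}


def pragmatic_speaker(m, semantics):
    """P_S1(u | m) with alpha = 1 and zero cost, i.e. proportional to P_L0(m | u)."""
    scores = {u: literal_listener(u, semantics)[m] for u in semantics}
    total = sum(scores.values())
    return {u: s / total for u, s in scores.items()}


print(round(pragmatic_speaker(2, BASIC)["some"], 3))
print(round(pragmatic_speaker(2, WITH_NUMBERS)["some"], 3))
```

Under these assumptions the speaker probability of "some" for the state with two cookies eaten drops from 1 to 0.2 once "two" is available, which is the qualitative direction of the naturalness-rating effect under the linking hypothesis described above.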
Indeed, recent work has shown that despite number terms and weaker quantifiers like few not classically being taken to constitute alternatives to some, including them in the set of utterance alternatives is justified by experimental data (Franke 2014, Peloquin & Frank 2016), suggesting a derigidification of theories of alternatives. Bergen et al. (2016) discuss further arguments against too many restrictions on sets of alternatives, in favor of leaving the selection of the relevant ones up to aspects of context.
One might worry that this flexibility in alternative sets invites problems like the symmetry problem (Gotzner & Romoli 2022): If no restrictions are placed on alternatives, the stronger alternative $u_{\text{some\_but\_not\_all}}$ might be included in the alternatives. Doing so under the classic view (and under grammatical accounts of scalar implicatures) leads to a contradiction because observing $u_{\text{some}}$ leads to the negation of both stronger alternatives. Under RSA, this problem is circumvented because (a) reasoning is probabilistic and (b) the increased complexity of the added alternative can be penalized by a cost term. This cost in turn leads the listener to reason that a speaker intending to communicate one of $m_1$–$m_3$ would have been more likely to use the less costly $u_{\text{some}}$ than the more costly $u_{\text{some\_but\_not\_all}}$. The effect is that both of these alternatives can coexist as ways of signaling the same meanings (for more discussion, see Bergen et al. 2016).
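The way a cost term defuses the symmetry problem can be sketched as follows. The cost value of 1.0 for the longer utterance is an arbitrary assumption chosen for illustration, as are all names; the sketch further assumes α = 1 and a uniform prior.

```python
from math import exp

MEANINGS = [0, 1, 2, 3, 4]  # number of cookies Alex ate
SEMANTICS = {
    "all": {4},
    "some": {1, 2, 3, 4},
    "some_but_not_all": {1, 2, 3},
    "none": {0},
}
PRIOR = 1.0 / len(MEANINGS)  # uniform prior
COST = {"some_but_not_all": 1.0}  # hypothetical cost; other utterances cost 0


def normalize(scores):
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(u):
    states = SEMANTICS[u]
    return {m: (1.0 / len(states) if m in states else 0.0) for m in MEANINGS}


def pragmatic_speaker(m, cost):
    # alpha = 1: exp(ln P_L0(m | u) - c(u)) equals P_L0(m | u) * exp(-c(u))
    return normalize({u: literal_listener(u)[m] * exp(-cost.get(u, 0.0)) for u in SEMANTICS})


def pragmatic_listener(u, cost):
    return normalize({m: pragmatic_speaker(m, cost)[u] * PRIOR for m in MEANINGS})


print({u: round(p, 3) for u, p in pragmatic_speaker(1, {}).items()})    # no cost
print({u: round(p, 3) for u, p in pragmatic_speaker(1, COST).items()})  # with cost
print({m: round(p, 3) for m, p in pragmatic_listener("some", COST).items()})
```

Without any cost, the more specific utterance is favored for the intermediate states; with the assumed cost, "some" becomes the more likely choice while the listener's posterior after "some" remains a well-formed distribution with reduced, but nonzero, belief in $m_4$.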
In sum, the flexibility of RSA's treatment of alternatives and the ease with which RSA quantities can be linked to experimental data are useful properties. They allow the framework to be put to great theoretical use when employed as a hypothesis-testing tool for theories of alternatives.

World Knowledge: A Crucial Facet of Interpretation
Finally, a critical gain over the classic view is that RSA includes a systematic way to integrate listeners' prior beliefs about likely meanings into utterance interpretation. Psycholinguistic work has documented several ways in which such probabilistic beliefs about the world, often termed world knowledge, affect language processing and use (e.g., Hagoort et al. 2004, Hald et al. 2007, Warren & McConnell 2007, Westerbeek et al. 2015, Winograd 1972). In contrast, formal linguistic research on meaning in the tradition of Montague (1970), which is devoted to specifying how meanings of expressions are computed from the meanings of the parts of the expressions, the way the parts are combined, and the contexts in which the expressions are used, has often sidelined world knowledge as nonlinguistic, encyclopedic knowledge that must enter into the meaning computation but whose effect has eluded systematic investigation and formalization (for relevant discussion see, e.g., Beaver 2001, Dowty 1986, Hobbs 2019, Peeters 2000).¹⁷
To see how prior beliefs are integrated in RSA, we may return to our cookie example. Thus far, we have assumed a uniform prior over meanings, corresponding to the explicit assumption that the pragmatic listener a priori believes each meaning to be equally likely. Making this assumption is equivalent to assuming that only the speaker likelihood function affects pragmatic interpretation. In practice, listeners often come to communicative settings with skewed beliefs. For instance, we may know that Alex, if confronted with a plate of cookies, is likely to devour them all. We can represent this with the prior over meanings depicted in Figure 1e: We assume that with some small constant probability, it is possible that Alex will stop before all the cookies are gone or shun the plate altogether, but our overwhelming belief is that Alex will eat all the cookies. In this case, observing $u_{\text{some}}$ still leads to a decrease in the posterior probability of $m_4$ compared with this prior, but the most likely outcome is still that Alex ate all of the cookies (see Figure 1f-h for visualizations of model quantities).
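The effect of a skewed prior can be sketched as follows. The specific prior values (0.96 on $m_4$ and 0.01 on every other state) are hypothetical stand-ins, since the exact values behind Figure 1e are not given here; the sketch also assumes α = 1 and zero cost, and all names are illustrative.

```python
MEANINGS = [0, 1, 2, 3, 4]  # number of cookies Alex ate
SEMANTICS = {"all": {4}, "some": {1, 2, 3, 4}, "none": {0}}
# Hypothetical skewed prior: Alex almost certainly eats all the cookies.
SKEWED_PRIOR = {0: 0.01, 1: 0.01, 2: 0.01, 3: 0.01, 4: 0.96}


def normalize(scores):
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(u, prior):
    return normalize({m: (prior[m] if m in SEMANTICS[u] else 0.0) for m in MEANINGS})


def pragmatic_speaker(m, prior):
    # alpha = 1, zero cost: proportional to P_L0(m | u)
    return normalize({u: literal_listener(u, prior)[m] for u in SEMANTICS})


def pragmatic_listener(u, prior):
    return normalize({m: pragmatic_speaker(m, prior)[u] * prior[m] for m in MEANINGS})


posterior = pragmatic_listener("some", SKEWED_PRIOR)
print({m: round(p, 3) for m, p in posterior.items()})
```

With this assumed prior, the posterior belief in $m_4$ after "some" should fall slightly below the prior value of 0.96 while remaining by far the most probable state, matching the qualitative pattern described above.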
This way of integrating priors is not unique to RSA; Bayes' rule simply captures the mathematically optimal way for a rational agent to weight their prior beliefs against new evidence.It is a largely open empirical question to what extent prior beliefs affect pragmatic interpretation in the expected ways.Preliminary evidence suggests that these effects are weaker than expected for scalar inferences (Degen et al. 2015) but play a decisive explanatory role in the interpretation of generics (Tessler & Goodman 2019) and in the interpretation of referring expressions in simple reference games (Qing & Franke 2015, Sikos et al. 2021).

STANDARD RSA EXTENSIONS FOR MODELING CONTEXT
The basic model introduced up to this section already captures that language production and interpretation are probabilistic, uncertain processes. It also allows for implementing different types of context dependence, including the assumed meaning space, set of alternatives, and prior beliefs. However, it makes various idealizing assumptions regarding aspects of the communicative setting that interlocutors are treated as not having uncertainty about. In the following subsections, I introduce additional pieces of probabilistic machinery that are standardly deployed in RSA models to allow for more sophisticated integration of contextual information. I describe several contextual factors that have been shown to affect the interpretation of sentences featuring some. I also discuss how the machinery has been put to use in accounting for other phenomena.

Conditioning on Additional Variables: Question-Under-Discussion Effects
Utterance interpretation is widely assumed to be sensitive to the contextually salient Question Under Discussion [QUD; Roberts 2012 (1998)]. Scalar implicatures are no exception (Cummins & Rohde 2015, Degen & Goodman 2014, Ronai & Xiang 2021, Zondervan 2010). In particular, (implicit or explicit) QUDs that make the stronger alternative $u_{\text{all}}$ relevant are more likely to elicit scalar implicatures than QUDs that do not.¹⁸ In our cookie example, observing $u_{\text{some}}$ under the QUD $q_{\text{all}}$ (Did Alex eat all of the cookies?) is more likely to elicit a stronger scalar implicature (lead to lower belief in $m_4$) than the QUD $q_{\text{any}}$ (Did Alex eat any of the cookies?) (for similar examples, see Kursat & Degen 2020). Intuitively, this is because under $q_{\text{all}}$, the stronger alternative is highly relevant: Knowing whether it is true or false exhaustively answers the question. In contrast, under $q_{\text{any}}$, the stronger alternative answers the question if it is true, but not if it is false, making it somewhat less relevant.
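Within RSA, QUD effects can be captured by conditioning the pragmatic listener on an additional variable $q \in Q$, with $Q = \{q_{\text{all}}, q_{\text{any}}\}$, that encodes the QUD:

$$P_{L_1}(m \mid u, q) \propto P_{S_1}(u \mid m, q) \cdot P(m) \cdot P(q) \tag{7}$$

In particular, $q$ introduces varying projection functions that create equivalence classes in the meaning space:

$$q_{\text{all}}(m) = \begin{cases} \{m_4\} & \text{if } m = m_4 \\ \{m_0, m_1, m_2, m_3\} & \text{otherwise} \end{cases} \tag{8}$$

$$q_{\text{any}}(m) = \begin{cases} \{m_0\} & \text{if } m = m_0 \\ \{m_1, m_2, m_3, m_4\} & \text{otherwise} \end{cases} \tag{9}$$

Each corresponding function $q$ projects the literal listener's inferred meaning into one of the two cells of the partition induced by the QUD:

$$P_{L_0}(q(m) \mid u) = \sum_{m' : q(m') = q(m)} P_{L_0}(m' \mid u) \tag{10}$$

This means that the speaker's communicative goal is to be informative only with respect to this partition:

$$P_{S_1}(u \mid m, q) \propto \exp(\alpha \ln P_{L_0}(q(m) \mid u)) \tag{11}$$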
The resulting pragmatic listener beliefs in $m_4$ after observing $u_{\text{some}}$ are shown in Figure 3. The pragmatic listener for $q_{\text{all}}$ is identical to the basic pragmatic listener. This is because the basic pragmatic listener, by virtue of the meaning space distinguishing between all possible numbers of cookies eaten, already implicitly assumes an even more fine-grained QUD than $q_{\text{all}}$. In contrast, the posterior probability of $m_4$ increases under $q_{\text{any}}$. This is because at the literal listener level, the QUD is fully (and positively) resolved regardless of whether $u_{\text{some}}$ or $u_{\text{all}}$ is observed. Thus, $u_{\text{some}}$ and $u_{\text{all}}$ are equally good alternatives for the speaker to resolve the QUD. The consequence is an upweighting of the speaker probability of producing $u_{\text{some}}$ if the true state of the world is $m_4$. As a result, the pragmatic listener believes it is somewhat more likely that $m_4$ was the intended meaning upon observing $u_{\text{some}}$.
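The QUD-sensitive computation in Equations 7 and 11 can be sketched numerically. The following is a minimal illustration rather than the implementation behind Figure 3. It assumes a meaning space $m_0, \ldots, m_4$ (number of cookies eaten), the alternatives `none`, `some`, and `all`, a uniform prior, and α = 1. It also adds one assumption that is not stated in Equation 11: the speaker chooses only among literally true utterances. All names (`QUDS`, `speaker`, `pragmatic_listener`) are hypothetical.

```python
import numpy as np

# Assumed toy setup: Alex ate 0-4 of 4 cookies (m_0 ... m_4).
MEANINGS = np.arange(5)
UTTERANCES = ["none", "some", "all"]
# TRUTH[u, m] = 1 if utterance u is literally true in meaning m.
TRUTH = np.array([
    [1, 0, 0, 0, 0],   # u_none
    [0, 1, 1, 1, 1],   # u_some (at least one)
    [0, 0, 0, 0, 1],   # u_all
], dtype=float)
PRIOR = np.full(5, 0.2)   # uniform prior P(m), an assumption
ALPHA = 1.0               # speaker optimality, an assumption

# QUDs as projection functions from a meaning to the label of its cell.
QUDS = {
    "identity": lambda m: int(m),     # maximally fine-grained (basic model)
    "all": lambda m: bool(m == 4),    # Did Alex eat all of the cookies?
    "any": lambda m: bool(m > 0),     # Did Alex eat any of the cookies?
}


def literal_listener():
    """P_L0(m | u): literal truth times prior, normalized over meanings."""
    scores = TRUTH * PRIOR
    return scores / scores.sum(axis=1, keepdims=True)


def speaker(qud_name):
    """P_S1(u | m, q) as in Equation 11, restricted to true utterances."""
    project = QUDS[qud_name]
    cells = np.array([project(m) for m in MEANINGS])
    same_cell = (cells[:, None] == cells[None, :]).astype(float)
    # P_L0(q(m) | u): literal-listener mass on the QUD cell that contains m.
    l0_cell = literal_listener() @ same_cell
    # exp(alpha * ln x) equals x ** alpha; TRUTH enforces the added
    # assumption that only literally true utterances are produced.
    scores = TRUTH * l0_cell ** ALPHA
    return scores / scores.sum(axis=0, keepdims=True)


def pragmatic_listener(utterance, qud_name):
    """P_L1(m | u, q); P(q) is constant for a fixed q and cancels out."""
    u = UTTERANCES.index(utterance)
    scores = speaker(qud_name)[u] * PRIOR
    return scores / scores.sum()


for name in QUDS:
    print(name, np.round(pragmatic_listener("some", name), 3))
```

Under these assumptions, the listener for the `all` QUD coincides with the listener for the `identity` QUD (probability 0.0625 for $m_4$), whereas $m_4$ receives a probability of roughly 0.14 under the `any` QUD, in line with the qualitative pattern described above.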

Figure 3 Pragmatic listeners who observe $u_{\text{some}}$ in the simple scalar implicature game, derived from the prior in Figure 1a, and pragmatic speakers that vary in the addressed Question Under Discussion. Axis labels: Probability, Meaning.
Note that the notion of relevance assumed here is simply the informativeness notion introduced above, but relativized to a coarser-grained representation of the meaning space. Thus, an alternative's relevance increases with the amount of information it provides about the cell of the partition induced by the QUD. This gradient notion of relevance captures the same qualitative effects as categorical notions of relevance, which take a proposition to be relevant to a QUD if and only if that proposition provides a full or partial answer to the QUD [e.g., Roberts 2012 (1998), van Kuppevelt 1996, van Rooij & Schulz 2004], while adding nuance.
Conditioning on additional variables is a standard piece of probabilistic machinery that can be used to model effects of additional aspects of discourse on interlocutor choices.

Joint Inference: Reasoning About Common Ground
Joint inference refers to simultaneous reasoning about the value of multiple variables. It is the piece of machinery to employ when the listener's goal is to infer more than the speaker's intended meaning (e.g., additional aspects of the linguistic or discourse context that they have uncertainty about). For instance, in Section 5.1 we assumed a particular QUD as given. But frequently, listeners have uncertainty about the QUD. Joint inference has been successfully employed for reasoning jointly about the QUD and the speaker's intended meaning in the domains of hyperbole (Kao et al. 2014b), indirect speech (Yoon et al. 2020), and quantifier scope disambiguation in universally quantified sentences with negation (Scontras & Pearl 2021). Other variables that the joint inference approach has been used to model reasoning about include the degree threshold in the interpretation of gradable adjectives (Lassiter & Goodman 2013), the collective or distributive sense of a plural predication (Scontras & Goodman 2017), and a higher-order index of an interlocutor's social identity (Cohn-Gordon & Qing 2019).
To illustrate how joint inference works for our scalar implicature cookie scenario, we may return to the issue of prior beliefs. In the basic model, prior beliefs are taken to be shared by speaker and listener; that is, they are taken to be on common ground (Clark & Marshall 1981, Stalnaker 1978). However, common ground is not fixed; it is updated moment by moment during communication (Clark & Brennan 1991). One way in which common ground is updated is by adding new speaker commitments as the discourse unfolds. Another is by means of accommodation (Beaver 2001, Karttunen 1974, Lewis 1979): If the interpretation of an utterance requires that a certain presupposition not already contained in common ground be true, listeners are expected to add that presupposition to common ground. We can think of prior beliefs in the same way: Certain prior beliefs (e.g., that marbles, when thrown into a pool, are almost certainly going to sink) can be assumed to be shared between interlocutors. However, these beliefs may have to be updated if there is a good enough reason. The wonky worlds model of Degen et al. (2015) is precisely such a model of prior belief update. It captures that the documented effect of prior beliefs on the pragmatic listener is weaker than expected under the basic RSA model, by explicitly modeling uncertainty about the prior beliefs with respect to which the pragmatic listener computation occurs:

$$P_{L_1}(m, w \mid u) \propto P_{S_1}(u \mid m, w) \cdot P(m \mid w) \cdot P(w) \tag{12}$$
The variable $w$ captures whether the world is normal or wonky. If the world is normal, the usual prior over meanings is assumed [which Degen et al. (2015) elicited empirically for a large number of items]. If the world is wonky, a backoff prior is assumed [which Degen et al. (2015) assumed to be uniform, capturing that all regular beliefs about the relevant part of the world are suspended]:¹⁹

$$P(m \mid w) = \begin{cases} P_{\text{usual}}(m) & \text{if } w \text{ is normal} \\ P_{\text{backoff}}(m) & \text{if } w \text{ is wonky} \end{cases} \tag{13}$$
The value of $w$ inferred by the pragmatic listener depends on $P_{S_1}$: The more unexpected an observed utterance is under the usual prior (because it is unlikely to be uttered under any plausible meaning $m$), the more likely the pragmatic listener is to infer that the world is wonky and, consequently, to use the backoff prior. Applied to our cookie example: If it is commonly known that Alex, when confronted with a plate of cookies, is likely to eat them all, the prior in Figure 1e is taken to be $P_{\text{usual}}(m)$, with the resulting prediction that the posterior probability of $m_4$ should be high. However, the overall probability of observing $u_{\text{some}}$ in the first place is lower under these skewed prior beliefs than under the uniform prior shown in Figure 1a.²⁰ The pragmatic listener therefore infers that the backoff prior is somewhat more likely than it was a priori (perhaps Alex is on a diet, got distracted, or does not like the cookies).
To know exactly what the listener's posterior beliefs about the meaning $m$ are, we can marginalize (take the weighted sum) over wonky and nonwonky worlds:

$$P_{L_1}(m \mid u) = \sum_{w} P_{L_1}(m, w \mid u) \tag{14}$$
A way to think of the pragmatic listener's posterior meaning beliefs is as a mixture of the standard pragmatic listener computation under either prior, weighted by how likely it is a priori that we are in a wonky world. Similarly, if we are interested in the pragmatic listener's posterior beliefs about how likely it is that the world is wonky, we can marginalize over meanings instead of over wonkiness. Degen et al. (2015) found that both marginal meaning and wonkiness values tracked listeners' experimentally elicited beliefs.
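The joint inference over meanings and wonkiness described above can be illustrated with a small numerical sketch. It assumes that the joint posterior over $m$ and $w$ is proportional to the speaker probability times the world-specific meaning prior times the prior over $w$, and that the speaker reasons about a literal listener who uses the world-specific prior. The skewed prior values, the 0.5 prior on wonkiness, the alternatives, and all names are hypothetical choices for illustration; they are not the values used by Degen et al. (2015).

```python
import numpy as np

UTTERANCES = ["none", "some", "all"]
# TRUTH[u, m] = 1 if utterance u is literally true in meaning m (m_0 ... m_4).
TRUTH = np.array([
    [1, 0, 0, 0, 0],   # u_none
    [0, 1, 1, 1, 1],   # u_some
    [0, 0, 0, 0, 1],   # u_all
], dtype=float)

# Hypothetical meaning priors for each world type; values are illustrative.
PRIORS = {
    "normal": np.array([0.02, 0.03, 0.05, 0.10, 0.80]),  # stand-in for P_usual
    "wonky": np.full(5, 0.2),                             # uniform backoff prior
}
P_WORLD = {"normal": 0.5, "wonky": 0.5}   # assumed prior over w
ALPHA = 1.0                               # assumed speaker optimality


def speaker(prior):
    """P_S1(u | m, w) given the meaning prior tied to world type w."""
    l0 = TRUTH * prior
    l0 = l0 / l0.sum(axis=1, keepdims=True)          # P_L0(m | u, w)
    scores = l0 ** ALPHA
    return scores / scores.sum(axis=0, keepdims=True)


def joint_listener(utterance):
    """P_L1(m, w | u) as a dict from world type to a vector over meanings."""
    u = UTTERANCES.index(utterance)
    joint = {w: speaker(PRIORS[w])[u] * PRIORS[w] * P_WORLD[w] for w in PRIORS}
    total = sum(v.sum() for v in joint.values())
    return {w: v / total for w, v in joint.items()}


joint = joint_listener("some")
p_meaning = sum(joint.values())     # marginalize over w
p_wonky = joint["wonky"].sum()      # marginalize over m
print("P(m | u_some):", np.round(p_meaning, 3))
print("P(wonky | u_some):", round(float(p_wonky), 3))
```

With these illustrative numbers, the posterior probability of a wonky world rises above its prior of 0.5 (to roughly 0.54) after observing `some`, and the marginal belief in $m_4$ is roughly 0.34, compared with roughly 0.67 when only the skewed prior is used.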
In sum, joint inference is a powerful piece of standard machinery that is useful in modeling reasoning about aspects of a discourse context that interlocutors might have uncertainty about.

Joint Inference: Lexical Uncertainty
A special class of joint inference RSA models, worth highlighting due to their explanatory power for a wide variety of phenomena, is lexical uncertainty models (Bergen et al. 2016). These extend the basic RSA model by including pragmatic listener reasoning about the lexical meaning that the speaker assigns to expressions. At the base of the recursion, the literal listener performs the computation of literal meaning under the assumption of different possible lexicons $L$, where $L(u, m)$ is the truth value that lexicon $L$ assigns to utterance $u$ in meaning $m$:

$$P_{L_0}(m \mid u, L) \propto L(u, m) \cdot P(m) \tag{15}$$
The pragmatic listener does not know which lexical entry the speaker is assuming, and reasons about it as follows: Given that the speaker said $u$ and that $u$ can have different lexical entries, which pair of meaning $m$ and lexicon $L$ is most likely to have caused the speaker to produce $u$? This kind of reasoning is particularly useful for modeling lexical learning phenomena, including in-the-moment adaptation to speaker-specific language use (Schuster & Degen 2020), convention formation (Hawkins et al. 2022), and word acquisition (Frank & Goodman 2014). Other areas of application include specificity implicatures, M-implicatures, free-choice inferences, and scalar diversity (Bergen et al. 2012, Champollion et al. 2019, Sun et al. 2018). A recently emerging point of contact between the RSA literature and grammatical approaches to scalar implicatures is the question of how best to account for embedded implicatures (Potts et al. 2016). Franke & Bergen (2020) have shown that it is in principle possible to synthesize RSA and grammatical approaches to implicature by treating RSA as an account of pragmatic reasoning operating on possible readings generated by local (subsentential) exhaustification proposed by grammatical approaches. This synthetic approach had more explanatory power on a data set of embedded implicature judgments than either of the approaches did in isolation.
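The listener reasoning about pairs of meaning and lexicon described at the start of this subsection can be sketched as follows. This is a simplified, one-step illustration and not the full model of Bergen et al. (2016). It assumes that the listener's joint posterior over meaning and lexicon is proportional to the speaker probability under that lexicon times the meaning prior times a prior over lexicons. The two lexicons (differing only in whether `some` is compatible with $m_4$), the uniform priors, α = 1, and all names are hypothetical.

```python
import numpy as np

UTTERANCES = ["none", "some", "all"]
# Two hypothetical lexicons over m_0 ... m_4 that differ only in "some".
LEXICONS = {
    "some_at_least_one": np.array([
        [1, 0, 0, 0, 0],
        [0, 1, 1, 1, 1],   # "some" is true whenever at least one was eaten
        [0, 0, 0, 0, 1],
    ], dtype=float),
    "some_but_not_all": np.array([
        [1, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],   # "some" lexically excludes m_4
        [0, 0, 0, 0, 1],
    ], dtype=float),
}
P_LEXICON = {name: 0.5 for name in LEXICONS}   # assumed prior over lexicons
PRIOR = np.full(5, 0.2)                        # assumed uniform P(m)
ALPHA = 1.0                                    # assumed speaker optimality


def speaker(lexicon):
    """P_S1(u | m, L) derived from the literal listener under lexicon L."""
    l0 = lexicon * PRIOR
    l0 = l0 / l0.sum(axis=1, keepdims=True)          # P_L0(m | u, L)
    scores = l0 ** ALPHA
    return scores / scores.sum(axis=0, keepdims=True)


def joint_listener(utterance):
    """Joint posterior over (m, L) given u, as a dict keyed by lexicon."""
    u = UTTERANCES.index(utterance)
    joint = {
        name: speaker(lex)[u] * PRIOR * P_LEXICON[name]
        for name, lex in LEXICONS.items()
    }
    total = sum(v.sum() for v in joint.values())
    return {name: v / total for name, v in joint.items()}


joint = joint_listener("some")
p_meaning = sum(joint.values())                          # marginal over L
p_lexicon = {name: float(v.sum()) for name, v in joint.items()}
print("P(m | u_some):", np.round(p_meaning, 3))
print("P(L | u_some):", {k: round(v, 3) for k, v in p_lexicon.items()})
```

In this toy setting, uncertainty about the lexicon lowers the marginal belief in $m_4$ after `some` to roughly 0.03, because one of the two candidate lexicons rules out $m_4$ outright.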

CHALLENGES AND FUTURE DIRECTIONS
As is the case with any formally explicit modeling framework, RSA models include many simplifying assumptions that present a challenge to the overall framework's generalizability as well as interesting avenues for future research. In this section, I focus on three key issues.

Scalability
A common critique of RSA is that its scope is so limited as to not be useful for understanding real-world language use: It is typically applied in small toy meaning domains with very limited sets of utterance alternatives, requires explicitly specifying the literal semantics of all utterance alternatives, and does not extend beyond single-shot utterance production/interpretation. Thus, while RSA does very well in predicting production and interpretation choices for phenomena that can be broken down to such a manageable size, it is far from being able to predict naturally occurring language use. These challenges have begun to be addressed in multiple ways. First, fast-moving developments in natural language processing have allowed large pretrained neural language models to be combined with RSA reasoning, leading to successful modeling of production and interpretation of referring expressions (Cohn-Gordon et al. 2018, Monroe et al. 2017), image descriptions (Shen et al. 2019), and spatial instructions (Fried et al. 2018a,b). Second, attempts to go beyond single-shot utterances have yielded extensions of RSA to model question-answer pairs (Hawkins et al. 2015), adaptive belief update in response to repeated exposure (Schuster & Degen 2020), and convention formation (Hawkins et al. 2020, 2022). This is an exciting area of ongoing research.

Bounded Cognition
RSA provides computational-level explanations (Marr 1982) of pragmatic language use in the tradition of rational analysis (Anderson 1991). It makes very few assumptions about agents' resource limitations and instead seeks to characterize the optimal performance that could be achieved on a particular problem. This approach raises two challenges: First, Bayesian inference, the core RSA computation, is intractable for large-scale problems (Kwisthout et al. 2011). Specifying an account of how the mind rapidly and effectively approximates intractable calculations is an important area of ongoing research (Vul et al. 2014, White et al. 2020). Second, language users are cognitively bounded agents with limited memory, attention, and cognitive control. While α and the assumed depth of reasoning can both capture deviation from utility maximization, integrating insights from psycholinguistics that highlight the effects of resource limitations on language production and comprehension is an important avenue of future research (Ferreira & Patson 2007, Goldberg & Ferreira 2022).
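As a rough illustration of how α and the depth of reasoning modulate model predictions, the sketch below computes the listener's belief in $m_4$ after `some` for different settings. It assumes the toy cookie scenario with meanings $m_0, \ldots, m_4$, the alternatives `none`, `some`, and `all`, and a uniform prior, and it treats depth as the number of speaker-listener iterations above the literal listener. The function name and parameter values are hypothetical.

```python
import numpy as np

# TRUTH[u, m] for utterances (none, some, all) over meanings m_0 ... m_4.
TRUTH = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
], dtype=float)
PRIOR = np.full(5, 0.2)   # assumed uniform prior P(m)
SOME, M4 = 1, 4           # row index of u_some, column index of m_4


def belief_in_m4_given_some(alpha, depth):
    """Listener belief in m_4 after u_some at the given recursion depth.

    depth=0 is the literal listener; each further level applies one
    soft-max speaker step followed by one Bayesian listener step.
    """
    listener = TRUTH * PRIOR
    listener = listener / listener.sum(axis=1, keepdims=True)
    for _ in range(depth):
        speaker = listener ** alpha                       # exp(alpha * ln L)
        speaker = speaker / speaker.sum(axis=0, keepdims=True)
        listener = speaker * PRIOR
        listener = listener / listener.sum(axis=1, keepdims=True)
    return float(listener[SOME, M4])


for alpha in (0.5, 1.0, 4.0):
    for depth in (0, 1, 2):
        value = belief_in_m4_given_some(alpha, depth)
        print(f"alpha={alpha}, depth={depth}: {value:.4f}")
```

In this toy setting, lower values of α and shallower recursion both leave more belief in $m_4$ (a weaker implicature), which is one way of seeing how these two parameters can stand in for deviations from utility maximization without modeling specific resource limitations.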

Incrementality and Compositionality
Most RSA models make choices based on the utility of full utterances. However, a key insight from psycholinguistics is that language production and comprehension are both highly incremental processes: Speakers do not wait until they have planned an entire sentence before beginning to speak (Brown-Schmidt & Konopka 2015), and listeners do not wait until they have observed an entire sentence before beginning to interpret the unfolding signal and predicting upcoming material (Tanenhaus et al. 1995). Recent work has only just scratched the surface of incrementalizing RSA for explaining production (Cohn-Gordon et al. 2019, Waldon & Degen 2021) and comprehension (Augurzky et al. 2019, Kreiss & Degen 2020) phenomena.
Relatedly, a key insight from formal semantics is that language is compositional: The meaning of an expression is related to its structure and the meanings of its parts (Montague 1970). RSA is entirely compatible with a fully compositional system of meaning computation (for extensive discussion of RSA enhanced with stochastic λ-calculus, see Goodman & Lassiter 2015). In practice, most RSA models do not consider compositional structure (but cf. Lassiter & Goodman 2013, Scontras & Pearl 2021, and Tessler & Goodman 2019 for interesting advances in modeling the downstream effects of compositionality on adjectival vagueness, generics, and quantifier scope disambiguation).

CONCLUSION
Probabilistic pragmatics has led to rapid advances in our understanding of how language is produced and interpreted, by integrating insights from cognitive science, psycholinguistics, and traditional formal semantics and pragmatics. RSA is such an account of pragmatic language production and interpretation that allows for implementing alternative hypotheses about the relative contributions of linguistic knowledge, communicative pressures, context, and prior beliefs about the world. It crucially provides an integrated account of production and interpretation and formalizes the link between the two in a systematic way. Progress has been swift and satisfying, but there is no shortage of challenges, outlined in this article, to address.

DISCLOSURE STATEMENT
The author is not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.
Figure 1 Model outputs for simple scalar implicature game. (a) Uniform prior. (b) Literal listener with uniform prior. (c) Pragmatic speaker derived from panel b (α = 1). (d) Pragmatic listener derived from panels a and c. (e) Skewed prior. (f) Literal listener with skewed prior. (g) Pragmatic speaker derived from panel f (α = 1). (h) Pragmatic listener derived from panels e and g.
Figure 2

Table 1 Features of the classic and RSA accounts of meaning
reflect a relatively stronger belief in $m_1$, $m_2$, and $m_3$ and a relatively weaker belief in $m_4$. This relatively weaker belief in $m_4$ is what comes closest to the notion of a scalar implicature within RSA.