
Amortizing Pragmatic Program Synthesis with Rankings

Yewen Pu    Saujas Vaduguru    Priyan Vaithilingam    Elena Glassman    Daniel Fried
Abstract

The Rational Speech Acts (RSA) framework has been used successfully to build pragmatic program synthesizers that return programs which, in addition to being logically consistent with user-generated examples, account for the fact that a user chooses their examples informatively. We present a general method of amortizing the slow, exact RSA synthesizer. Our method first queries the exact RSA synthesizer to compile a communication dataset. The dataset contains a number of example-dependent rankings of subsets of programs. It then distills a single global ranking of all programs as an approximation to every ranking in the dataset. This global ranking is then used at inference time to rank multiple logically consistent candidate programs generated by a fast, non-pragmatic synthesizer. Experiments on two program synthesis domains show that our ranking method yields speedups of orders of magnitude over the exact RSA synthesizer, while being more accurate than a non-pragmatic synthesizer when communicating with humans. Finally, we prove that in the special case of synthesis from a single example, this approximation is exact.

Machine Learning, ICML

1 Introduction

For intelligent systems to be accessible to end users, it is important that they can infer the user’s intent under ambiguity. Imagine a person asking an AI assistant to generate a regular expression that matches the string 123-7890. It would be unhelpful if the AI assistant simply returned the regular expression $\Sigma^{*}$, the expression that matches all strings, although it is technically correct. The rational speech acts model (RSA) of pragmatics (Frank & Goodman, 2012) gives an algorithm for resolving ambiguities by modeling the user as a speaker that chooses informative examples for the system, via recursive Bayesian reasoning. Given several competing responses, for instance $\texttt{regex}_{1}$ = \d{3}-\d{4} and $\texttt{regex}_{2}$ = $\Sigma^{*}$, RSA would reason that an informative user is more likely to use the example 123-7890 to describe $\texttt{regex}_{1}$ than $\texttt{regex}_{2}$, allowing it to prefer the intended regex. Recent works (Pu et al., 2020; Vaithilingam et al., 2023) have leveraged the RSA algorithm to build pragmatic program synthesizers: interactive systems that take in user-given examples (e.g. strings) and return programs (e.g. regexes) that are both logically consistent and take into account the informativity of the chosen examples. Their algorithm, which we refer to as RSA, is applicable to any program synthesis domain where programs can be efficiently enumerated (Feser et al., 2015; Solar-Lezama, 2008; Gulwani, 2011), and produces a pragmatic synthesizer which interacts well with humans, while requiring no labeled human data.

Figure 1: (left) Directly using the exact RSA algorithm in a pragmatic synthesizer $L_{1}$ is slow. (right) Our approach uses RSA to generate a simulated communication dataset between the informative speaker $S_{1}$ and the pragmatic synthesizer $L_{1}$, and stores the responses of $L_{1}$ as example-dependent rankings of subsets of programs. We then distill the dataset into a single example-agnostic global ranking of all programs $\sigma[w]$. This global ranking is then used to build a fast pragmatic synthesizer $L_{\sigma}$, by using the examples only to filter for consistent programs, then using the global ranking to sort them. This amortized synthesizer selects programs similarly to an exact RSA synthesizer, while being orders of magnitude faster.

The RSA algorithm marginalizes across all possible examples (e.g. all strings) and programs (e.g. all regexes) multiple times. This makes it difficult to scale RSA to large domains, where users expect the system to complete its inference in real-time. Prior works in scaling up RSA computation (Monroe et al., 2017; Andreas & Klein, 2016) have largely focused on sampling and re-ranking, curbing RSA’s computation to a small subset of programs and examples. In this work, we show a simple yet effective way of amortizing RSA via a single global ranking of all programs. Rather than using RSA directly at inference time, our method uses it to generate training data in the form of example-dependent rankings of subsets of programs. We then distill a global ranking from the training data, amortizing the computation of RSA (Figure 1). At inference time, a fast, non-pragmatic synthesizer is used to propose multiple logically consistent programs, and the global ranking is used to quickly rank them (in our example, the regex $\Sigma^{*}$ would be ranked lower than other consistent programs), resulting in a pragmatic yet efficient synthesizer.

This work makes the following contributions. (1) We describe a general method of amortizing the RSA algorithm (considered in Cohn-Gordon et al. (2018b); Pu et al. (2020); Vaithilingam et al. (2023)) applicable to any pragmatic program synthesis domain. (2) Using global ranking, we scale the model proposed by Vaithilingam et al. (2023) to a larger domain while still allowing for real-time interaction. We conduct a small user study validating that end-users are more accurate communicating with a ranking-based program synthesizer compared to a non-pragmatic one (+27%, +41% relative). (3) We conduct simulated user studies by replaying the human interactive synthesis data collected from Pu et al. (2020) and Vaithilingam et al. (2023). We confirm that our ranking-based synthesizer retains the communicative accuracy of RSA (55%, 92% respectively), while running orders of magnitude (over 100 times) faster. (4) We prove that in the special case of synthesis from just a single example, RSA_single, a setting studied in the original RSA literature (Goodman & Frank, 2016; Vogel et al., 2013; Monroe & Potts, 2015; Smith et al., 2013), the approximation using a global ranking is exact.

2 Background on Pragmatic Synthesis

In this section, we provide background on a reference game framework of program synthesis, which affords building a pragmatic synthesizer that can infer a user’s intended program from few examples (Pu et al., 2020). We illustrate this framework using a toy example from a small version of the regular expression domain of this work.

Figure 2: A boolean lexicon for a small reference game of regular expressions. The rows are the utterances (strings) and the columns are hypotheses (regexes), and each entry denotes if a string is consistent with a regex. The $L_{0}$ and $L_{1}$ matrices show conditional probabilities that would be inferred by a synthesizer performing literal and pragmatic inference respectively.

2.1 Synthesis as a Reference Game

Consider the problem where a user gives example strings to a synthesis system, and asks it to find a matching regular expression. This process can be modeled as a reference game (Lewis, 1979), where a speaker (the user) chooses a few utterances (strings) to give to the listener (the synthesizer), with the intention that the listener can infer the correct hypothesis (regular expression). This reference game is characterized by the lexicon $M$, a boolean matrix of 1s and 0s (Figure 2). In $M$, each row corresponds to an utterance/example and each column corresponds to a hypothesis/program, and each 1 indicates that the corresponding utterance and hypothesis are consistent: that is, the program’s output (e.g. deciding whether a regular expression matches a string) is consistent with the example (e.g. the string). As we can see, a given utterance (such as 001) may be consistent with multiple hypotheses (0+{1}, 0{2}1+, and 0+1*).
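As a minimal sketch of how such a lexicon could be built, the snippet below fills a boolean matrix by testing each string against each regex. The particular strings and regexes are illustrative assumptions chosen to echo the examples in the text; they are not copied from Figure 2, and `build_lexicon` is a hypothetical helper name.

```python
import re

# Hypothetical toy domain (an assumption for illustration, not Figure 2 itself).
UTTERANCES = ["0", "01", "001", "011"]        # rows of M
HYPOTHESES = ["0+1{1}", "0{2}1+", "0+1*"]     # columns of M


def build_lexicon(utterances, hypotheses):
    """Return M as a list of rows: M[i][j] is 1 iff regex j matches string i entirely."""
    return [
        [1 if re.fullmatch(h, u) else 0 for h in hypotheses]
        for u in utterances
    ]


M = build_lexicon(UTTERANCES, HYPOTHESES)
for u, row in zip(UTTERANCES, M):
    print(f"{u:>4} {row}")
```

In this toy lexicon the row for 01 has two 1s and the row for 001 has three, which is the kind of ambiguity the rest of the section addresses.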

2.2 A Literal Program Synthesizer

How might we build a system that takes an utterance (say 01) and produces the intended hypothesis 0+1{1}? As 01 is consistent with multiple hypotheses (0+1{1} and 0+1*), a naive strategy is to treat all consistent hypotheses as equally likely, scaled by a prior distribution over hypotheses $P(w)$:

\begin{align}
L_{0}(w|u) &\propto P(w)\, M[u,w] \tag{1} \\
&= P(w)\, \frac{M[u,w]}{\sum_{w'} M[u,w']} \tag{2}
\end{align}

A synthesizer built this way is a literal listener $L_{0}$ (Bergen et al., 2016). Assuming the prior $P(w)$ is uniform over programs, we can construct it by normalizing the rows of the matrix $M$, resulting in a probability distribution over hypotheses $W$ given utterances $u$ (Figure 2). As we can see, given the utterance 01, this listener predicts an equal probability of 0+1{1} and 0+1* being the intended program.
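The row normalization can be sketched as follows, assuming a uniform prior and a small hand-written lexicon. The matrix values are a hypothetical toy example (rows: 0, 01, 001, 011; columns: 0+1{1}, 0{2}1+, 0+1*), not the contents of Figure 2.

```python
def literal_listener(M):
    """Row-normalize a 0/1 lexicon: L0[u][w] = M[u][w] / sum over w' of M[u][w']."""
    L0 = []
    for row in M:
        total = sum(row)
        # An utterance consistent with no hypothesis gets an all-zero row.
        L0.append([x / total if total else 0.0 for x in row])
    return L0


# Hypothetical toy lexicon (assumption): rows are utterances, columns are hypotheses.
M = [
    [0, 0, 1],  # "0"
    [1, 0, 1],  # "01"
    [1, 1, 1],  # "001"
    [0, 0, 1],  # "011"
]

L0 = literal_listener(M)
print(L0[1])  # the row for utterance "01"
```

For the utterance 01 the two consistent hypotheses each receive probability 0.5, matching the tie described above.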

2.3 A Pragmatic Synthesizer from a Single Example

A key insight to improving on the literal synthesizer is to consider that a user is cooperatively choosing an utterance to be informative about the intended program to the synthesizer. The Rational Speech Acts (RSA) framework models this informative choice of utterances using recursive Bayesian reasoning (Frank & Goodman, 2012). By reasoning about why a speaker (user) might have chosen a particular utterance (example), rather than possible alternatives, the listener (synthesizer) can disambiguate the hypothesis (program) to which the speaker was referring. Formally, the RSA framework produces a chain of alternating listeners and speakers beginning with the $L_{0}$ model above.

\begin{equation}
\begin{array}{ccccc}
S_{1}(u|w) & \propto & L_{0}(w|u) & = & \frac{L_{0}(w|u)}{\sum_{u'} L_{0}(w|u')} \\
L_{1}(w|u) & \propto & S_{1}(u|w) & = & \frac{S_{1}(u|w)}{\sum_{w'} S_{1}(u|w')}
\end{array}
\tag{3}
\end{equation}

Applying this framework amounts to normalizing the columns of the $L_{0}$ matrix to obtain a pragmatic speaker distribution $S_{1}$, then normalizing the rows of $S_{1}$ to obtain a pragmatic listener (synthesizer), $L_{1}$ (Figure 2). As we can see, given the utterance 01, this listener prefers 0+1{1} over 0+1*, reflecting the reasoning that if the user wanted to refer to 0+1*, they might have provided an example that highlights the possibility of no 1s in the string. In this paper, we shall call this algorithm RSA_single. As this algorithm only depends on $M$, it is applicable to all program synthesis domains where programs and examples can be effectively enumerated.
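The three normalization steps can be sketched in a few lines of plain Python. This is a minimal illustration under a uniform prior; the lexicon is a hypothetical toy example (rows: 0, 01, 001, 011; columns: 0+1{1}, 0{2}1+, 0+1*) rather than the one in Figure 2, and the function names are assumptions.

```python
def normalize_rows(A):
    """Scale each row to sum to 1 (all-zero rows are left as zeros)."""
    out = []
    for row in A:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def transpose(A):
    return [list(col) for col in zip(*A)]


def rsa_single(M):
    """Return (L0, S1, L1), each indexed as [utterance][hypothesis]."""
    L0 = normalize_rows(M)                          # literal listener: rows sum to 1
    S1 = transpose(normalize_rows(transpose(L0)))   # speaker: columns sum to 1
    L1 = normalize_rows(S1)                         # pragmatic listener: rows sum to 1
    return L0, S1, L1


# Hypothetical toy lexicon (assumption).
M = [
    [0, 0, 1],  # "0"
    [1, 0, 1],  # "01"
    [1, 1, 1],  # "001"
    [0, 0, 1],  # "011"
]

L0, S1, L1 = rsa_single(M)
print("L0 for '01':", L0[1])
print("L1 for '01':", [round(p, 3) for p in L1[1]])
```

On this toy lexicon the literal listener splits 01 evenly between 0+1{1} and 0+1*, while the pragmatic listener assigns roughly 0.77 to 0+1{1} and 0.23 to 0+1*, because 0+1* has more alternative utterances competing for the speaker's probability mass.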

2.4 A Pragmatic Synthesizer from Multiple Examples

Figure 3: In the case of incremental RSA, the meaning matrix becomes smaller as more utterances are given, as each utterance rules out hypotheses that are inconsistent with it.

RSA_single is capable of producing a program synthesis algorithm from a single example. However, users will typically have to clarify their intent interactively, by giving a sequence of multiple utterances $\mathbf{u} = u_{1}, u_{2}, \ldots, u_{n}$. The synthesizer must infer the intended program after every turn. With each new utterance, the meaning matrix $M$ becomes smaller, as hypotheses inconsistent with the new utterance are ruled out (Figure 3). This is an instance of incremental RSA (Cohn-Gordon et al., 2018b), which models the informative speaker $S_{1}$ generating utterances auto-regressively:

\begin{align*}
S_{1}(\mathbf{u}|w) &= S_{1}(u_{1}, u_{2}, \ldots, u_{n}|w) \\
&= \prod_{i=1}^{n} S_{1}(u_{i}|w, u_{1}, \ldots, u_{i-1}) \\
&= \prod_{i=1}^{n} \frac{L_{0}(w|u_{1}, \ldots, u_{i})}{\sum_{w'} L_{0}(w'|u_{1}, \ldots, u_{i})}
\end{align*}

In essence, $S_{1}$ is the product of multiple single-utterance $S_{1}$ terms, each computed on a separate meaning matrix (like those in Figure 3). The synthesizer $L_{1}(w|\mathbf{u})$ is defined recursively on top of $S_{1}$: $L_{1}(w|\mathbf{u}) \propto S_{1}(\mathbf{u}|w)$.
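A rough sketch of this incremental computation is given below. It makes two assumptions that should be read as such: the prior is uniform, and each per-turn speaker factor is normalized over the candidate next utterances, in the same way as the single-utterance $S_{1}$ of Equation 3. The lexicon is a hypothetical toy example (rows: 0, 01, 001, 011; columns: 0+1{1}, 0{2}1+, 0+1*), and all function names are illustrative.

```python
def consistent(M, us):
    """Indices of hypotheses consistent with every utterance index in us."""
    return [w for w in range(len(M[0])) if all(M[u][w] for u in us)]


def l0(M, us, w):
    """Literal listener under a uniform prior: 1/|consistent set| if w is in it."""
    ws = consistent(M, us)
    return 1.0 / len(ws) if w in ws else 0.0


def s1_step(M, prefix, u, w):
    """S1(u | w, prefix), normalized over all candidate next utterances (assumption)."""
    z = sum(l0(M, prefix + [u2], w) for u2 in range(len(M)))
    return l0(M, prefix + [u], w) / z if z else 0.0


def s1_sequence(M, us, w):
    """Product of per-turn speaker probabilities for the whole sequence us."""
    p = 1.0
    for i, u in enumerate(us):
        p *= s1_step(M, us[:i], u, w)
    return p


def l1(M, us):
    """Pragmatic listener: normalize S1(us | w) over hypotheses w."""
    scores = [s1_sequence(M, us, w) for w in range(len(M[0]))]
    z = sum(scores)
    return [s / z if z else 0.0 for s in scores]


# Hypothetical toy lexicon (assumption).
M = [
    [0, 0, 1],  # index 0: "0"
    [1, 0, 1],  # index 1: "01"
    [1, 1, 1],  # index 2: "001"
    [0, 0, 1],  # index 3: "011"
]

one_turn = l1(M, [1])        # user gave "01"
two_turns = l1(M, [1, 2])    # user gave "01", then "001"
print([round(p, 3) for p in one_turn])
print([round(p, 3) for p in two_turns])
```

With a single utterance this reduces to the RSA_single computation; with the second utterance the preference for 0+1{1} sharpens further in this toy setting, since a speaker intending 0+1* had more distinctive utterances available at both turns.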

Pu et al. (2020) build on top of the incremental RSA algorithm with additional memoization strategies. In this work, we shall call their algorithm RSA. Similar to RSA_single, this algorithm is applicable to enumerative program synthesis domains such as those of Feser et al. (2015), Solar-Lezama (2008), and Gulwani (2011).

2.5 Exact RSA is Slow

In practice, it is infeasible to explicitly store the matrices $M, L_{0}, S_{1}, L_{1}$. Instead, computing $L_{1}$ using RSA requires $\mathcal{O}(|W|)$ calls to $S_{1}$. Each call to compute $S_{1}$ requires $\mathcal{O}(|U|)$ calls to $L_{0}$, which in turn requires $\mathcal{O}(|W|)$ operations to determine a set of consistent programs. Overall, the pragmatic synthesizer $L_{1}$ runs in $\mathcal{O}(|W|^{2}|U|)$ time. In the incremental RSA setting with multiple (say $\ell$) utterances, the runtime of $L_{1}$ is $\mathcal{O}(|W|^{2}|U|\ell)$. As the number of hypotheses and utterances becomes large in a program synthesis domain, it becomes infeasible to compute $L_{1}$ at a speed required for end-user interactions.

3 Amortizing RSA with Rankings

We explain how the pragmatic listener $L_{1}$, derived from the RSA algorithm, can be amortized using a single global ranking of programs.

Finding Consistent Programs

Finding correct programs given a sequence of examples $\mathbf{u} = u_{1}, u_{2}, \dots$ is the primary challenge of program synthesis, with solutions ranging from enumeration (Feser et al., 2015) and constraint solving (Solar-Lezama et al., 2006) to neuro-symbolic methods (Polosukhin & Skidanov, 2018; Balog et al., 2016) and large language models for code (Li et al., 2022). In this work, we assume that a set of $k$ consistent programs $w_{1}, w_{2}, \dots, w_{k}$ can be found using any of these techniques.

Ranking Consistent Programs with a Prior

A global ranking $\sigma$ is an un-normalized prior (a score) over all programs. The global ranking is example-agnostic: given two programs $w_{a}$ and $w_{b}$, either $\sigma[w_{a}] \succ \sigma[w_{b}]$ or $\sigma[w_{a}] \prec \sigma[w_{b}]$, irrespective of the given examples $\mathbf{u}$.

\[
L_{\sigma}(w|\mathbf{u}) \propto \sigma[w]\, M[\mathbf{u},w]
\]

As we can see, ranking the consistent programs under $\sigma[w]$ can be very efficient. In practice, efficient synthesis algorithms are built using either domain-specific heuristics for rankings (Singh & Gulwani, 2015; Polozov & Gulwani, 2015), or a learned prior from a code corpus (Li et al., 2022).
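The filter-then-sort structure of $L_{\sigma}$ can be sketched as follows. The brute-force regex filter stands in for whatever fast, non-pragmatic synthesizer proposes consistent programs, and the scores in `sigma` are made-up values for illustration only; `synthesize` is a hypothetical helper name.

```python
import re


def synthesize(examples, programs, sigma, k=3):
    """Return up to k consistent programs, highest global score first.

    The examples are used only to filter; the ordering comes entirely
    from the example-agnostic scores in sigma.
    """
    consistent = [
        p for p in programs
        if all(re.fullmatch(p, u) for u in examples)
    ]
    return sorted(consistent, key=lambda p: sigma[p], reverse=True)[:k]


# Hypothetical global scores (assumption): higher means more preferred.
sigma = {"0+1{1}": 3.0, "0{2}1+": 2.0, "0+1*": 1.0}
programs = list(sigma)

print(synthesize(["01"], programs, sigma))
print(synthesize(["001"], programs, sigma))
```

Once the consistent candidates are available, ranking costs only a sort over them, with no marginalization over alternative utterances at inference time.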

Ranking with $L_{1}$

Rather than relying on heuristics or learning from a large corpus, RSA automatically derives a ranked synthesizer $L_{1}(w|\mathbf{u})$:

\[
L_{1}(w|\mathbf{u}) \propto S_{1}(\mathbf{u}|w)\, M[\mathbf{u},w]
\]

To rank the consistent programs, $L_{1}$ uses $S_{1}(\mathbf{u}|w)$, an example-dependent ranking function that ranks the satisfying programs differently depending on the sequence of examples $\mathbf{u}$ given. In this setting with multiple examples, there could be cycles: a pair (or a triple, or a larger cycle) of satisfying programs $w_{a}$ and $w_{b}$ may be ranked $S_{1}(\mathbf{u}_{1}|w_{a}) > S_{1}(\mathbf{u}_{1}|w_{b})$ under some examples $\mathbf{u}_{1}$ and $S_{1}(\mathbf{u}_{2}|w_{a}) < S_{1}(\mathbf{u}_{2}|w_{b})$ under different examples $\mathbf{u}_{2}$. In this work, we assume that $S_{1}$ can be tractably computed at non-interactive speed.
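To make the notion of a cycle concrete, the sketch below scans a collection of example-dependent rankings for pairs of programs that appear in opposite orders. It only detects the pairwise case; longer cycles would need a proper cycle search on the induced preference graph. The program names and rankings are hypothetical placeholders.

```python
from itertools import combinations


def conflicting_pairs(rankings):
    """Return the set of program pairs that are ordered both ways across rankings.

    Each ranking is a list of programs, most preferred first.
    """
    seen = set()        # ordered pairs (a, b) meaning "a ranked above b"
    conflicts = set()
    for ranking in rankings:
        # combinations preserves list order, so a always precedes b in the ranking.
        for a, b in combinations(ranking, 2):
            if (b, a) in seen:
                conflicts.add(frozenset((a, b)))
            seen.add((a, b))
    return conflicts


# Hypothetical rankings obtained under two different example sequences (assumption).
rankings = [
    ["w_a", "w_b", "w_c"],   # under examples u_1
    ["w_b", "w_a"],          # under examples u_2
]
conflicts = conflicting_pairs(rankings)
print(conflicts)
```

Any pair reported here cannot be ordered by a single global ranking in a way that agrees with both example-dependent rankings.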

Amortizing $L_{1}$ with a Ranking

In this work, we explore whether the example-dependent ranking of $S_{1}(\mathbf{u}|w)$ can be approximated, in the sense of having similar top-$k$ responses, with an example-agnostic ranking function $\sigma[w]$. Note that due to the existence of cycles, it may be impossible to find a global ranking that is consistent with all example-dependent rankings. Our key findings are as follows:

Key Finding 1: One can distill a pragmatic ranking $\sigma_{L_{1}}$ from $L_{1}$. While this is an approximation, it nonetheless retains much of $L_{1}$’s communicative accuracy when interacting with end-users, while running orders of magnitude faster.

Key Finding 2: In the special case where only a single example is used, RSA_single, the approximation can be made exact: There exists a global ranking $\sigma^{*}$ that perfectly matches the top-$k$ responses of $L_{1}$ over any example $u$.
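One informal way to see why Key Finding 2 is plausible, assuming the uniform prior of Section 2.2: $L_{0}(w|u)$ takes the same value for every $w$ consistent with $u$, so by Equation 3 the ordering of consistent programs under $L_{1}(w|u)$ is decided by $1 / \sum_{u'} L_{0}(w|u')$, which does not depend on $u$. The sketch below checks this numerically on a hypothetical toy lexicon; it is an illustration under those assumptions, not the paper's proof, and the function names are invented for the example.

```python
def normalize_rows(A):
    out = []
    for row in A:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def global_scores(M):
    """Candidate example-agnostic score: 1 / (column sum of L0) per hypothesis."""
    L0 = normalize_rows(M)
    col_sums = [sum(L0[u][w] for u in range(len(M))) for w in range(len(M[0]))]
    return [1.0 / z if z else 0.0 for z in col_sums]


def l1_single(M):
    """RSA_single listener: S1[u][w] = L0[u][w] / column sum, then row-normalize."""
    L0 = normalize_rows(M)
    sigma = global_scores(M)
    S1 = [[L0[u][w] * sigma[w] for w in range(len(M[0]))] for u in range(len(M))]
    return normalize_rows(S1)


def rankings_agree(M):
    """True if sorting consistent programs by L1 equals sorting by the global score."""
    sigma, L1 = global_scores(M), l1_single(M)
    for u, row in enumerate(M):
        ws = [w for w, m in enumerate(row) if m]
        by_l1 = sorted(ws, key=lambda w: -L1[u][w])
        by_sigma = sorted(ws, key=lambda w: -sigma[w])
        if by_l1 != by_sigma:
            return False
    return True


# Hypothetical toy lexicon (assumption): rows 0, 01, 001, 011; columns 0+1{1}, 0{2}1+, 0+1*.
M = [[0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1]]
sigma = global_scores(M)
print([round(s, 3) for s in sigma], rankings_agree(M))
```

On this lexicon the single score vector reproduces the $L_{1}$ ordering for every utterance, which is the behavior the finding describes for the single-example case.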

4 Distilling $L_{1}$ of RSA to a Global Ranking

Distilling the example-dependent $L_{1}$ rankings into a global ranking has two stages. First, we generate a dataset $D = \{(w, \mathbf{u}, \tilde{\sigma}_{\mathbf{u}}), \dots\}$, where $w$ is a program, $\mathbf{u}$ is a specification (sequence of examples) used to describe $w$, and $\tilde{\sigma}_{\mathbf{u}} = [w_{1}, w_{2}, \dots, w_{k}]$ is the example-dependent ranking of $k$ consistent programs given $\mathbf{u}$. (It is a mouthful; we are terribly sorry.) Then, we distill a global ranking that aggregates the example-dependent rankings in $D$.

4.1 Dataset Generation via Simulated Communications

The pragmatic listener $L_{1}$ can generate a partial ranking of consistent programs for any sequence of examples $\mathbf{u}$. As arbitrary examples $\mathbf{u}$ are unlikely to reflect what a user might give at inference time, we use the informative speaker $S_{1}$ as a “stand-in”. Specifically, we generate $D$ in the form of simulated interactions between the pragmatic speaker $S_{1}$ and the pragmatic listener $L_{1}$. We enumerate over the set of programs $w \in W$, then use the pragmatic speaker to sample the most likely specifications (sequences of examples) $\mathbf{u} \sim_{\text{top-1}} S_{1}(\cdot|w)$ of length $1$ to length $N$. For each specification, we query $L_{1}$ for a partial ranking $\tilde{\sigma}_{\mathbf{u}}$ of consistent programs, and add it to the dataset $D$ (Algorithm 1).

Algorithm 1: Obtaining a dataset of simulated interactions between a speaker $S$ and a listener $L$. For each turn of each interaction, a ranking of programs is obtained.
Require: Set of programs $W$
Require: Length of specification to generate $N$
Require: Speaker model $S(u|w,\mathbf{u})$
Require: Listener model $L(w|\mathbf{u})$
Require: Function MakeRanking that ranks samples from a distribution based on their probability
  $\mathcal{D} \leftarrow \{\}$
  for $w$ in $W$ do
    $\mathbf{u} \leftarrow [\ ]$
    for $i = 1$ to $N$ do
      $u_{\textrm{next}} \leftarrow \operatorname{arg\,max}_{u} S(u|w,\mathbf{u})$
      $\mathbf{u} \leftarrow \mathbf{u} + [u_{\textrm{next}}]$
      $\tilde{\sigma}_{\mathbf{u}} \leftarrow \textsc{MakeRanking}(L(\cdot|\mathbf{u}))$
      $\mathcal{D} \leftarrow \mathcal{D} \cup \{(w, \mathbf{u}, \tilde{\sigma}_{\mathbf{u}})\}$
    end for
  end for
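The loop structure of Algorithm 1 can be sketched in Python as below. The speaker and listener here are deliberately simple stand-ins built from a literal listener over a hypothetical toy regex domain; they are assumptions for illustration and not the $S_{1}$ and $L_{1}$ models used in the paper. Because the argmax over utterances is unaffected by the speaker's normalizer, the toy speaker returns an unnormalized score.

```python
import re

# Hypothetical toy domain (assumption).
PROGRAMS = ["0+1{1}", "0{2}1+", "0+1*"]
UTTERANCES = ["0", "01", "001", "011"]


def toy_listener(us):
    """Stand-in listener: uniform over programs consistent with all utterances in us."""
    ws = [w for w in PROGRAMS if all(re.fullmatch(w, u) for u in us)]
    return {w: (1.0 / len(ws) if w in ws else 0.0) for w in PROGRAMS}


def toy_speaker(u, w, us):
    """Stand-in speaker score: the listener's belief in w after also hearing u."""
    return toy_listener(us + [u])[w]


def make_ranking(dist):
    """Programs with nonzero probability, most probable first (ties keep dict order)."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, p in ranked if p > 0]


def generate_dataset(programs, utterances, speaker, listener, n_turns):
    """Simulate n_turns of speaker/listener interaction for every program."""
    dataset = []
    for w in programs:
        us = []
        for _ in range(n_turns):
            # Greedy (top-1) choice; ties go to the earliest utterance in the list,
            # so this toy speaker may repeat an utterance.
            u_next = max(utterances, key=lambda u: speaker(u, w, us))
            us = us + [u_next]
            dataset.append((w, tuple(us), make_ranking(listener(us))))
    return dataset


dataset = generate_dataset(PROGRAMS, UTTERANCES, toy_speaker, toy_listener, n_turns=2)
for entry in dataset:
    print(entry)
```

Each entry records the intended program, the specification so far, and the listener's ranking at that turn, which is the form of data the distillation stage consumes.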
Algorithm 2: Inferring a global order $\sigma$ from a dataset of simulated interactions. The procedure terminates based on a validation criterion determined by the validation frequency $V$, patience $t$, and convergence threshold $T$.
Require: Dataset of simulated interactions $\mathcal{D}$
  $\sigma \leftarrow$ randomly initialized ranking
  converged $\leftarrow$ false
  $N_{\textrm{swaps}} \leftarrow [\ ]$
  $n_{\textrm{swaps}} \leftarrow 0$
  $i \leftarrow 0$
  while not converged do
    $(p, \mathbf{u}, \tilde{\sigma}) \sim \mathcal{D}$
    Sample programs $p_{1}, p_{2}$ in $\tilde{\sigma}$
    if $\tilde{\sigma}[p_{1}] \succ \tilde{\sigma}[p_{2}]$ and $\sigma[p_{1}] \prec \sigma[p_{2}]$ then
      Swap $\sigma[p_{1}], \sigma[p_{2}]$
      $n_{\textrm{swaps}} \leftarrow n_{\textrm{swaps}} + 1$
    end if
    $i \leftarrow i + 1$
    if $i \equiv 0 \mod V$ then
      $N_{\textrm{swaps}} \leftarrow N_{\textrm{swaps}} + [n_{\textrm{swaps}}]$, $n_{\textrm{swaps}} \leftarrow 0$
      if $\max N_{\textrm{swaps}}[-t:] - \min N_{\textrm{swaps}}[-t:] < T$ then
        converged $\leftarrow$ true
      end if
    end if
  end while
  return $\sigma$

4.2 Distillation via Annealing

The most straightforward representation of a ranking is an explicit list of programs $\sigma_{\textrm{global}} = [w_1, w_2, \ldots, w_n]$. We describe a process for finding an approximate global ranking using annealing. We repeatedly sample example-dependent rankings $\tilde{\sigma}_{\mathbf{u}}$ from $\mathcal{D}$, and update the global ranking $\sigma_{\textrm{global}}$ to match $\tilde{\sigma}_{\mathbf{u}}$ for a single pair of programs sampled from $\tilde{\sigma}_{\mathbf{u}}$. Since cycles exist among the example-dependent rankings, no single global ranking can agree with all of them and the swaps need not ever stop, so we terminate the annealing procedure once the number of swaps in a sliding window has stabilized (Algorithm 2). The resulting $\sigma_{\textrm{global}}$ is then used at inference time.
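The swap procedure of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `distill_ranking`, the representation of each example-dependent ranking as a list ordered from most to least preferred, the default values of `V`, `t`, and `T`, the `max_steps` safety cap, and the requirement that a full window of `t` validation counts be available before declaring convergence are all assumptions of this sketch.

```python
import random


def distill_ranking(dataset, programs, V=100, t=5, T=3, seed=0, max_steps=100_000):
    """Pairwise-swap annealing in the spirit of Algorithm 2.

    dataset:  list of example-dependent rankings, each a list of programs
              ordered from most to least preferred.
    programs: every program that should appear in the global ranking.
    V, t, T:  validation frequency, patience, and convergence threshold.
    max_steps is a safety cap that is not part of Algorithm 2.
    """
    rng = random.Random(seed)
    order = list(programs)
    rng.shuffle(order)                          # randomly initialized ranking
    pos = {p: i for i, p in enumerate(order)}   # smaller index = ranked higher
    swap_counts, n_swaps = [], 0

    for step in range(1, max_steps + 1):
        local = rng.choice(dataset)
        if len(local) < 2:
            continue
        i, j = sorted(rng.sample(range(len(local)), 2))
        p1, p2 = local[i], local[j]             # local ranking prefers p1 over p2
        if pos[p1] > pos[p2]:                   # global ranking disagrees: swap
            a, b = pos[p1], pos[p2]
            order[a], order[b] = order[b], order[a]
            pos[p1], pos[p2] = b, a
            n_swaps += 1
        if step % V == 0:                       # validation step
            swap_counts.append(n_swaps)
            n_swaps = 0
            window = swap_counts[-t:]
            # Assumption: wait for a full window before testing stability.
            if len(window) == t and max(window) - min(window) < T:
                break
    return order
```

When the sampled rankings are mutually consistent, every swap removes at least one inversion, so the procedure settles on an order that agrees with all of them; with cycles present it only stabilizes in the sliding-window sense described above.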

4.3 Distillation via Learning a Score Function

An alternative method to distill $\mathcal{D}$ is to train a score function $s_\theta : w \to \mathbb{R}$ that assigns a program $w$ a score that is independent of the specification $\mathbf{u}$. We can optimize $\theta$ to minimize disagreement with the generated dataset of example-dependent rankings, by minimizing the loss

$$
\mathcal{L}(\theta) = \underset{\substack{\tilde{\sigma}_{\mathbf{u}} \sim \mathcal{D} \\ w_1, w_2 \sim \tilde{\sigma}_{\mathbf{u}} :\ \tilde{\sigma}_{\mathbf{u}}[w_1] \succ \tilde{\sigma}_{\mathbf{u}}[w_2]}}{\mathbb{E}} -\log(\mathrm{sig}(s_\theta(w_1) - s_\theta(w_2)))
$$

where $\mathrm{sig}$ is the sigmoid function. This follows the standard approach to estimating a score function from a set of pairwise preferences (Bradley & Terry, 1952; Christiano et al., 2017). We parametrize $s_\theta$ as a small neural network that scores programs. To reduce variance, we fit an ensemble of score functions and use their average to rank the consistent programs at inference time (Christiano et al., 2017). Details of the neural models are in Appendix E.
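To make the pairwise loss concrete, the sketch below fits one scalar score per program by gradient descent on $-\log \mathrm{sig}(s(w_1) - s(w_2))$. It deliberately replaces the neural network $s_\theta$ with a lookup table, which is an assumption made for brevity; the names `pairwise_loss` and `fit_scores`, the learning rate, and the step count are likewise hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_loss(scores, pairs):
    """Mean of -log sig(s[w1] - s[w2]) over pairs in which w1 is preferred to w2."""
    total = sum(-math.log(sigmoid(scores[w1] - scores[w2])) for w1, w2 in pairs)
    return total / len(pairs)


def fit_scores(programs, pairs, lr=0.5, steps=200):
    """Tabular stand-in for s_theta: one free scalar per program."""
    scores = {w: 0.0 for w in programs}
    for _ in range(steps):
        grad = {w: 0.0 for w in programs}
        for w1, w2 in pairs:
            # For d = s1 - s2, the derivative of -log sig(d) with respect to
            # s1 is -(1 - sig(d)), and with respect to s2 it is +(1 - sig(d)).
            g = (1.0 - sigmoid(scores[w1] - scores[w2])) / len(pairs)
            grad[w1] -= g
            grad[w2] += g
        for w in programs:
            scores[w] -= lr * grad[w]           # plain gradient descent
    return scores
```

Sorting programs by the fitted scores in descending order yields a global ranking; an ensemble, as described above, would average several such score functions before sorting.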

Figure 4: Grammar for the regex domain

5 Experiments

To validate the accuracy and run-time of an approximate ranking listener, we perform two sets of experiments. First, we conduct a small ($n=8$) human experiment by building a ranking-based synthesizer in a regular expression synthesis domain where it is infeasible to run the RSA algorithm $L_1$ at interaction time. Second, we conduct two replay studies by simulating virtual users giving examples one after another, using human interaction data collected in prior work. We seek to answer the following questions: (Q1) Can ranking-based synthesizers accurately infer programs from humans (both in live interaction and in simulated replays)? (Q2) Are ranking-based synthesizers fast to run when compared to $L_0$ and $L_1$?

Metrics

In our experiments, the users (real or simulated) will be given a target program, and attempt to communicate it to the synthesizers using examples. The synthesizers will be measured on their communication accuracy — whether the synthesizers can infer the target program from the examples given. A synthesizer is better than another if it can recover the target program using fewer examples.

5.1 Interactive User Study

Figure 5: Success rate of the literal $L_0$ and ranking-based $L_{\textrm{anneal}}$ synthesizers at inferring the correct regex, as a function of the number of examples given (turn). $L_{\textrm{anneal}}$ achieves a success rate of 93.75%, while $L_0$ achieves only 65.63%. The ranking-based synthesizer also achieves higher success with fewer utterances. Bands indicate 95% CI over 24 regexes for each condition.

We conduct a user study where people interacted with both the ranking-based synthesizer distilled with annealing, $L_{\textrm{anneal}}$, and the literal synthesizer $L_0$ on the domain of regular expression synthesis.

The Regex Domain

The regex domain is a scaled-up version of that of Vaithilingam et al. (2023), which has a total of 350 regular expressions from their grammar (Figure 4). For this study, we expanded the space of programs to 3500 regular expressions from the same grammar, a setting in which running $L_1$ with RSA would make live interaction infeasible.

Procedure

We recruited 8 participants from our institution. Each participant was given a short tutorial on how to use the interface, then attempted to communicate a total of 4 regexes using examples. For each regex, the participant communicated with both the literal synthesizer $L_0$ and the ranking synthesizer $L_\sigma$, anonymized as simply a “green robot” and a “blue robot” in randomized order. The participants gave example strings one at a time until the synthesizer recovered the regex or they chose to give up. The communication was interactive: when the participant added a new example, they were immediately shown the current top-1 guess of the synthesizer, which allowed them to choose the next example accordingly.

Results: end-users interact well with an amortized ranking synthesizer (Q1)

Figure 5 shows the communication success rate over the number of given examples (turns) for both the literal and ranking-based synthesizers. We can see that (1) $L_{\textrm{anneal}}$ has a higher overall success rate with humans, and (2) it also achieves a higher success rate with fewer examples (Q1).

5.2 Simulated User Studies Using Replays

We evaluate the ranking-based synthesizers by replaying the interaction data collected from Vaithilingam et al. (2023) and Pu et al. (2020) – small pragmatic program synthesis domains where it is feasible to run $L_1$ with RSA.

Figure 6: Animal domain replay results. Fraction of successfully communicated target programs (success) vs number of examples given (turn). Bands are 95% confidence intervals across interactions (254 for $H_0$ and 291 for $H_1$). There are two kinds of simulated speakers: $H_0$ replays the interactions where participants communicated with a literal $L_0$ synthesizer in the original study, and $H_1$ replays those with a pragmatic $L_1$ synthesizer. On $H_0$ replay, $L_0$ performs worst (63.78%), while $L_{\textrm{anneal}}$ (71.65%), $L_1$ (74.80%), and $L_{\textrm{neural}}$ (78.34%) perform similarly to each other. On $H_1$ replay, $L_0$ (15.46%) performs worst, with $L_{\textrm{anneal}}$ (49.82%) in the middle, while $L_1$ (91.75%) and $L_{\textrm{neural}}$ (86.94%) perform best.
Figure 7: Regex domain replay results. Bands are 95% confidence intervals across interactions (60 interactions for both $H_0$ and $H_1$). On $H_0$ replay, $L_0$ (28.33%) performs worst, $L_{\textrm{neural}}$ (35.00%) slightly better, and $L_{\textrm{anneal}}$ (81.67%) and $L_1$ (88.33%) perform best. On $H_1$ replay we observe the same trend, with $L_0$ (13.33%), $L_{\textrm{neural}}$ (28.33%), $L_{\textrm{anneal}}$ (68.33%), and $L_1$ (81.67%) respectively.

Replay Data

In the human studies by Vaithilingam et al. (2023) and Pu et al. (2020), a human $H$ is given a target program $w$ and attempts to get the synthesizer ($L_0$ or $L_1$) to infer the target using a sequence of examples $\mathbf{u} = u_1, u_2, \dots$. Thus, two sets of data are generated: one where the human is interacting with the literal synthesizer $L_0$, which we term $H_0$, and one where the human is interacting with the pragmatic synthesizer $L_1$, which we term $H_1$. Specifically, from each domain we extract the dataset $\{(w, \mathbf{u}_i^j) \mid w \in W_s, j \in P, i \in \{0, 1\}\}$. Here, $W_s$ is the set of programs used for the human study (the stimuli), $P$ is the set of participants, and $i$ indicates whether the participant is communicating with $L_0$ or $L_1$.

Figure 8: The wall clock time for each synthesizer given different numbers of examples (turn). We see that $L_1$ is consistently much slower than either $L_{\textrm{anneal}}$ or $L_0$ in both domains. Note that time is on a logarithmic scale for the animals domain. The different slopes for $L_1$ (trending up for regex and trending down for animals) are due to an optimization of the $L_1$ synthesizer for the animals domain, which filters out invalid programs as a pre-processing step using $L_0$, so that it has to rank fewer programs in later turns.

Experiment Setup

We can simulate a user interaction by using the replay data. Given a datapoint $(w, \mathbf{u})$, we create a simulated user that iteratively gives the examples $u_1, u_2, \dots$ over multiple turns to communicate a given target program $w$. At every turn, the synthesizer returns its top-1 response, $L^{\textrm{top-1}}(u_1), L^{\textrm{top-1}}(u_1, u_2), \dots$, and we check whether it matches the target program $w$. If it does, we mark the communication as successful and stop early. Otherwise, we keep adding examples until $\mathbf{u}$ runs out, at which point we mark the communication as unsuccessful. Note that our evaluation cannot account for a user adapting their choice of examples to $L$, as the simulated user can only give scripted examples according to the replay data.
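The replay protocol above can be sketched as follows. This is an illustrative outline under stated assumptions, not the evaluation code used in the paper: `replay` and `success_rate_by_turn` are hypothetical names, and `listener` stands for any function that maps a list of examples to a top-1 program guess.

```python
def replay(target, examples, listener):
    """Return the 1-based turn at which `listener` first guesses `target`, else None.

    The simulated user reveals the scripted examples one at a time and the
    interaction stops early as soon as the top-1 guess equals the target.
    """
    for turn in range(1, len(examples) + 1):
        if listener(examples[:turn]) == target:
            return turn
    return None                      # examples ran out: unsuccessful


def success_rate_by_turn(interactions, listener, max_turns):
    """Fraction of interactions solved within k examples, for k = 1..max_turns.

    interactions: list of (target_program, list_of_examples) pairs.
    """
    solved_at = [replay(w, u, listener) for w, u in interactions]
    return [
        sum(1 for s in solved_at if s is not None and s <= k) / len(solved_at)
        for k in range(1, max_turns + 1)
    ]
```

Because the examples are scripted, the same replay data can be fed to several listeners in turn, which is what makes the per-turn success curves directly comparable.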

Domain 1: Animals

Pu et al. (2020) used a domain of grid patterns generated by an underlying domain-specific language (see Appendix for the grammar of the DSL and semantics). The space contains 17,976 semantically distinct programs and 343 possible examples, where a user uses a sequence of multiple examples to communicate a target program. They conducted a study with 48 human subjects, collecting data for 10 programs (10 distinct grid patterns). The data includes interactions between humans and both a literal synthesizer ($H_0 - L_0$) and a pragmatic synthesizer ($H_1 - L_1$). In total, there are 254 interactions from $H_0 - L_0$ and 291 interactions from $H_1 - L_1$, where each interaction consists of multiple turns until either the target program is successfully communicated or the user gives up.

Domain 2: Regular expressions

Vaithilingam et al. (2023) studied the usability of pragmatic program synthesizers in the domain of binary regular expressions. The space contains 350 distinct regular expressions. A sample of 2000 strings was used to compute the $S_1$ and $L_1$ distributions. Their study included 30 participants interacting with both $L_0$ and $L_1$ models. In total, there are 60 interactions from $H_0 - L_0$ and 60 interactions from $H_1 - L_1$, each consisting of multiple turns.

Result: rank-based synthesizers are comparable to $L_1$ in terms of communication accuracy with simulated users (Q1)

The replay study results are shown in Figure 6 (animals domain) and Figure 7 (regex domain). In each domain, there is a rank-based synthesizer that vastly outperforms the literal synthesizer $L_0$ and is close in performance to the pragmatic synthesizer $L_1$ derived from RSA.

The existence of a rank-based synthesizer (be it $L_{\textrm{anneal}}$ or $L_{\textrm{neural}}$) that matches the performance of $L_1$ entails that there exists some ranking of programs that effectively amortizes $L_1$ for either domain. For the animals domain, $L_{\textrm{neural}}$ is better able to discover an effective ranking, while $L_{\textrm{anneal}}$ is more effective at discovering the ranking for the regex domain. This is likely due to the difference in the sizes of the communicative datasets for the two domains (17,976 programs for the animals domain vs 350 for the regex domain), which makes it more feasible to learn a generalizable neural scoring function for the animals domain.

Result: rank-based synthesizers are orders of magnitude faster than $L_1$ (Q2)

For both domains, the ranking-based synthesizer is much faster than $L_1$, requiring approximately the same time as $L_0$ (Figure 8). This implies that most of the computation cost of a ranking-based synthesizer lies in coming up with consistent programs, which is the primary challenge of program synthesis, while the computation for ranking the top-$k$ programs can be made negligible in comparison (Q2).
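The cost argument can be seen in a small sketch of the ranking step itself. It assumes the candidates have already been produced and checked for consistency by some non-pragmatic synthesizer; the function name `ranked_top_k`, the dictionary representation of the global ranking, and the choice to place unranked programs last are assumptions of this illustration.

```python
def ranked_top_k(candidates, global_rank, k=1):
    """Order consistent candidate programs by a precomputed global rank.

    candidates:  programs already verified to be consistent with the examples.
    global_rank: dict mapping program -> position, where 0 is most preferred.
    Programs missing from global_rank are placed last (an assumption here).
    """
    worst = len(global_rank)
    return sorted(candidates, key=lambda w: global_rank.get(w, worst))[:k]


# Toy usage with the two regexes from the introduction.
rank = {r"\d{3}-\d{4}": 0, r".*": 1}
best = ranked_top_k([r".*", r"\d{3}-\d{4}"], rank, k=1)
```

The work at inference time is one dictionary lookup per candidate plus a sort over the candidates, with no recursive reasoning over the full space of programs and examples.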

6 RSA_single Can Be Distilled Completely

In this section, we prove a strong approximation result for a special case of RSA, RSA_single, where only a single example $u$ is used to communicate. In accordance with the terminology of Goodman & Frank (2016); Vogel et al. (2013); Monroe & Potts (2015); Smith et al. (2013) and Franke & Degen (2016), we use the term “hypothesis” instead of “program”. We prove that a global pragmatic ranking of hypotheses must exist for any listeners $L_0, L_1, \dots$ resulting from the RSA_single algorithm. (Footnote 4: One can derive the same result for the pragmatic ranking of speakers by taking a transpose of $M$.) In other words, the rankings over consistent hypotheses in these listeners are example-agnostic.

Theorem:

For a sequence of listeners in the RSA algorithm $L_0, L_1, \dots$ over a boolean-valued lexicon $M$, there exists a sequence of global pragmatic rankings $\sigma_{L_0}, \sigma_{L_1}, \dots$ such that:

$$
\begin{split}
\forall w, w', u.~\textbf{if}~L_i(w|u) > 0 \wedge L_i(w'|u) > 0. \\
\textbf{then}~L_i(w|u) > L_i(w'|u) \iff \sigma_{L_i}[w] \succ \sigma_{L_i}[w']
\end{split}
\tag{4}
$$

This means the partial rankings produced by any $L_i$ over consistent hypotheses are example-agnostic, where a global ranking preferring certain hypotheses unconditionally over others (e.g. a convention) is sufficient to explain the relative rankings of $L_i$ resulting from RSA_single.

Proof:

Let $M$ be a boolean lexicon with $m$ rows and $n$ columns. Let $r_0 = r_0^1 \dots r_0^m$ be the row-normalizing vector such that $r_0^j = (\sum M[j,:])^{-1}$, which is to say, each element $r_0^j$ is the normalization term for row $j$ of $L_0$. Let $*_{\leftrightarrow}$ denote row-wise multiplication:

$$
L_0 = M *_{\leftrightarrow} r_0
$$

Which is to say, starting from $M$, $L_0$ can be obtained by scaling each row $j$ by its respective normalization constant $r_0^j$. Let $c_1 = c_1^1 \dots c_1^n$ be the column-normalizing vector such that $c_1^j = (\sum L_0[:,j])^{-1}$, which is to say, each element $c_1^j$ is the normalization term for column $j$ of $S_1$. Similarly, let $*_{\updownarrow}$ denote column-wise multiplication:

$$
S_1 = L_0 *_{\updownarrow} c_1 = M *_{\leftrightarrow} r_0 *_{\updownarrow} c_1
$$

Computing $L_i$ under RSA amounts to applying row and column normalization alternately multiple times:

$$
L_i = M *_{\leftrightarrow} r_0 *_{\updownarrow} c_1 \dots *_{\updownarrow} c_{i-1} *_{\leftrightarrow} r_i
$$

Let $*$ be element-wise multiplication and let $\otimes$ be the outer product. We can then rearrange the terms:

$$
\begin{split}
L_i = {} & M * ((r_0 * \dots * r_i) \otimes (c_1 * \dots * c_{i-1})) \\
= {} & M * (r_{0 \dots i} \otimes c_{1 \dots i-1})
\end{split}
\tag{5}
$$

Here, $r_{0 \dots i} = r_0 * \dots * r_i$ is a vector of size $m$, and $c_{1 \dots i-1} = c_1 * \dots * c_{i-1}$ is a vector of size $n$. As we can see, following the RSA algorithm, $L_i$ can be decomposed into the element-wise product of two parts: the lexicon $M$, and a matrix that is formed by the outer product $r_{0 \dots i} \otimes c_{1 \dots i-1}$. (Footnote 5: Note that any prior over hypotheses and utterances can be similarly absorbed into these outer-product terms.)
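The decomposition can be checked numerically on a small lexicon. The sketch below is only an illustration under stated assumptions: `rsa_single` is a hypothetical helper, rows are taken to index examples $u$ and columns to index hypotheses $w$ (following the $M[u, w]$ convention used in the proof), every row and column of `M` is assumed to contain at least one 1, and `r` and `c` simply accumulate every row and column scaling applied, without reproducing the exact subscripts of Equation 5.

```python
def rsa_single(M, depth):
    """Alternate row and column normalization of a boolean lexicon M.

    M[u][w] = 1 means hypothesis w is consistent with example u.
    Returns the listener matrix after `depth` speaker/listener rounds,
    together with the accumulated row scalings r and column scalings c.
    """
    m, n = len(M), len(M[0])
    A = [[float(x) for x in row] for row in M]
    r, c = [1.0] * m, [1.0] * n

    def normalize_rows():
        for u in range(m):
            s = 1.0 / sum(A[u])
            r[u] *= s
            A[u] = [x * s for x in A[u]]

    def normalize_cols():
        for w in range(n):
            s = 1.0 / sum(A[u][w] for u in range(m))
            c[w] *= s
            for u in range(m):
                A[u][w] *= s

    normalize_rows()                 # literal listener L_0
    for _ in range(depth):
        normalize_cols()             # speaker step
        normalize_rows()             # listener step
    return A, r, c


M = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
L1, r, c = rsa_single(M, depth=1)
# Every entry of the listener equals M[u][w] * r[u] * c[w].
decomposed = [[M[u][w] * r[u] * c[w] for w in range(3)] for u in range(3)]
# Hypotheses ordered by decreasing accumulated column scaling.
global_order = sorted(range(3), key=lambda w: -c[w])
```

Within any single row, the factor `r[u]` is shared by all consistent hypotheses, so their relative order is governed by `c` alone, which is the observation the claim that follows builds on.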

Claim: The ordering of indices induced by $c_{1 \dots i-1}$ is the global pragmatic ranking $\sigma_{L_i}$:

$$\sigma_{L_i}[w] \succ \sigma_{L_i}[w'] \iff c_{1 \dots i-1}[w] > c_{1 \dots i-1}[w']$$

Proof: We show both directions of the $\iff$. Suppose that for some $w, w', u$, both $L_i(w|u) > 0$ and $L_i(w'|u) > 0$ (i.e. $M[u,w] = M[u,w'] = 1$).

(1) Show $\Rightarrow$: Suppose $L_i(w|u) > L_i(w'|u)$. We have

$$\begin{aligned} L_i(w|u) &= L_i[u,w] = r_{0 \dots i}[u] * c_{1 \dots i-1}[w] \\ L_i(w'|u) &= L_i[u,w'] = r_{0 \dots i}[u] * c_{1 \dots i-1}[w'] \end{aligned}$$

As $r_{0 \dots i}[u]$ is a positive constant shared by both expressions, we have

$$L_i(w|u) > L_i(w'|u) \Rightarrow c_{1 \dots i-1}[w] > c_{1 \dots i-1}[w'] \quad \square$$

(2) Show $\Leftarrow$: Suppose $c_{1 \dots i-1}[w] > c_{1 \dots i-1}[w']$.

$$\begin{aligned} c_{1 \dots i-1}[w] &> c_{1 \dots i-1}[w'] \\ M[u,w] * r_{0 \dots i}[u] * c_{1 \dots i-1}[w] &> M[u,w'] * r_{0 \dots i}[u] * c_{1 \dots i-1}[w'] \\ L_i[u,w] &> L_i[u,w'] \\ L_i(w|u) &> L_i(w'|u) \quad \square \end{aligned}$$

Thus, $c_{1 \dots i-1}$ is the global ranking $\sigma_{L_i}$ as claimed. $\blacksquare$

We check our proof using simulations on $10000$ randomly generated boolean lexicons, with sizes ranging from $10 \times 10$ to $20 \times 20$, running a chain of $100$ listeners on top of each. A total ordering can be found for all of them (Appendix B.1). We further study the stability of these ranks as they are formed, finding that the formed rankings tend to be stable across different RSA iterations (Appendix B.2).
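A scaled-down version of such a simulation can be sketched as follows. This is not the script from the authors' repository; it is a pure-Python illustration with far fewer and smaller lexicons than the $10000$ used above. It checks a necessary condition for a total ordering: no pair of hypotheses is strictly ordered one way under one utterance and the other way under another. The helper names, the tie tolerance `rel_tol`, the lexicon sizes, and the chain depth are all assumptions.

```python
import random


def listener_chain(M, depth):
    """Return the listener obtained by alternately normalizing rows and columns of M."""
    m, n = len(M), len(M[0])
    cur = [[float(x) for x in row] for row in M]

    def rows():
        for u in range(m):
            z = sum(cur[u])
            cur[u] = [x / z for x in cur[u]]

    def cols():
        for w in range(n):
            z = sum(cur[u][w] for u in range(m))
            for u in range(m):
                cur[u][w] /= z

    rows()
    for _ in range(depth):
        cols()
        rows()
    return cur


def strictly_greater(a, b, rel_tol=1e-9):
    """Compare with a relative tolerance so floating-point noise counts as a tie."""
    return a - b > rel_tol * max(abs(a), abs(b))


def no_pairwise_conflicts(L, M):
    """True if no two utterances strictly order a pair of hypotheses in opposite ways."""
    m, n = len(M), len(M[0])
    for w in range(n):
        for v in range(w + 1, n):
            prefers_w = prefers_v = False
            for u in range(m):
                if M[u][w] and M[u][v]:  # both hypotheses consistent with u
                    prefers_w |= strictly_greater(L[u][w], L[u][v])
                    prefers_v |= strictly_greater(L[u][v], L[u][w])
            if prefers_w and prefers_v:
                return False
    return True


def sample_lexicon(rng, size, p_true=0.5):
    """Sample a boolean lexicon with no all-zero row or column (needed to normalize)."""
    while True:
        M = [[int(rng.random() < p_true) for _ in range(size)] for _ in range(size)]
        if all(any(row) for row in M) and all(any(col) for col in zip(*M)):
            return M


rng = random.Random(0)
lexicons = [sample_lexicon(rng, rng.randint(4, 8)) for _ in range(50)]
results = [no_pairwise_conflicts(listener_chain(M, depth=20), M) for M in lexicons]
all_consistent = all(results)
```

The check deliberately does not use the column factors $c_{1 \dots i-1}$; it only inspects the listener matrix itself, so it gives independent evidence for the claim rather than restating the decomposition.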

7 Related Works

Scaling RSA without Global Ranking

Prior work such as that by Monroe et al. (2017) and Andreas & Klein (2016) has largely focused on sample-and-rerank as a way of scaling RSA, making the example-dependent ranking function $S_1(u|w)$ more efficient at a cost of accuracy. Recent work by Key et al. (2022) and Vaduguru et al. (2024) applies the sample-and-rerank approach to program synthesis, resulting in neural program synthesizers that also rank programs in an example-dependent way. Our work enables a different kind of synthesis algorithm altogether: a distilled pragmatic ranking that ranks consistent programs independently of the examples given. We view these works as complementary, since they can efficiently produce a simulated communication dataset $D$ from which our approach can distill.

Scaling RSA with Human Data

RSA has been applied to improve the performance of language interfaces in a variety of other domains, such as image description (Andreas & Klein, 2016; Cohn-Gordon et al., 2018a, b), instruction generation and interpretation (Fried et al., 2018a, b), and grounded interaction (Fried et al., 2021; Lin et al., 2022). These works all use speaker models trained on labeled data from people. Our approach requires no human-produced data, and can be run entirely from the lexicon $M$ of the synthesis problem. On the other hand, we can easily integrate human data within our approach by training similar speaker models on the collected interactive data.

Ranking Functions in Synthesis

Prior works on resolving ambiguity in program synthesis have relied on example-agnostic ranking functions. Works such as Singh & Gulwani (2015) and Polozov & Gulwani (2015) use scoring functions to penalize certain properties of programs (e.g. discouraging the use of constants), effectively inducing a global ranking over all programs; Ellis & Gulwani (2017) use a set of hand-crafted features to learn a naturalistic ranking from data. Synthesis algorithms that use a large neural code model to sample a large number of programs (Chen et al., 2021; Li et al., 2022) implicitly rank the programs based on their naturalistic distributions in the model's training data. Our work is unique in that (1) the learned ranking is rooted in efficient communication rather than hand-crafted features and (2) our approach does not require human-annotated data.

Other Theoretical Works on Ranking

Recent work by Muggleton FREng (2023) shows that in the single-example case, the MAP estimate of the learner can be completely ranked by $sz(H) + \ln g(H)$, an example-agnostic global ranking. Our work can be viewed as a strict generalization in the following sense: they consider the chain of recursive Bayesian reasoners of the form $M \rightarrow S_0 \rightarrow L_1$, whereas our result applies to any alternating chain of speakers and listeners of arbitrary depth. Their notions of “specificity” and “program length” also have direct analogies to the normalization terms in Equation 5, except that these analogies do not carry over to deeper recursive depths.

8 Conclusion

We present a way of amortizing the expensive RSA algorithm by an example-agnostic global ranking. We have shown that this amortization interacts well with humans when applied to two program synthesis domains. We have further proved that this amortization is exact in the case of communication with a single example. In addition to being a practical method for scaling up RSA, these findings may provide an alternative account of pragmatic behaviour in humans: one rooted in relative rankings of hypotheses (e.g. a pragmatic prior), perhaps distilled from the expensive RSA computation over time.

8.1 Limitation and Future Directions

Our approach has two limitations, both of which are open questions: first, whether an optimal global ranking exists for the multi-example PBE setting; second, whether our distillation algorithm can find this optimal ranking.

Existence of an effective global ranking

The effectiveness of a global ranking is limited by the number of cycles that exist in the communicative dataset of example-dependent rankings of subsets of programs. A cycle exists if under one ranking we have $w_a \succ w_b$, and under a different one we have $w_b \succ w_a$, which no single ranking can approximate exactly. Forecasting the number of cycles from the meaning matrix $M$ is an exciting direction for future work.
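To make the notion of a cycle concrete, the sketch below counts the pairs of programs that are ordered in both directions across a collection of rankings. It is an illustrative pure-Python example, not part of the authors' code; the function name `conflicting_pairs`, the representation of a ranking as a list ordered from most to least preferred, and the toy program names are assumptions.

```python
from itertools import combinations


def conflicting_pairs(rankings):
    """Return the unordered pairs that are ranked in both directions.

    Each ranking is a list of program identifiers, most preferred first,
    and may cover only a subset of all programs.
    """
    seen = set()  # (a, b) means a is ranked above b in at least one ranking
    for ranking in rankings:
        # combinations preserves list order, so a always precedes b here
        for a, b in combinations(ranking, 2):
            seen.add((a, b))
    return {frozenset((a, b)) for (a, b) in seen if (b, a) in seen}


# Toy dataset: the first two rankings disagree on w1 versus w3.
rankings = [
    ["w1", "w2", "w3"],
    ["w3", "w1"],
    ["w2", "w4"],
]
conflicts = conflicting_pairs(rankings)
```

Any global ranking must violate at least one of the two orderings for every pair returned here, so the size of this set gives a rough lower bound on how many pairwise preferences a single ranking fails to reproduce.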

Effectiveness of distilling an effective global ranking

Our experiments have shown that given a communicative dataset, both the annealing (in the case of a small dataset) and neural scoring (in the case of a larger dataset) have their merits in deriving a ranking. Thus, running the slow RSA in the dataset generation itself is the likely bottleneck. We believe recent works by Key et al. (2022) and Vaduguru et al. (2024) using sample-and-rerank may be used in generating the communicative dataset instead of the exact RSA algorithm.

Impact Statement

This work builds a system where end-users may use examples to generate programs. While the proposed method is more intuitive for humans to use, it is possible that for some interactions it may generate unexpected programs. It could therefore be dangerous when humans do not manually verify the generated program, as the program may have unintended outcomes when executed.

Acknowledgements

The authors would like to thank Kevin Ellis, Pei Wang, and Jesse Wang for preliminary explorations in this direction and insights to the proof. SV was partially supported by a gift from Autodesk Research. This material is based upon work supported by the NSF under Grant Nos. CCF-2123965 and IIS-2107391. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

References

  • Andreas & Klein (2016) Andreas, J. and Klein, D. Reasoning about pragmatics with neural listeners and speakers. arXiv preprint arXiv:1604.00562, 2016.
  • Balog et al. (2016) Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. ICLR, 2016.
  • Bergen et al. (2016) Bergen, L., Levy, R., and Goodman, N. Pragmatic reasoning through semantic inference. Semantics and Pragmatics, 9:ACCESS–ACCESS, 2016.
  • Bradley & Terry (1952) Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. ISSN 00063444. URL http://www.jstor.org/stable/2334029.
  • Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021.
  • Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.
  • Cohn-Gordon et al. (2018a) Cohn-Gordon, R., Goodman, N., and Potts, C. Pragmatically informative image captioning with character-level inference. arXiv preprint arXiv:1804.05417, 2018a.
  • Cohn-Gordon et al. (2018b) Cohn-Gordon, R., Goodman, N. D., and Potts, C. An incremental iterated response model of pragmatics. arXiv preprint arXiv:1810.00367, 2018b.
  • Ellis & Gulwani (2017) Ellis, K. and Gulwani, S. Learning to learn programs from examples: Going beyond program structure. IJCAI, 2017.
  • Feser et al. (2015) Feser, J. K., Chaudhuri, S., and Dillig, I. Synthesizing data structure transformations from input-output examples. PLDI ’15, pp.  229–239, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334686. doi: 10.1145/2737924.2737977. URL https://doi.org/10.1145/2737924.2737977.
  • Frank & Goodman (2012) Frank, M. C. and Goodman, N. D. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012.
  • Franke & Degen (2016) Franke, M. and Degen, J. Reasoning in reference games: Individual-vs. population-level probabilistic modeling. PloS one, 11(5):e0154854, 2016.
  • Fried et al. (2018a) Fried, D., Andreas, J., and Klein, D. Unified pragmatic models for generating and following instructions. In Proceedings of North American Chapter of the Association for Computational Linguistics, 2018a.
  • Fried et al. (2018b) Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., and Darrell, T. Speaker-follower models for vision-and-language navigation. Advances in Neural Information Processing Systems, 31, 2018b.
  • Fried et al. (2021) Fried, D., Chiu, J. T., and Klein, D. Reference-centric models for grounded collaborative dialogue. arXiv preprint arXiv:2109.05042, 2021.
  • Goodman & Frank (2016) Goodman, N. D. and Frank, M. C. Pragmatic language interpretation as probabilistic inference. Trends in cognitive sciences, 20(11):818–829, 2016.
  • Gulwani (2011) Gulwani, S. Automating string processing in spreadsheets using input-output examples. SIGPLAN Not., 46(1):317–330, jan 2011. ISSN 0362-1340. doi: 10.1145/1925844.1926423. URL https://doi.org/10.1145/1925844.1926423.
  • Key et al. (2022) Key, D., Li, W.-D., and Ellis, K. I speak, you verify: Toward trustworthy neural program synthesis. arXiv preprint arXiv:2210.00848, 2022.
  • Lewis (1979) Lewis, D. Scorekeeping in a language game. Journal of philosophical logic, 8:339–359, 1979.
  • Li et al. (2022) Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022.
  • Lin et al. (2022) Lin, J., Fried, D., Klein, D., and Dragan, A. Inferring rewards from language in context. arXiv preprint arXiv:2204.02515, 2022.
  • Monroe & Potts (2015) Monroe, W. and Potts, C. Learning in the rational speech acts model. arXiv preprint arXiv:1510.06807, 2015.
  • Monroe et al. (2017) Monroe, W., Hawkins, R. X., Goodman, N. D., and Potts, C. Colors in context: A pragmatic neural model for grounded language understanding. Transactions of the Association for Computational Linguistics, 5:325–338, 2017.
  • Muggleton FREng (2023) Muggleton FREng, S. Hypothesizing an algorithm from one example: the role of specificity. Philosophical Transactions of the Royal Society A, 381(2251):20220046, 2023.
  • Polosukhin & Skidanov (2018) Polosukhin, I. and Skidanov, A. Neural program search: Solving programming tasks from description and examples. arXiv preprint arXiv:1802.04335, 2018.
  • Polozov & Gulwani (2015) Polozov, O. and Gulwani, S. Flashmeta: A framework for inductive program synthesis. ACM SIGPLAN Notices, 50(10):107–126, 2015.
  • Pu et al. (2020) Pu, Y., Ellis, K., Kryven, M., Tenenbaum, J., and Solar-Lezama, A. Program synthesis with pragmatic communication. Advances in Neural Information Processing Systems, 33:13249–13259, 2020.
  • Singh & Gulwani (2015) Singh, R. and Gulwani, S. Predicting a correct program in programming by example. In CAV, pp.  398–414. Springer, 2015.
  • Smith et al. (2013) Smith, N. J., Goodman, N., and Frank, M. Learning and using language via recursive pragmatic reasoning about other agents. Advances in neural information processing systems, 26, 2013.
  • Solar-Lezama (2008) Solar-Lezama, A. Program synthesis by sketching. PhD thesis, USA, 2008. AAI3353225.
  • Solar-Lezama et al. (2006) Solar-Lezama, A., Tancau, L., Bodik, R., Seshia, S., and Saraswat, V. Combinatorial sketching for finite programs. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, pp.  404–415, 2006.
  • Vaduguru et al. (2022) Vaduguru, S., Ellis, K., and Pu, Y. Efficient pragmatic program synthesis with informative specifications, 2022.
  • Vaduguru et al. (2024) Vaduguru, S., Fried, D., and Pu, Y. Generating pragmatic examples to train neural program synthesizers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=yxKZGQLzOP.
  • Vaithilingam et al. (2023) Vaithilingam, P., Pu, Y., and Glassman, E. L. The usability of pragmatic communication in regular expression synthesis, 2023.
  • Vogel et al. (2013) Vogel, A., Bodoia, M., Potts, C., and Jurafsky, D. Emergence of gricean maxims from multi-agent decision theory. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp.  1072–1081, 2013.

Appendix A Code and Assets

All simulation and replay results are available in this repository: https://github.com/evanthebouncy/pragmatic_synthesis_ranking/tree/main

Appendix B Simulated Studies

B.1 Ranking Always Exists

We empirically validate that in the case of single utterances, a ranking can always be found. See simulation/single_utter/exp_exists_orders.py

B.2 Stability of Ranks Across RSA Iterations

We’ve shown that for every $L_0, L_1, \dots$, there exists a corresponding global, utterance-agnostic ranking $\sigma_{L_0}, \sigma_{L_1}, \dots$. We now explore the relationship between these rankings as a function of the RSA iteration $i$. Specifically, how stable is the relative ranking of $w$ and $w'$ once it is formed?

Stable Order

A pair-wise order between $w$ and $w'$ is stable from iteration $i$ onward if:

$$stable(i, w \succ w') \iff \bigwedge_{j \in i, i+1, \dots, \infty} \sigma_{L_j}[w] \succ \sigma_{L_j}[w']$$

This means that the relative ranking $\sigma_{L_i}[w] \succ \sigma_{L_i}[w']$ holds for every subsequent iteration up to $\sigma_{L_\infty}$. Let the minimal-index of a stable pair-wise ordering be the first iteration $i$ such that $w \succ w'$ becomes stable:

$$i_{\min}(w \succ w') = \text{argmin}_j \, stable(j, w \succ w') \tag{6}$$

As $\sigma_{L_1}$ is the first time any ranking can exist ($L_0$ is a uniform distribution over valid hypotheses, i.e. no rankings), we explore the following: For a lexicon $M$, what fraction of stable orderings have a minimal-index of 1?

$$\text{frac-stable}_{L_1}(M) = \frac{|\{w \succ w' ~|~ i_{\min}(w \succ w') = 1\}|}{|\{w \succ w' ~|~ \exists i.~ stable(i, w \succ w')\}|} \tag{7}$$
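The definitions of the minimal-index and of $\text{frac-stable}_{L_1}$ can be sketched for a finite run of RSA as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the ranking at each iteration is represented by a list of per-hypothesis scores (higher means preferred), the last computed iteration stands in for $\sigma_{L_\infty}$, and the names `min_stable_index` and `frac_stable_l1` as well as the toy scores are hypothetical.

```python
def min_stable_index(scores, w, v):
    """Smallest 1-based iteration from which w stays ranked above v.

    scores[j - 1][x] is the score of hypothesis x at iteration j.
    Returns None if w is not ranked above v at the final iteration.
    """
    i_min = None
    for j in range(len(scores), 0, -1):  # walk backwards from the last iteration
        if scores[j - 1][w] > scores[j - 1][v]:
            i_min = j
        else:
            break
    return i_min


def frac_stable_l1(scores):
    """Fraction of stable ordered pairs whose minimal-index is 1.

    Assumes at least one pair is strictly ordered at the final iteration.
    """
    n = len(scores[0])
    indices = [min_stable_index(scores, w, v)
               for w in range(n) for v in range(n) if w != v]
    stable = [i for i in indices if i is not None]
    return sum(1 for i in stable if i == 1) / len(stable)


# Toy scores for three hypotheses over three iterations.
scores = [
    [0.5, 0.3, 0.2],  # iteration 1: 0 > 1 > 2
    [0.5, 0.2, 0.3],  # iteration 2: 0 > 2 > 1
    [0.6, 0.1, 0.3],  # iteration 3: 0 > 2 > 1
]
fraction = frac_stable_l1(scores)
```

In this toy run the orderings $0 \succ 1$ and $0 \succ 2$ hold from iteration 1 onward, while $2 \succ 1$ only becomes stable at iteration 2, so two of the three stable orderings have a minimal-index of 1.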

Simulation

We measure $\text{frac-stable}_{L_1}(M)$ on a population of randomly sampled boolean lexicons. We sample square lexicons of size $lexicon\_size \in 2 \times 2 \dots 100 \times 100$. Each lexicon is sampled with $Ptrue \in \{0.1, 0.2, 0.5\}$, where a larger value of $Ptrue$ gives the lexicon more 1s. We make sure each sampled lexicon is valid in the following sense: (1) all rows are unique, i.e. every utterance communicates a unique subset of valid hypotheses; (2) all columns are unique, i.e. every hypothesis has a unique set of utterances that can refer to it. For every combination of $(Ptrue, lexicon\_size)$ we randomly sample 100 lexicons. As it is infeasible to run RSA until iteration $\infty$, we run RSA for 100 iterations for each lexicon (i.e. $L_{100} \approx L_\infty$). We measure $\text{frac-stable}_{L_1}$ for each sampled lexicon. The result is shown in Figure 9.
As we can see, of all the stable pair-wise orderings, a large fraction ($> 0.8$) are formed during $\sigma_{L_1}$. This is increasingly true as we (1) increase $Ptrue$, so that the boolean lexicon has more 1s and is therefore more ambiguous for a literal speaker and listener, and (2) increase $lexicon\_size$. We suspect this is due to a faster “mixing time” of the RSA algorithm under these conditions, but this remains a conjecture.
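The validity conditions on sampled lexicons can be enforced by rejection sampling, as in the sketch below. This is an assumed procedure written for illustration; the authors' sampling code lives in their repository and may differ. The function name `sample_valid_lexicon`, the `max_tries` cap, and the additional requirement that no row or column is entirely zero (added here so that every normalization is well defined) are assumptions not stated in the text.

```python
import random


def sample_valid_lexicon(size, p_true, rng, max_tries=10000):
    """Rejection-sample a size x size boolean lexicon.

    Accepts a sample only if all rows are unique and all columns are unique.
    As an extra assumption, also rejects samples with an all-zero row or column.
    """
    for _ in range(max_tries):
        M = [[int(rng.random() < p_true) for _ in range(size)]
             for _ in range(size)]
        rows = [tuple(row) for row in M]
        cols = [tuple(col) for col in zip(*M)]
        unique = len(set(rows)) == size and len(set(cols)) == size
        nonempty = all(any(row) for row in rows) and all(any(col) for col in cols)
        if unique and nonempty:
            return M
    raise RuntimeError("no valid lexicon found; try a larger size or a different p_true")


lexicon = sample_valid_lexicon(size=6, p_true=0.5, rng=random.Random(0))
```

Rejection sampling is simple but can become slow when valid lexicons are rare, for example for very small sizes combined with a low `p_true`, which is why the sketch caps the number of attempts.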

Takeaway: This study may provide an alternative explanation as to why humans do not perform RSA for more than a few iterations (Franke & Degen, 2016). In addition to being computationally expensive, further iterations are also unnecessary, as the majority of top-k orderings become available at $\sigma_{L_1}$ and remain stable for all subsequent iterations of the RSA algorithm. In other words, $L_1^{top-k} \cong L_{i>1}^{top-k}$. Code in simulation/single_utter

Figure 9: Fraction of stable orders that were formed in $\sigma_{L_1}$ as a function of increasing lexicon size. Points are raw samples (n=100 per lexicon size and Ptrue), bars are 95% bootstrapped CI (nboot = 1000). Overall, increasing Ptrue and lexicon size increases the fraction of stable orders that were formed in $\sigma_{L_1}$.

Appendix C Animals domain

In the Animals domain, a program is a pattern on a grid formed from a set of objects. These objects may be a colourless pebble, or a chicken or pig that may be red, green or blue. An utterance reveals one square on the grid, and the speaker has to communicate the pattern by choosing which square to reveal. The pattern is formed according to rules specified in the domain-specific language in Figure 10. Examples of programs are shown in Figure 11. The description of the domain-specific language and the examples are due to Vaduguru et al. (2022).

Appendix D Human study interface

The interface for the human study on regular expression programs is shown in Figure 12.

$$\begin{aligned}
\texttt{Program} &\to \langle \texttt{Shape, Colour} \rangle \\
\texttt{Shape} &\to \texttt{Box(Left, Right, Top, Bottom, Thickness, Outside, Inside)} \\
\texttt{Left} &\to \texttt{0 | 1 | 2 | 3 | ... | 6} \\
\texttt{Right} &\to \texttt{0 | 1 | 2 | 3 | ... | 6} \\
\texttt{Top} &\to \texttt{0 | 1 | 2 | 3 | ... | 6} \\
\texttt{Bottom} &\to \texttt{0 | 1 | 2 | 3 | ... | 6} \\
\texttt{Thickness} &\to \texttt{1 | 2 | 3} \\
\texttt{O} &\to \texttt{chicken | pig} \\
\texttt{I} &\to \texttt{chicken | pig | pebble} \\
\texttt{Colour} &\to \texttt{[red , green , blue][A}_2\texttt{(A}_1\texttt{)]} \\
\texttt{A}_1 &\to \texttt{x | y | x + y} \\
\texttt{A}_2 &\to \lambda\texttt{z:0 | } \lambda\texttt{z:1 | } \lambda\texttt{z:2 | } \lambda\texttt{z:z\%2 | } \lambda\texttt{z:z\%2+1 | } \lambda\texttt{z:2*(z\%2)}
\end{aligned}$$

Figure 10: Grammar of the DSL
(a) $[\cdot, \cdot, \boxed{\texttt{1}}, \texttt{5}, \texttt{1}, \texttt{6}, \texttt{2}, \boxed{\texttt{chicken}}, \texttt{pebble}, \cdot, \boxed{\texttt{x}}, \lambda\texttt{z:z\%2}]$
(b) $[\cdot, \cdot, \boxed{\texttt{0}}, \texttt{5}, \texttt{1}, \texttt{6}, \texttt{2}, \boxed{\texttt{pig}}, \texttt{pebble}, \cdot, \boxed{\texttt{y}}, \lambda\texttt{z:z\%2}]$
Figure 11: Two patterns in our layout domain and their corresponding programs, represented as a sequence of production rules: [Program, Shape, Left, Right, Top, Bottom, Thickness, O, I, Colour, A1, A2]. The symbol $\cdot$ indicates rules that have only one choice of expansion (Program, Shape, and Colour). The rules where these two programs differ are marked with a box.
Figure 12: User interface for the regex domain

Appendix E Neural model

The neural scoring model maps a program to a real number. The program is input as a vector encoding the productions of the grammar that produce the program. That is, we construct a vector of the index of the production that is used to expand each non-terminal in the DSL grammar. We then convert this vector to a one-hot matrix. There are 12 rules, with any single rule having at most 7 possible expansions, resulting in an input vector of dimension $12 \times 7 = 84$. The input is then passed through 3 hidden layers of size 128, each of which has a ReLU activation, and then mapped to a scalar output with a linear layer.
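The encoding and architecture described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the names `GRAMMAR`, `encode`, `init_model` and `score`, the string labels used for the expansions, and the random weight initialisation are all assumptions made for the example.

```python
import random

# One entry per rule, in the order used in Figure 11.
# The expansion labels are illustrative strings, not the authors' identifiers.
GRAMMAR = [
    ("Program", ["<Shape, Colour>"]),
    ("Shape", ["Box"]),
    ("Left", [0, 1, 2, 3, 4, 5, 6]),
    ("Right", [0, 1, 2, 3, 4, 5, 6]),
    ("Top", [0, 1, 2, 3, 4, 5, 6]),
    ("Bottom", [0, 1, 2, 3, 4, 5, 6]),
    ("Thickness", [1, 2, 3]),
    ("O", ["chicken", "pig"]),
    ("I", ["chicken", "pig", "pebble"]),
    ("Colour", ["[red, green, blue][A2(A1)]"]),
    ("A1", ["x", "y", "x + y"]),
    ("A2", ["z:0", "z:1", "z:2", "z:z%2", "z:z%2+1", "z:2*(z%2)"]),
]
MAX_CHOICES = 7  # the largest number of expansions of any single rule


def encode(program):
    """Flatten the 12 x 7 one-hot matrix of production indices into 84 floats."""
    assert len(program) == len(GRAMMAR)
    vec = []
    for (_, choices), value in zip(GRAMMAR, program):
        row = [0.0] * MAX_CHOICES
        row[choices.index(value)] = 1.0
        vec.extend(row)
    return vec


def init_layer(n_in, n_out, rng):
    """Random dense layer (assumed initialisation, chosen only for the demo)."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_model(rng, hidden=128, depth=3):
    """84 -> 128 -> 128 -> 128 -> 1, as described in the text."""
    sizes = [len(GRAMMAR) * MAX_CHOICES] + [hidden] * depth + [1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def linear(layer, x):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def score(model, program):
    """ReLU after each hidden layer, then a linear map to a scalar."""
    h = encode(program)
    for layer in model[:-1]:
        h = [max(0.0, v) for v in linear(layer, h)]
    return linear(model[-1], h)[0]


# Program (a) from Figure 11, written with the labels above.
PROGRAM_A = ["<Shape, Colour>", "Box", 1, 5, 1, 6, 2, "chicken", "pebble",
             "[red, green, blue][A2(A1)]", "x", "z:z%2"]
model = init_model(random.Random(0))
print(len(encode(PROGRAM_A)), score(model, PROGRAM_A))
```

Rules with fewer than 7 expansions simply leave the remaining columns of their row at zero, which is why the input dimension is fixed at 84 regardless of the program.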

The model is trained on a dataset of rankings of the form $\mathcal{D} = \{(w, \mathbf{u}, \tilde{\sigma}_{\mathbf{u}})\}$. For each program $w$, we sample a pair of programs from the inferred ranking $\tilde{\sigma}_{\mathbf{u}}$ and use this pair to compute the loss function for this sample. We train the model for a maximum of 20 epochs, where one epoch of training corresponds to presenting the model with every element in $\mathcal{D}$ once. We train with a batch size of 32 using the Adam optimizer. We use a validation set generated similarly to $\mathcal{D}$ (on a disjoint set of programs) to perform validation, choosing the model that results in the highest synthesis accuracy on this validation dataset with synthetically produced examples (from the $S_1$ speaker model).
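The pair-sampling step can be illustrated with a short sketch. The appendix does not restate the loss, so the example assumes a standard pairwise logistic loss, $-\log \sigma(s_{\text{hi}} - s_{\text{lo}})$, and assumes that each ranking is a strict list ordered from most to least preferred. The function names `sample_pair` and `pairwise_logistic_loss`, and the toy scores, are hypothetical.

```python
import math
import random


def sample_pair(ranking, rng):
    """Sample two distinct programs; return (higher_ranked, lower_ranked).

    Assumes `ranking` is a list ordered from most to least preferred.
    """
    i, j = sorted(rng.sample(range(len(ranking)), 2))
    return ranking[i], ranking[j]


def pairwise_logistic_loss(score_hi, score_lo):
    """-log(sigmoid(score_hi - score_lo)), written to avoid overflow."""
    margin = score_hi - score_lo
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy example: scores that a model might currently assign to three programs.
toy_scores = {"p1": 2.0, "p2": 0.5, "p3": -1.0}
toy_ranking = ["p1", "p2", "p3"]  # assumed ranking, best first

rng = random.Random(0)
hi, lo = sample_pair(toy_ranking, rng)
loss = pairwise_logistic_loss(toy_scores[hi], toy_scores[lo])
print(hi, lo, loss)
```

The loss is small when the model already scores the higher-ranked program above the lower-ranked one, and grows roughly linearly with the margin when the order is reversed.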

We train an ensemble of 10 models. For each model, we normalize the scores to be of zero mean and unit variance based on the empirical mean and standard deviation computed on the validation set. We then average the scores for the 10 models at inference time.
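The normalisation and averaging step can be sketched as follows. This is an illustrative example only: the helper names `fit_normalizer` and `ensemble_score` are hypothetical, the example uses two models rather than 10 for brevity, and the use of the population standard deviation is an assumption.

```python
import statistics


def fit_normalizer(validation_scores):
    """Empirical mean and standard deviation of one model's validation scores.

    Uses the population standard deviation (an assumption for this sketch).
    """
    mean = statistics.fmean(validation_scores)
    std = statistics.pstdev(validation_scores)
    if std == 0.0:
        raise ValueError("validation scores have zero variance")
    return mean, std


def ensemble_score(raw_scores, normalizers):
    """Average of per-model z-scores for a single candidate program."""
    z_scores = [(s - mean) / std for s, (mean, std) in zip(raw_scores, normalizers)]
    return statistics.fmean(z_scores)


# Two toy models whose raw scores live on very different scales.
normalizers = [
    fit_normalizer([1.0, 2.0, 3.0]),
    fit_normalizer([10.0, 20.0, 30.0]),
]

# Raw scores of each candidate under each model (hypothetical values).
candidates = {"prog_a": [3.0, 30.0], "prog_b": [2.0, 20.0], "prog_c": [1.0, 30.0]}
ranked = sorted(candidates,
                key=lambda p: ensemble_score(candidates[p], normalizers),
                reverse=True)
print(ranked)
```

Normalising before averaging keeps a model with a large output scale from dominating the ensemble, so each model contributes comparably to the final ranking.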