Amortizing Pragmatic Program Synthesis with Rankings
Abstract
The Rational Speech Acts (RSA) framework has been successful in building pragmatic program synthesizers that return programs which, in addition to being logically consistent with user-generated examples, account for the fact that a user chooses their examples informatively. We present a general method of amortizing the slow, exact RSA synthesizer. Our method first queries the exact RSA synthesizer to compile a communication dataset. The dataset contains a number of example-dependent rankings of subsets of programs. Our method then distills a single global ranking of all programs as an approximation to every ranking in the dataset. This global ranking is then used at inference time to rank multiple logically consistent candidate programs generated from a fast, non-pragmatic synthesizer. Experiments on two program synthesis domains using our ranking method resulted in orders-of-magnitude speedups compared to the exact RSA synthesizer, while being more accurate than a non-pragmatic synthesizer when communicating with humans. Finally, we prove that in the special case of synthesis from a single example, this approximation is exact.
1 Introduction
For intelligent systems to be accessible to end users, it is important that they can infer the user’s intent under ambiguity. Imagine a person asking an AI assistant to generate a regular expression that matches the string 123-7890. It would be unhelpful if the AI assistant simply returned the regular expression that matches all strings, although it is technically correct. The rational speech acts model (RSA) of pragmatics (Frank & Goodman, 2012) gives an algorithm for resolving ambiguities by modeling the user as a speaker that chooses informative examples for the system, via recursive Bayesian reasoning. Given several competing responses, for instance $h_1$ = \d{3}-\d{4} and $h_2$ = the match-all expression, RSA would reason that it is more likely that an informative user would use the example 123-7890 to describe $h_1$ over $h_2$, allowing it to prefer the intended regex. Recent works (Pu et al., 2020; Vaithilingam et al., 2023) have leveraged the RSA algorithm to build pragmatic program synthesizers: interactive systems that take in user-given examples (e.g. strings) and return programs (e.g. regexes) that are both logically consistent and take into account the informativity of the chosen examples. Their algorithm, which we refer to as RSA, is applicable to any program synthesis domain where programs can be efficiently enumerated (Feser et al., 2015; Solar-Lezama, 2008; Gulwani, 2011), and produces a pragmatic synthesizer which interacts well with humans, while requiring no labeled human data.
The RSA algorithm marginalizes across all possible examples (e.g. all strings) and programs (e.g. all regexes) multiple times. This makes it difficult to scale RSA to large domains, where users expect the system to complete its inference in real-time. Prior works in scaling up RSA computation (Monroe et al., 2017; Andreas & Klein, 2016) have largely focused on sampling and re-ranking, curbing RSA’s computation to a small subset of programs and examples. In this work, we show a simple yet effective way of amortizing RSA via a single global ranking of all programs. Rather than using RSA directly at inference time, our method uses it to generate training data in the form of example-dependent rankings of subsets of programs. We then distill a global ranking from the training data, amortizing the computation of RSA (Figure 1). At inference time, a fast, non-pragmatic synthesizer is used to propose multiple logically consistent programs, and the global ranking is used to quickly rank them, resulting in a pragmatic yet efficient synthesizer. (Footnote 1: In our example, the match-all regex would be ranked lower than other consistent programs.)
This work makes the following contributions. (1) We describe a general method of amortizing the RSA algorithm (considered in Cohn-Gordon et al. (2018b); Pu et al. (2020); Vaithilingam et al. (2023)) applicable to any pragmatic program synthesis domain. (2) Using a global ranking, we scale the model proposed by Vaithilingam et al. (2023) to a larger domain while still allowing for real-time interaction. We conduct a small user study validating that end-users are more accurate when communicating with a ranking-based program synthesizer than with a non-pragmatic one (+27%, +41% relative). (3) We conduct simulated user studies by replaying the human interactive synthesis data collected from Pu et al. (2020) and Vaithilingam et al. (2023). We confirm that our ranking-based synthesizer retains the communicative accuracy of RSA (55%, 92% respectively), while running orders of magnitude (over 100 times) faster. (4) We prove that in the special case of synthesis from just a single example, RSA_single, a setting studied in the original RSA literature (Goodman & Frank, 2016; Vogel et al., 2013; Monroe & Potts, 2015; Smith et al., 2013), the approximation using a global ranking is exact.
2 Background on Pragmatic Synthesis
In this section, we provide background on a reference game framework of program synthesis, which affords building a pragmatic synthesizer that can infer a user’s intended program from few examples (Pu et al., 2020). We illustrate this framework using a toy example from a small version of the regular expression domain of this work.
2.1 Synthesis as a Reference Game
Consider the problem where a user gives example strings to a synthesis system, and asks it to find a matching regular expression. This process can be modeled as a reference game (Lewis, 1979), where a speaker (the user) chooses a few utterances (strings) to give to the listener (the synthesizer), with the intention that the listener can infer the correct hypothesis (regular expression). This reference game is characterized by the lexicon $M$, a boolean matrix of 1s and 0s (Figure 2). In $M$, each row corresponds to an utterance/example $u$ and each column corresponds to a hypothesis/program $h$. An entry $M[u, h] = 1$ indicates that the utterance is consistent with the hypothesis, that is, that the program’s output (e.g. deciding whether a regular expression matches a string) is consistent with the example (e.g. the string). As we can see, a given utterance (such as 001) may be consistent with multiple hypotheses (0+1{1}, 0{2}1+, and 0+1*).
2.2 A Literal Program Synthesizer
How might we build a system that takes an utterance $u$ (say 01) and produces the intended hypothesis $h$ (here 0+1{1})? As 01 is consistent with multiple hypotheses (0+1{1} and 0+1*), a naive strategy is to treat all consistent hypotheses as equally likely, scaled by a prior distribution over hypotheses $P(h)$. Writing $M[u, h] = 1$ when utterance $u$ is consistent with hypothesis $h$ in the lexicon $M$ of Section 2.1:
$$L_0(h \mid u) \propto P(h)\, M[u, h] \tag{1}$$
$$L_0(h \mid u) = \frac{P(h)\, M[u, h]}{\sum_{h'} P(h')\, M[u, h']} \tag{2}$$
A synthesizer built this way is a literal listener $L_0$ (Bergen et al., 2016). Assuming the prior is uniform over programs, we can construct it by normalizing the rows of the matrix $M$, resulting in a probability distribution over hypotheses given utterances (Figure 2). As we can see, given the utterance 01, this listener predicts an equal probability of 0+1{1} and 0+1* being the intended program.
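The row normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three regexes and four candidate strings form a hypothetical toy lexicon (not necessarily the one in Figure 2), the prior is uniform, and the helper names `build_lexicon` and `normalize_rows` are ours.

```python
import re

# Hypothetical toy domain: three regexes (hypotheses) and four strings (utterances).
HYPOTHESES = ["0+1{1}", "0{2}1+", "0+1*"]
UTTERANCES = ["0", "01", "001", "0011"]


def build_lexicon(utterances, hypotheses):
    """Boolean lexicon: entry [u][h] is 1.0 if regex h fully matches string u."""
    return [[1.0 if re.fullmatch(h, u) else 0.0 for h in hypotheses]
            for u in utterances]


def normalize_rows(matrix):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total if total > 0 else 0.0 for x in row])
    return result


lexicon = build_lexicon(UTTERANCES, HYPOTHESES)
literal_listener = normalize_rows(lexicon)  # uniform prior over programs

# Distribution over hypotheses after hearing the utterance "01".
row_01 = literal_listener[UTTERANCES.index("01")]
print(dict(zip(HYPOTHESES, row_01)))
```

In this toy lexicon the utterance 01 is consistent with 0+1{1} and 0+1* only, so the literal listener splits its probability evenly between them.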
2.3 A Pragmatic Synthesizer from a Single Example
A key insight to improving on the literal synthesizer is to consider that a user is cooperatively choosing an utterance to be informative about the intended program to the synthesizer. The Rational Speech Acts (RSA) framework models this informative choice of utterances using recursive Bayesian reasoning (Frank & Goodman, 2012). By reasoning about why a speaker (user) might have chosen a particular utterance (example), rather than possible alternatives, the listener (synthesizer) can disambiguate the hypothesis (program) to which the speaker was referring. Formally, the RSA framework produces a chain of alternating listeners and speakers beginning with the literal listener $L_0$ above.
$$S_1(u \mid h) \propto L_0(h \mid u)\, P(u), \qquad L_1(h \mid u) \propto S_1(u \mid h)\, P(h) \tag{3}$$
With uniform priors, applying this framework amounts to normalizing the columns of the matrix $L_0$ to obtain a pragmatic speaker distribution $S_1$, then normalizing the rows of $S_1$ to obtain a pragmatic listener (synthesizer) $L_1$ (Figure 2). As we can see, given the utterance 01, this listener prefers 0+1{1} over 0+1*, reflecting the reasoning that if the user wanted to refer to 0+1*, they might have provided an example that highlights the possibility of no 1s in the string. In this paper, we shall call this algorithm RSA_single. As this algorithm only depends on the lexicon $M$, it is applicable to all program synthesis domains where programs and examples can be effectively enumerated.
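As a rough illustration of the column-then-row normalization just described, the following sketch runs RSA_single on a small hypothetical lexicon. The matrix, the uniform priors, and the function names (`rsa_single`, `normalize_cols`) are assumptions made for this example and are not taken from the paper's implementation.

```python
def normalize_rows(matrix):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total if total > 0 else 0.0 for x in row])
    return result


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def normalize_cols(matrix):
    """Column normalization, implemented as row normalization of the transpose."""
    return transpose(normalize_rows(transpose(matrix)))


def rsa_single(lexicon, depth=1):
    """Return the depth-th pragmatic listener over a boolean lexicon (uniform priors)."""
    listener = normalize_rows(lexicon)          # literal listener
    for _ in range(depth):
        speaker = normalize_cols(listener)      # pragmatic speaker
        listener = normalize_rows(speaker)      # pragmatic listener
    return listener


# Hypothetical lexicon. Rows: strings "0", "01", "001", "0011".
# Columns: regexes 0+1{1}, 0{2}1+, 0+1*.
LEXICON = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]

pragmatic = rsa_single(LEXICON, depth=1)
row_01 = pragmatic[1]  # listener distribution after the utterance "01"
print(row_01)
```

On this toy lexicon the pragmatic listener assigns 14/19 (about 0.74) to 0+1{1} and 5/19 to 0+1* after hearing 01, whereas the literal listener would assign 0.5 to each.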
2.4 A Pragmatic Synthesizer from Multiple Examples
RSA_single is capable of producing a program synthesis algorithm from a single example. However, users will typically have to clarify their intent interactively, by giving a sequence of multiple utterances $\mathbf{u} = u_1, \dots, u_n$. The synthesizer must infer the intended program after every turn. With each new utterance, the meaning matrix becomes smaller, as hypotheses inconsistent with the new utterance are ruled out (Figure 3). This is an instance of incremental RSA (Cohn-Gordon et al., 2018b), which models the informative speaker generating utterances auto-regressively:
$$S_1(u_1, \dots, u_n \mid h) = \prod_{i=1}^{n} S_1(u_i \mid h, u_1, \dots, u_{i-1})$$
In essence, the sequence-level speaker $S_1(\mathbf{u} \mid h)$ is the product of multiple single-utterance speakers $S_1$, each computed on a separate meaning matrix (like those in Figure 3). The synthesizer $L_1$ is defined recursively on top of $S_1$: $L_1(h \mid \mathbf{u}) \propto S_1(\mathbf{u} \mid h)\, P(h)$.
Pu et al. (2020) build on top of the incremental RSA algorithm with additional memoization strategies. In this work, we shall call their algorithm RSA. Similar to RSA_single, this algorithm is applicable to enumerative program synthesis domains such as those of Feser et al. (2015); Solar-Lezama (2008); Gulwani (2011).
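The incremental computation described in this subsection might be sketched as below. This is a simplified reading, not the algorithm of Pu et al. (2020): it assumes uniform priors, omits their memoization strategies, and recomputes a single-utterance speaker on the shrinking meaning matrix at every turn. The lexicon and the name `incremental_listener` are hypothetical.

```python
def normalize_rows(matrix):
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total if total > 0 else 0.0 for x in row])
    return result


def normalize_cols(matrix):
    columns = normalize_rows([list(col) for col in zip(*matrix)])
    return [list(row) for row in zip(*columns)]


def incremental_listener(lexicon, utterance_ids):
    """Posterior over hypotheses consistent with every utterance given so far."""
    num_hyps = len(lexicon[0])
    alive = list(range(num_hyps))        # hypotheses not yet ruled out
    speaker_prob = [1.0] * num_hyps      # running product of per-turn speaker terms
    for u in utterance_ids:
        # Meaning matrix restricted to the hypotheses still alive before this turn.
        sub = [[row[h] for h in alive] for row in lexicon]
        speaker = normalize_cols(normalize_rows(sub))
        for k, h in enumerate(alive):
            speaker_prob[h] *= speaker[u][k]
        alive = [h for h in alive if lexicon[u][h] == 1]
        if not alive:
            return {}
    total = sum(speaker_prob[h] for h in alive)
    return {h: speaker_prob[h] / total for h in alive}


# Hypothetical lexicon. Rows: strings "0", "01", "001", "0011".
# Columns: regexes 0+1{1}, 0{2}1+, 0+1*.
LEXICON = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
]

posterior = incremental_listener(LEXICON, [1, 2])  # the user gives "01", then "001"
print(posterior)
```

After the first utterance the second column is ruled out, so the second turn is computed on a two-column matrix, mirroring the shrinking matrices of Figure 3.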
2.5 Exact RSA is Slow
In practice, it is infeasible to explicitly store the matrices of the RSA chain. Instead, writing $H$ for the set of hypotheses and $U$ for the set of utterances, computing $L_1$ using RSA requires $O(|H|)$ calls to $S_1$. Each call to compute $S_1$ requires $O(|U|)$ calls to $L_0$, which in turn requires $O(|H|)$ operations to determine a set of consistent programs. In practice, the pragmatic synthesizer $L_1$ therefore runs in $O(|H|^2 |U|)$ time. In the incremental RSA setting with multiple (say $n$) utterances, the runtime of $L_1$ is $O(n\,|H|^2 |U|)$. As the number of hypotheses and utterances becomes large in a program synthesis domain, it becomes infeasible to compute $L_1$ at the speed required for end-user interactions.
3 Amortizing RSA with Rankings
We explain how the pragmatic listener $L_1$, derived from the RSA algorithm, can be amortized using a single global ranking of programs.
Finding Consistent Programs
Finding correct programs given a sequence of examples is the primary challenge of program synthesis, with solutions ranging from enumeration (Feser et al., 2015) and constraint solving (Solar-Lezama et al., 2006) to neuro-symbolic methods (Polosukhin & Skidanov, 2018; Balog et al., 2016) and large language models for code (Li et al., 2022). In this work, we assume that a set of consistent programs can be found using any of these techniques.
Ranking Consistent Programs with a Prior
A global ranking $\sigma$ is an un-normalized prior (a score) over all programs. The global ranking is example-agnostic: given two programs $h_a$ and $h_b$, either $\sigma(h_a) > \sigma(h_b)$ or $\sigma(h_b) > \sigma(h_a)$, irrespective of the given examples $\mathbf{u}$.
Ranking the consistent programs under $\sigma$ can be very efficient, as it only requires comparing the precomputed scores of the consistent programs. In practice, efficient synthesis algorithms are built using either domain-specific heuristics for rankings (Singh & Gulwani, 2015; Polozov & Gulwani, 2015), or a learned prior from a code corpus (Li et al., 2022).
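To make the efficiency argument concrete, here is a minimal sketch of ranking with an example-agnostic score. The program list, the score values, and the helper names are hypothetical; consistency is checked with plain regex matching as a stand-in for whatever fast synthesizer proposes candidates.

```python
import re

# Hypothetical program space and hypothetical global scores (higher is preferred).
PROGRAMS = ["0+1{1}", "0{2}1+", "0+1*"]
SCORE = {"0+1{1}": 3.0, "0{2}1+": 2.0, "0+1*": 1.0}


def consistent_programs(programs, examples):
    """Programs (regexes) that fully match every example string."""
    return [p for p in programs if all(re.fullmatch(p, e) for e in examples)]


def top_k(programs, examples, score, k=1):
    """Rank the consistent programs by a fixed score that ignores the examples."""
    candidates = consistent_programs(programs, examples)
    return sorted(candidates, key=lambda p: score[p], reverse=True)[:k]


print(top_k(PROGRAMS, ["01"], SCORE))
print(top_k(PROGRAMS, ["0011"], SCORE, k=2))
```

The score table is consulted but never recomputed at inference time, so the ranking step costs only a sort over the candidates that survived the consistency check.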
Ranking with $L_1$
Rather than relying on heuristics or learning from a large corpus, RSA automatically derives a ranked synthesizer $L_1$ from the lexicon alone.
To rank the consistent programs, $L_1$ uses the speaker $S_1(\mathbf{u} \mid h)$, an example-dependent ranking function that ranks the satisfying programs differently depending on the sequence of examples given. In this setting with multiple examples, there could be cycles: a pair (or a triple, or a larger cycle) of satisfying programs $h_a$ and $h_b$ may be ranked $h_a \succ h_b$ under some examples $\mathbf{u}$ and ranked $h_b \succ h_a$ given different examples $\mathbf{u}'$. In this work, we assume that $L_1$ can be tractably computed at non-interactive speed.
Amortizing $L_1$ with a Ranking
In this work, we explore whether the example-dependent ranking of $L_1$ can be approximated, in the sense of having similar top-$k$ responses, with an example-agnostic ranking function $\sigma$. Note that due to the existence of cycles, it may be impossible to find a global ranking that is consistent with all example-dependent rankings. Our key findings are as follows:
Key Finding 1: One can distill a pragmatic ranking $\sigma$ from $L_1$. While this is an approximation, it nonetheless retains much of $L_1$’s communicative accuracy when interacting with end-users, while running orders of magnitude faster.
Key Finding 2: In the special case where only a single example is used, RSA_single, the approximation can be made exact: there exists a global ranking $\sigma$ that perfectly matches the top-$k$ responses of $L_1$ over any example $u$.
4 Distilling of RSA to a Global Ranking
Distilling the example-dependent rankings into a global ranking has two stages. First, we generate a dataset $D$ of triples $(h, \mathbf{u}, \succ_{\mathbf{u}})$, where $h$ is a program, $\mathbf{u}$ is a specification (sequence of examples) used to describe $h$, and $\succ_{\mathbf{u}}$ is the example-dependent ranking of consistent programs given $\mathbf{u}$. (Footnote 3: it is a mouthful, we are terribly sorry.) Then, we distill a global ranking $\sigma$ that aggregates the example-dependent rankings in $D$.
4.1 Dataset Generation via Simulated Communications
The pragmatic listener $L_1$ can generate a partial ranking of consistent programs for any sequence of examples $\mathbf{u}$. As arbitrary examples are unlikely to reflect what a user might give at inference time, we use the informative speaker $S_1$ as a “stand-in”. Specifically, we generate $D$ in the form of simulated interactions between the pragmatic speaker $S_1$ and the pragmatic listener $L_1$. We enumerate over the set of programs, then use the pragmatic speaker to sample the most likely specifications (sequences of examples) over a range of lengths. For each specification, we query $L_1$ for a partial ranking of consistent programs, and add it to the dataset (Algorithm 1).
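The data-generation loop described above could look roughly like the following. Algorithm 1 itself is not reproduced here, so this is only a sketch of its outer structure: the two callables `top_specs` and `rank_consistent` are hypothetical placeholders for the pragmatic speaker and the pragmatic listener, and the toy versions below are simple regex-based stand-ins, not RSA.

```python
import re

PROGRAMS = ["0+1{1}", "0{2}1+", "0+1*"]   # hypothetical program space
POOL = ["0", "01", "001", "0011"]          # hypothetical example pool


def generate_dataset(programs, top_specs, rank_consistent):
    """Collect (target, specification, ranking) triples from simulated interactions."""
    dataset = []
    for target in programs:
        for spec in top_specs(target):          # speaker side: likely specifications
            ranking = rank_consistent(spec)     # listener side: ranked consistent programs
            dataset.append((target, spec, ranking))
    return dataset


def toy_top_specs(target, max_len=2):
    """Stand-in speaker: prefixes of the pool strings that the target matches."""
    matching = [s for s in POOL if re.fullmatch(target, s)]
    return [tuple(matching[:n]) for n in range(1, max_len + 1) if len(matching) >= n]


def toy_rank_consistent(spec):
    """Stand-in listener: consistent programs, those matching fewer pool strings first."""
    consistent = [p for p in PROGRAMS if all(re.fullmatch(p, s) for s in spec)]
    return sorted(consistent, key=lambda p: sum(1 for s in POOL if re.fullmatch(p, s)))


dataset = generate_dataset(PROGRAMS, toy_top_specs, toy_rank_consistent)
for triple in dataset:
    print(triple)
```

Passing the speaker and listener in as callables keeps the loop independent of how they are computed, so an exact RSA implementation or a cheaper approximation could be substituted without changing the dataset format.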
4.2 Distillation via Annealing
The most straightforward representation of a ranking is as an explicit ordered list of programs $\sigma$. We describe a process of finding an approximate global ranking using annealing. We repeatedly sample example-dependent rankings from $D$, and update the global ranking $\sigma$ to agree with the sampled ranking on a single pair of programs drawn from it. Since cycles exist in example-dependent rankings, the number of swaps never reaches zero in general, so we terminate the annealing procedure once the number of swaps in a sliding window has stabilized (Algorithm 2). The resulting $\sigma$ is then used at inference time.
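A rough sketch of this swap-based procedure is given below. Algorithm 2 is not reproduced here, so several details are assumptions: the stopping rule compares swap counts in consecutive non-overlapping windows rather than a true sliding window, and the parameter names (`window`, `max_steps`, `seed`) and the function name are ours.

```python
import random


def distill_by_swaps(rankings, programs, window=500, max_steps=100_000, seed=0):
    """Distill one global order from example-dependent rankings (each best-first)."""
    rng = random.Random(seed)
    order = list(programs)
    position = {p: i for i, p in enumerate(order)}
    previous_swaps, current_swaps = None, 0
    for step in range(1, max_steps + 1):
        ranking = rng.choice(rankings)
        if len(ranking) >= 2:
            # Sample one pair from the ranking; `better` precedes `worse` in it.
            i, j = sorted(rng.sample(range(len(ranking)), 2))
            better, worse = ranking[i], ranking[j]
            if position[better] > position[worse]:
                # The global order disagrees on this pair: swap the two programs.
                a, b = position[better], position[worse]
                order[a], order[b] = order[b], order[a]
                position[better], position[worse] = b, a
                current_swaps += 1
        if step % window == 0:
            # Assumed stopping rule: swap count unchanged between consecutive windows.
            if previous_swaps is not None and current_swaps == previous_swaps:
                break
            previous_swaps, current_swaps = current_swaps, 0
    return order


rankings = [["a", "b", "c"], ["a", "c"]]
result = distill_by_swaps(rankings, ["c", "b", "a"])
print(result)
```

When the sampled rankings contain no cycles, every swap removes at least one disagreement, so the order settles and the swap count drops to zero. With cycles, the count plateaus at a nonzero level, which is why the stopping rule looks for stability rather than zero.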
4.3 Distillation via Learning a Score Function
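An alternative method to distill $L_1$ is to train a score function $s_\theta$ that assigns a program a score independent of the specification $\mathbf{u}$. We can optimize $\theta$ to minimize disagreement with the generated dataset $D$ of example-dependent rankings, by minimizing the loss
$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(h_a \succ h_b) \sim D}\Big[\log \operatorname{sigmoid}\big(s_\theta(h_a) - s_\theta(h_b)\big)\Big]$$
where $\operatorname{sigmoid}(x) = 1 / (1 + e^{-x})$ is the sigmoid function and $(h_a \succ h_b)$ is a pair of programs drawn from a ranking in $D$ with $h_a$ ranked above $h_b$. This follows the standard approach of estimating a score function from a set of pairwise preferences (Bradley & Terry, 1952; Christiano et al., 2017). We parametrize $s_\theta$ as a small neural network that scores programs. To reduce variance, we fit an ensemble of score functions and use their average to rank the consistent programs at inference time (Christiano et al., 2017). Details of the neural models are in Appendix E.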
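The pairwise-preference objective can be illustrated with a deliberately simplified model. Instead of the small neural network used in the paper, this sketch keeps one free scalar score per program (a tabular Bradley-Terry model) and fits it by plain gradient descent; the step count, learning rate, and function names are assumptions.

```python
import math


def pairwise_loss(scores, pairs):
    """Mean of -log(sigmoid(score[better] - score[worse])) over preference pairs."""
    total = 0.0
    for better, worse in pairs:
        diff = scores[better] - scores[worse]
        total += math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
    return total / len(pairs)


def fit_scores(programs, pairs, steps=200, lr=0.5):
    """Gradient descent on the pairwise loss with one scalar score per program."""
    scores = {p: 0.0 for p in programs}
    for _ in range(steps):
        grad = {p: 0.0 for p in programs}
        for better, worse in pairs:
            diff = scores[better] - scores[worse]
            g = 1.0 / (1.0 + math.exp(diff))  # equals 1 - sigmoid(diff)
            grad[better] -= g
            grad[worse] += g
        for p in programs:
            scores[p] -= lr * grad[p] / len(pairs)
    return scores


# Each pair is (preferred program, less preferred program).
pairs = [("a", "b"), ("b", "c"), ("a", "c")]
scores = fit_scores(["a", "b", "c"], pairs)
print(scores, pairwise_loss(scores, pairs))
```

A tabular model cannot generalize to programs missing from the dataset, which is one reason to prefer a parametric scorer over program features when the program space is large.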
5 Experiments
To validate the accuracy and run-time of an approximate ranking listener, we perform two sets of experiments. First, we conduct a small ($N = 8$) human experiment by building a ranking-based synthesizer in a regular expression synthesis domain where it is infeasible to run the RSA algorithm at interaction time. Second, we conduct two replay studies by simulating virtual users giving examples one after another using human interaction data collected from prior works. We seek to answer the following questions: (Q1) Can ranking-based synthesizers accurately infer programs from humans (both in live interaction and in simulated replays)? (Q2) Are ranking-based synthesizers fast to run when compared to $L_0$ and $L_1$?
Metrics
In our experiments, the users (real or simulated) will be given a target program, and attempt to communicate it to the synthesizers using examples. The synthesizers will be measured on their communication accuracy — whether the synthesizers can infer the target program from the examples given. A synthesizer is better than another if it can recover the target program using fewer examples.
5.1 Interactive User Study
We conduct a user study where people interacted with both the ranking-based synthesizer distilled with annealing, $L_\sigma$, and the literal synthesizer $L_0$ on the domain of regular expression synthesis.
The Regex Domain
The regex domain is a scaled-up version of the domain of Vaithilingam et al. (2023), which has a total of 350 regular expressions from their grammar (Figure 4). For this study, we expanded the space of programs to 3500 regular expressions from the same grammar, a setting that would make live interaction with RSA infeasible.
Procedure
We recruited 8 participants from our institution. Each participant was given a short tutorial on how to use the interface, then attempted to communicate a total of 4 regexes using examples. For each regex, the participant communicated with both the literal synthesizer $L_0$ and the ranking synthesizer $L_\sigma$, anonymized as simply a “green robot” and a “blue robot”, in randomized order. The participants gave example strings one at a time until the synthesizer recovered the regex or until they chose to give up. The communication was interactive: when the participant added a new example, they were immediately shown the current top-1 guess of the synthesizer, which allowed them to choose the next example accordingly.
Results: end-users interact well with an amortized ranking synthesizer (Q1)
Figure 5 shows the communication success rate over the number of given examples (turns) for both the literal and ranking-based synthesizers. We can see that (1) the ranking-based synthesizer $L_\sigma$ has a higher overall success rate with humans, and (2) it also achieves a higher success rate with fewer examples (Q1).
5.2 Simulated User Studies Using Replays
We evaluate the ranking-based synthesizers by replaying the interaction data collected from Vaithilingam et al. (2023) and Pu et al. (2020) – small pragmatic program synthesis domains where it is feasible to run with RSA.
Replay Data
In the human studies by Vaithilingam et al. (2023) and Pu et al. (2020), a human is given a target program $h$ and attempts to get the synthesizer ($L_0$ or $L_1$) to infer the target using a sequence of examples $\mathbf{u}$. Thus, two sets of data are generated: one where the human is interacting with the literal synthesizer $L_0$, which we term $D_{L_0}$, and one where the human is interacting with the pragmatic synthesizer $L_1$, which we term $D_{L_1}$. Specifically, from each domain we extract a dataset of (target program, example sequence) pairs, with one pair for each combination of a program from the set of programs used for the human study (the stimuli), a participant from the set of participants, and an indicator of whether the participant is communicating with $L_0$ or $L_1$.
Experiment Setup
We can simulate a user interaction by using the replay data. Given a datapoint $(h, \mathbf{u})$, we create a simulated user that iteratively gives the examples in $\mathbf{u}$ over multiple turns to communicate the given target program $h$. At every turn, the synthesizer returns the top-1 responses, and we check if any of them matches the target program $h$. If one does, we mark the communication as successful and stop early. Otherwise, we keep adding examples until $\mathbf{u}$ runs out, at which point we mark the communication as unsuccessful. Note that our evaluation cannot account for a user adapting their choice of examples to the ranking-based synthesizer, as the simulated user can only give scripted examples according to the replay data.
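The replay protocol above amounts to a short loop. The sketch below assumes a synthesizer exposed as a function from the examples given so far to a list of top responses; the toy synthesizer, its scores, and the function names are hypothetical and only serve to make the loop runnable.

```python
import re

PROGRAMS = ["0+1{1}", "0{2}1+", "0+1*"]                    # hypothetical program space
SCORE = {"0{2}1+": 3.0, "0+1{1}": 2.0, "0+1*": 1.0}        # hypothetical global scores


def toy_synthesizer(examples):
    """Top-1 response of a toy ranking-based synthesizer (empty list if none fit)."""
    consistent = [p for p in PROGRAMS if all(re.fullmatch(p, e) for e in examples)]
    return sorted(consistent, key=lambda p: SCORE[p], reverse=True)[:1]


def replay(target, examples, synthesize):
    """Return the turn at which the target is recovered, or None if it never is."""
    given = []
    for turn, example in enumerate(examples, start=1):
        given.append(example)               # the scripted user adds the next example
        if target in synthesize(given):
            return turn                     # success: stop early
    return None                             # examples ran out: unsuccessful


print(replay("0+1{1}", ["001", "01"], toy_synthesizer))
print(replay("0+1*", ["001"], toy_synthesizer))
```

Because the examples are scripted, the loop never reacts to the synthesizer's guesses, which is the limitation noted in the text.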
Domain 1: Animals
Pu et al. (2020) used a domain of grid patterns generated by an underlying domain-specific language (see Appendix for the grammar of the DSL and semantics). The space contains 17,976 semantically distinct programs and 343 possible examples, where a user uses a sequence of multiple examples to communicate a target program. They conducted a study with 48 human subjects, collecting data for 10 programs (10 distinct grid patterns). The data includes interactions between humans and both a literal synthesizer ($L_0$) and a pragmatic synthesizer ($L_1$). In total, there are 254 interactions in $D_{L_0}$ and 291 interactions in $D_{L_1}$, where each interaction consists of multiple turns until either the target program is successfully communicated or the user gives up.
Domain 2: Regular expressions
Vaithilingam et al. (2023) studied the usability of pragmatic program synthesizers in the domain of binary regular expressions. The space contains 350 distinct regular expressions. A sample of 2000 strings was used to compute the RSA distributions. Their study included 30 participants interacting with both $L_0$ and $L_1$ models. In total, there are 60 interactions in $D_{L_0}$ and 60 interactions in $D_{L_1}$, each consisting of multiple turns.
Result: rank-based synthesizers are comparable to $L_1$ in terms of communication accuracy with simulated users (Q1)
The replay study results are shown in Figure 6 (animals domain) and Figure 7 (regex domain). For either domain, there is a rank-based synthesizer that vastly outperforms the literal synthesizer $L_0$, and is close in performance to the pragmatic synthesizer $L_1$ derived from RSA.
The existence of a rank-based synthesizer (be it distilled by annealing or by a learned score function) that matches the performance of $L_1$ entails that there exists some ranking of programs that effectively amortizes $L_1$ for either domain. For the animals domain, the learned score function is better able to discover an effective ranking, while annealing is more effective at discovering the ranking for the regex domain. This is likely due to the difference in the sizes of the communicative datasets for the two domains (17,976 programs for the animals domain vs 350 for the regex domain), which makes it more feasible to learn a generalizable neural scoring function for the animals domain.
Result: rank-based synthesizers are orders of magnitude faster than $L_1$ (Q2)
For both domains, the ranking-based synthesizer is much faster than $L_1$, requiring approximately the same time as $L_0$ (Figure 8). This implies that most of the computation cost of a ranking-based synthesizer lies in coming up with consistent programs, the primary challenge of program synthesis, while the computation for ranking the top-$k$ programs can be made negligible in comparison (Q2).
6 RSA_single Can Be Distilled Completely
In this section, we prove a strong approximation result for a special case of RSA, RSA_single, where only a single example is used to communicate. In accordance with the terminology of Goodman & Frank (2016); Vogel et al. (2013); Monroe & Potts (2015); Smith et al. (2013) and Franke & Degen (2016), we’ll use the term “hypothesis” instead of “program”. We prove that a global pragmatic ranking of hypotheses must exist for any listener resulting from the RSA_single algorithm. (Footnote 4: one can derive the same result for a pragmatic ranking of speakers by taking the transpose of the lexicon.) In other words, the rankings over consistent hypotheses in these listeners are example-agnostic.
Theorem:
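For a sequence of listeners $L_0, L_1, \dots, L_k, \dots$ in the RSA algorithm over a boolean-valued lexicon $M$, there exists a sequence of global pragmatic rankings $\sigma_0, \sigma_1, \dots, \sigma_k, \dots$ such that:
$$\forall u, h_a, h_b: \quad M[u, h_a] = M[u, h_b] = 1 \;\implies\; \Big(\sigma_k(h_a) > \sigma_k(h_b) \iff L_k(h_a \mid u) > L_k(h_b \mid u)\Big) \tag{4}$$
This means the partial rankings produced by any $L_k$ over consistent hypotheses are example-agnostic, where a global ranking preferring certain hypotheses unconditionally over others (e.g. a convention) is sufficient to explain the relative rankings of the listener resulting from RSA_single.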
Proof:
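Let $M$ be a boolean lexicon with $m$ rows and $n$ columns. Let $r(M)$ be the row-normalizing vector such that $r(M)_i = 1 / \sum_j M[i, j]$, which is to say, each element $r(M)_i$ is the normalization term for row $i$ of $M$. Let $\odot_{\text{row}}$ denote row-wise multiplication:
$$L_0 = r(M) \odot_{\text{row}} M, \qquad L_0[i, j] = r(M)_i \, M[i, j]$$
Which is to say, starting from $M$, $L_0$ can be obtained by scaling each row by its respective normalization constant $r(M)_i$. Let $c(L_0)$ be the column-normalizing vector such that $c(L_0)_j = 1 / \sum_i L_0[i, j]$, which is to say, each element $c(L_0)_j$ is the normalization term for column $j$ of $L_0$. Similarly, let $\odot_{\text{col}}$ denote column-wise multiplication:
$$S_1 = L_0 \odot_{\text{col}} c(L_0), \qquad S_1[i, j] = L_0[i, j] \, c(L_0)_j$$
Computing $L_k$ under RSA amounts to applying row and column normalization alternately multiple times:
$$L_k = r(S_k) \odot_{\text{row}} \Big( \cdots \big( r(M) \odot_{\text{row}} M \odot_{\text{col}} c(L_0) \big) \cdots \odot_{\text{col}} c(L_{k-1}) \Big)$$
Let $\odot$ be element-wise multiplication and let $\otimes$ be the outer product. We can rearrange the terms:
$$L_k = M \odot (R_k \otimes C_k), \qquad R_k = r(M) \odot r(S_1) \odot \cdots \odot r(S_k), \qquad C_k = c(L_0) \odot \cdots \odot c(L_{k-1}) \tag{5}$$
Here, $R_k$ is a vector of size $m$, and $C_k$ is a vector of size $n$. As we can see, following the RSA algorithm, $L_k$ can be decomposed into the element-wise multiplication of 2 parts: the lexicon $M$, and a matrix that is formed by the outer product $R_k \otimes C_k$. (Footnote 5: note that any prior over hypotheses and utterances can be similarly absorbed into these outer product terms.)
Claim: The ordered indexes of $C_k$ form the global pragmatic ranking $\sigma_k$, that is, $\sigma_k(h) = C_k[h]$.
Proof: We show both directions of the $\iff$. Suppose that for some utterance $u$, both $M[u, h_a] = 1$ and $M[u, h_b] = 1$ (i.e. both hypotheses are consistent with $u$).
(1) Show $\Rightarrow$: Suppose $C_k[h_a] > C_k[h_b]$. We have
$$L_k(h_a \mid u) = R_k[u] \, C_k[h_a], \qquad L_k(h_b \mid u) = R_k[u] \, C_k[h_b]$$
As $R_k[u] > 0$ is a constant, we have $L_k(h_a \mid u) > L_k(h_b \mid u)$.
(2) Show $\Leftarrow$: Suppose $L_k(h_a \mid u) > L_k(h_b \mid u)$. Then $R_k[u] \, C_k[h_a] > R_k[u] \, C_k[h_b]$, and dividing both sides by $R_k[u] > 0$ gives $C_k[h_a] > C_k[h_b]$.
Thus, $C_k$ is the global ranking as claimed. $\blacksquare$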
We check our proof using simulations on randomly generated boolean lexicons of varying sizes, running a chain of listeners on top of each. A total ordering can be found for all of them (Appendix B.1). We further study the stability of these ranks as they are formed, finding that the formed rankings tend to be stable across different RSA iterations (Appendix B.2).
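A small check in the spirit of these simulations is sketched below. It is not the authors' simulation code: it uses one fixed hypothetical lexicon rather than random ones, assumes every row and column contains at least one 1, and uses exact rational arithmetic so that the comparison with the decomposition in Equation 5 involves no rounding. All function names are ours.

```python
from fractions import Fraction


def rsa_with_factors(lexicon, depth):
    """Run alternating normalization, tracking accumulated row and column factors."""
    rows, cols = len(lexicon), len(lexicon[0])
    current = [[Fraction(x) for x in row] for row in lexicon]
    row_factor = [Fraction(1)] * rows
    col_factor = [Fraction(1)] * cols

    def normalize_rows():
        for i in range(rows):
            total = sum(current[i])
            row_factor[i] /= total
            current[i] = [x / total for x in current[i]]

    def normalize_cols():
        for j in range(cols):
            total = sum(current[i][j] for i in range(rows))
            col_factor[j] /= total
            for i in range(rows):
                current[i][j] /= total

    normalize_rows()                 # literal listener
    for _ in range(depth):
        normalize_cols()             # pragmatic speaker
        normalize_rows()             # pragmatic listener
    return current, row_factor, col_factor


def ranking_is_global(lexicon, depth):
    """Check the factorization and that column factors order every row's consistent set."""
    listener, row_factor, col_factor = rsa_with_factors(lexicon, depth)
    for i, row in enumerate(lexicon):
        consistent = [j for j, bit in enumerate(row) if bit]
        for a in consistent:
            if listener[i][a] != row_factor[i] * col_factor[a]:
                return False
            for b in consistent:
                if (listener[i][a] > listener[i][b]) != (col_factor[a] > col_factor[b]):
                    return False
    return True


# Hypothetical boolean lexicon (rows: utterances, columns: hypotheses).
LEXICON = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
]

print([ranking_is_global(LEXICON, depth) for depth in range(4)])
```

The column factors returned by `rsa_with_factors` play the role of the global ranking: sorting hypotheses by them reproduces the listener's ordering of consistent hypotheses in every row.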
7 Related Works
Scaling RSA without Global Ranking
Prior work such as that by Monroe et al. (2017) and Andreas & Klein (2016) has largely focused on sample and re-rank as a way of scaling RSA, making the example-dependent ranking function more efficient at a cost of accuracy. Recent works by Key et al. (2022) and Vaduguru et al. (2024) apply the sample and re-rank approach to program synthesis, resulting in neural program synthesizers that also rank programs in an example-dependent way. Our work enables a different kind of synthesis algorithm altogether: a distilled pragmatic ranking that ranks consistent programs independently of the examples given. We view these works as complementary, as they can efficiently produce a simulated communication dataset from which our approach can distill.
Scaling RSA with Human Data
RSA has been applied to improve the performance of language interfaces in a variety of other domains, such as image description (Andreas & Klein, 2016; Cohn-Gordon et al., 2018a, b), instruction generation and interpretation (Fried et al., 2018a, b), and grounded interaction (Fried et al., 2021; Lin et al., 2022). These works all use speaker models trained on labeled data from people. Our approach requires no human-produced data, and can be run entirely from the lexicon of the synthesis problem. On the other hand, we can easily integrate human data within our approach by training similar speaker models on the collected interactive data.
Ranking Functions in Synthesis
Prior works on resolving ambiguity in program synthesis have relied on example-agnostic ranking functions. Works such as Singh & Gulwani (2015); Polozov & Gulwani (2015) use scoring functions to penalize certain properties of programs (e.g. discouraging the use of constants), effectively inducing a global ranking over all programs; Ellis & Gulwani (2017) uses a set of hand-crafted features to learn a naturalistic ranking from data. Synthesis algorithms that use a large neural code model to sample a large number of programs (Chen et al., 2021; Li et al., 2022) implicitly rank the programs based on their naturalistic distributions in its training data. Our work is unique in that (1) the learned ranking is rooted in efficient communication rather than hand-crafted features and (2) our approach does not require human annotated data.
Other Theoretical Works on Ranking
Recent work by Muggleton (2023) shows that in the single-example case, the MAP estimate of the learner can be completely ranked by an example-agnostic global ranking. Our work can be viewed as a strict generalization in the following sense: they consider a specific chain of recursive Bayesian reasoners, whereas our result applies to alternating chains of speakers and listeners of arbitrary depth. Their notions of “specificity” and “program length” also have direct analogies to the normalization terms in Equation 5, except these analogies do not carry over to deeper recursive depths.
8 Conclusion
We present a way of amortizing the expensive RSA algorithm with an example-agnostic global ranking. We have shown that this amortization interacts well with humans when applied to two program synthesis domains. We have further proved that this amortization is exact in the case of communication with a single example. In addition to being a practical method for scaling up RSA, these findings may provide an alternative account of pragmatic behaviour in humans: one rooted in relative rankings of hypotheses (e.g. a pragmatic prior), perhaps distilled from the expensive RSA computation over time.
8.1 Limitation and Future Directions
Our approach leaves two questions open: first, whether an optimal global ranking exists for the multi-example programming-by-example (PBE) setting; second, whether our distillation algorithm can find this optimal ranking.
Existence of an effective global ranking
The effectiveness of a global ranking is limited by the number of cycles that exist in the communicative dataset of example-dependent rankings of subsets of programs. A cycle exists if under one ranking we have $h_a \succ h_b$, and under a different one we have $h_b \succ h_a$, which no single ranking can approximate exactly. Forecasting the number of cycles from the meaning matrix is an exciting direction for future work.
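Whether a given set of example-dependent rankings contains such a cycle can be tested directly. The sketch below is an illustration rather than part of the paper's method: it merges all rankings into one preference graph and attempts a topological sort (Kahn's algorithm), which succeeds exactly when some single global ranking agrees with every pairwise preference. The function name is hypothetical.

```python
from collections import defaultdict


def admits_global_ranking(rankings):
    """Return an order consistent with all rankings (each best-first), or None on a cycle."""
    successors = defaultdict(set)   # better -> set of programs it is ranked above
    indegree = defaultdict(int)     # number of distinct programs ranked above each program
    for ranking in rankings:
        for i, better in enumerate(ranking):
            indegree[better] += 0   # make sure every program has an entry
            for worse in ranking[i + 1:]:
                if worse not in successors[better]:
                    successors[better].add(worse)
                    indegree[worse] += 1
    ready = [p for p, degree in indegree.items() if degree == 0]
    order = []
    while ready:
        program = ready.pop()
        order.append(program)
        for worse in successors[program]:
            indegree[worse] -= 1
            if indegree[worse] == 0:
                ready.append(worse)
    # Programs left unprocessed are exactly those that sit on or behind a cycle.
    return order if len(order) == len(indegree) else None


print(admits_global_ranking([["a", "b", "c"], ["a", "c"]]))
print(admits_global_ranking([["a", "b"], ["b", "a"]]))
```

A `None` result only says that no exact global ranking exists; it does not measure how many preferences the best approximate ranking would violate.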
Effectiveness of distilling an effective global ranking
Our experiments have shown that given a communicative dataset, both the annealing (in the case of a small dataset) and neural scoring (in the case of a larger dataset) have their merits in deriving a ranking. Thus, running the slow RSA in the dataset generation itself is the likely bottleneck. We believe recent works by Key et al. (2022) and Vaduguru et al. (2024) using sample-and-rerank may be used in generating the communicative dataset instead of the exact RSA algorithm.
Impact Statement
This work builds a system where end-users may use examples to generate programs. While the proposed method is more intuitive to use by humans, it is possible that for some interactions, it may generate unexpected programs. Therefore, it could be of potential danger when humans do not manually verify the generated program, as it may have unintended outcomes when executed.
Acknowledgements
The authors would like to thank Kevin Ellis, Pei Wang, and Jesse Wang for preliminary explorations in this direction and insights to the proof. SV was partially supported by a gift from Autodesk Research. This material is based upon work supported by the NSF under Grant Nos. CCF-2123965 and IIS-2107391. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
References
- Andreas & Klein (2016) Andreas, J. and Klein, D. Reasoning about pragmatics with neural listeners and speakers. arXiv preprint arXiv:1604.00562, 2016.
- Balog et al. (2016) Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. ICLR, 2016.
- Bergen et al. (2016) Bergen, L., Levy, R., and Goodman, N. Pragmatic reasoning through semantic inference. Semantics and Pragmatics, 9, 2016.
- Bradley & Terry (1952) Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. ISSN 00063444. URL http://www.jstor.org/stable/2334029.
- Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021.
- Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.
- Cohn-Gordon et al. (2018a) Cohn-Gordon, R., Goodman, N., and Potts, C. Pragmatically informative image captioning with character-level inference. arXiv preprint arXiv:1804.05417, 2018a.
- Cohn-Gordon et al. (2018b) Cohn-Gordon, R., Goodman, N. D., and Potts, C. An incremental iterated response model of pragmatics. arXiv preprint arXiv:1810.00367, 2018b.
- Ellis & Gulwani (2017) Ellis, K. and Gulwani, S. Learning to learn programs from examples: Going beyond program structure. IJCAI, 2017.
- Feser et al. (2015) Feser, J. K., Chaudhuri, S., and Dillig, I. Synthesizing data structure transformations from input-output examples. PLDI ’15, pp. 229–239, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334686. doi: 10.1145/2737924.2737977. URL https://doi.org/10.1145/2737924.2737977.
- Frank & Goodman (2012) Frank, M. C. and Goodman, N. D. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012.
- Franke & Degen (2016) Franke, M. and Degen, J. Reasoning in reference games: Individual-vs. population-level probabilistic modeling. PloS one, 11(5):e0154854, 2016.
- Fried et al. (2018a) Fried, D., Andreas, J., and Klein, D. Unified pragmatic models for generating and following instructions. In Proceedings of North American Chapter of the Association for Computational Linguistics, 2018a.
- Fried et al. (2018b) Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., and Darrell, T. Speaker-follower models for vision-and-language navigation. Advances in Neural Information Processing Systems, 31, 2018b.
- Fried et al. (2021) Fried, D., Chiu, J. T., and Klein, D. Reference-centric models for grounded collaborative dialogue. arXiv preprint arXiv:2109.05042, 2021.
- Goodman & Frank (2016) Goodman, N. D. and Frank, M. C. Pragmatic language interpretation as probabilistic inference. Trends in cognitive sciences, 20(11):818–829, 2016.
- Gulwani (2011) Gulwani, S. Automating string processing in spreadsheets using input-output examples. SIGPLAN Not., 46(1):317–330, jan 2011. ISSN 0362-1340. doi: 10.1145/1925844.1926423. URL https://doi.org/10.1145/1925844.1926423.
- Key et al. (2022) Key, D., Li, W.-D., and Ellis, K. I speak, you verify: Toward trustworthy neural program synthesis. arXiv preprint arXiv:2210.00848, 2022.
- Lewis (1979) Lewis, D. Scorekeeping in a language game. Journal of philosophical logic, 8:339–359, 1979.
- Li et al. (2022) Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022.
- Lin et al. (2022) Lin, J., Fried, D., Klein, D., and Dragan, A. Inferring rewards from language in context. arXiv preprint arXiv:2204.02515, 2022.
- Monroe & Potts (2015) Monroe, W. and Potts, C. Learning in the rational speech acts model. arXiv preprint arXiv:1510.06807, 2015.
- Monroe et al. (2017) Monroe, W., Hawkins, R. X., Goodman, N. D., and Potts, C. Colors in context: A pragmatic neural model for grounded language understanding. Transactions of the Association for Computational Linguistics, 5:325–338, 2017.
- Muggleton FREng (2023) Muggleton FREng, S. Hypothesizing an algorithm from one example: the role of specificity. Philosophical Transactions of the Royal Society A, 381(2251):20220046, 2023.
- Polosukhin & Skidanov (2018) Polosukhin, I. and Skidanov, A. Neural program search: Solving programming tasks from description and examples. arXiv preprint arXiv:1802.04335, 2018.
- Polozov & Gulwani (2015) Polozov, O. and Gulwani, S. Flashmeta: A framework for inductive program synthesis. ACM SIGPLAN Notices, 50(10):107–126, 2015.
- Pu et al. (2020) Pu, Y., Ellis, K., Kryven, M., Tenenbaum, J., and Solar-Lezama, A. Program synthesis with pragmatic communication. Advances in Neural Information Processing Systems, 33:13249–13259, 2020.
- Singh & Gulwani (2015) Singh, R. and Gulwani, S. Predicting a correct program in programming by example. In CAV, pp. 398–414. Springer, 2015.
- Smith et al. (2013) Smith, N. J., Goodman, N., and Frank, M. Learning and using language via recursive pragmatic reasoning about other agents. Advances in neural information processing systems, 26, 2013.
- Solar-Lezama (2008) Solar-Lezama, A. Program synthesis by sketching. PhD thesis, USA, 2008. AAI3353225.
- Solar-Lezama et al. (2006) Solar-Lezama, A., Tancau, L., Bodik, R., Seshia, S., and Saraswat, V. Combinatorial sketching for finite programs. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, pp. 404–415, 2006.
- Vaduguru et al. (2022) Vaduguru, S., Ellis, K., and Pu, Y. Efficient pragmatic program synthesis with informative specifications, 2022.
- Vaduguru et al. (2024) Vaduguru, S., Fried, D., and Pu, Y. Generating pragmatic examples to train neural program synthesizers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=yxKZGQLzOP.
- Vaithilingam et al. (2023) Vaithilingam, P., Pu, Y., and Glassman, E. L. The usability of pragmatic communication in regular expression synthesis, 2023.
- Vogel et al. (2013) Vogel, A., Bodoia, M., Potts, C., and Jurafsky, D. Emergence of gricean maxims from multi-agent decision theory. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp. 1072–1081, 2013.
Appendix A Code and Assets
All simulation and replay results are available in this repository: https://github.com/evanthebouncy/pragmatic_synthesis_ranking/tree/main
Appendix B Simulated Studies
B.1 Ranking Always Exists
We empirically validate that, in the single-utterance case, a ranking can always be found. See simulation/single_utter/exp_exists_orders.py.
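The kind of check described above can be sketched as follows. This is a minimal illustration, not the repository script: it assumes the standard RSA recursion over a boolean lexicon with uniform priors (rows are utterances, columns are hypotheses), and the function names `rsa_listener`, `is_acyclic` and `global_ranking_exists` are hypothetical. For each utterance it collects the strict pairwise preferences the listener assigns among consistent hypotheses, then tests whether the union of these preferences is acyclic, which is equivalent to a single global ranking existing.

```python
import itertools
import random

def normalize(vec):
    """Scale a non-negative vector to sum to 1 (all-zero vectors stay zero)."""
    total = sum(vec)
    return [v / total if total > 0 else 0.0 for v in vec]

def rsa_listener(lexicon, iterations=1):
    """Return the listener after `iterations` RSA steps, as rows over hypotheses."""
    n_u, n_w = len(lexicon), len(lexicon[0])
    listener = [normalize(row) for row in lexicon]  # literal listener
    for _ in range(iterations):
        # Speaker: for each hypothesis, normalize over utterances.
        speaker = [normalize([listener[u][w] for u in range(n_u)]) for w in range(n_w)]
        # Listener: for each utterance, normalize over hypotheses.
        listener = [normalize([speaker[w][u] for w in range(n_w)]) for u in range(n_u)]
    return listener

def is_acyclic(n, edges):
    """Kahn's algorithm: True iff the directed graph on n nodes has no cycle."""
    indegree = [0] * n
    successors = {i: [] for i in range(n)}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1
    frontier = [i for i in range(n) if indegree[i] == 0]
    visited = 0
    while frontier:
        node = frontier.pop()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                frontier.append(nxt)
    return visited == n

def global_ranking_exists(lexicon, iterations=1, tol=1e-12):
    """Check that per-utterance rankings can be merged into one global ranking."""
    listener = rsa_listener(lexicon, iterations)
    edges = set()  # (a, b): some utterance ranks hypothesis a strictly above b
    for u, row in enumerate(lexicon):
        consistent = [w for w, bit in enumerate(row) if bit]
        for a, b in itertools.permutations(consistent, 2):
            if listener[u][a] > listener[u][b] + tol:
                edges.add((a, b))
    return is_acyclic(len(lexicon[0]), edges)
```

The tolerance `tol` is an assumption introduced here so that floating-point noise between tied hypotheses is not mistaken for a strict preference.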
B.2 Stability of Ranks Across RSA Iterations
We have shown that for every , there exists a corresponding global, utterance-agnostic ranking . We now explore the relationship between these rankings as a function of the RSA iteration . Specifically, how stable is the relative rank of and once it is formed?
Stable Order
A pair-wise order between and is stable from iteration onward if:
This means that the relative ranking of holds for every subsequent iteration until . Let the minimal-index of a stable pair-wise ordering be the first iteration such that becomes stable:
(6) |
As is the first time any ranking can exist ( is a uniform distribution over valid hypotheses, i.e. no rankings), we explore the following: For a lexicon , what fraction of stable orderings have a minimal-index of 1?
(7) |
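One possible way to write these definitions down is sketched below. The notation is assumed here for illustration and is not taken from the original equations: $\sigma_j$ stands for the global ranking at RSA iteration $j$, $w \succ_{\sigma_j} w'$ means that $\sigma_j$ places $w$ above $w'$, $i^{*}$ is the minimal-index, and $\rho(M)$ is the fraction of stable orderings of a lexicon $M$ whose minimal-index is 1.

```latex
\begin{align*}
\mathrm{stable}(w, w', i) \;&\iff\; \forall j \ge i:\; w \succ_{\sigma_j} w' \\
i^{*}(w, w') \;&=\; \min \{\, i \ge 1 \;:\; \mathrm{stable}(w, w', i) \,\} \\
\rho(M) \;&=\; \frac{\left|\{\, (w, w') : i^{*}(w, w') = 1 \,\}\right|}
                    {\left|\{\, (w, w') : i^{*}(w, w') < \infty \,\}\right|}
\end{align*}
```

Under this reading, a pair that never settles into a fixed order has no finite minimal-index and is excluded from the denominator.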
Simulation
We measure on a population of randomly sampled boolean lexicons. We sample square lexicons of size . Each lexicon is sampled with , where a larger value of gives the lexicon more 1s. We ensure that each sampled lexicon is valid in the following sense: (1) all rows are unique, so every utterance communicates a unique subset of valid hypotheses; (2) all columns are unique, so every hypothesis has a unique set of utterances that can refer to it. For every combination of we randomly sample 100 lexicons. As it is infeasible to run RSA until iteration , we run RSA for 100 iterations for each lexicon (i.e. ). We measure for each sampled lexicon. The result is shown in Figure 9. Of all the stable pair-wise orderings, a large fraction () are formed during . This is increasingly true as we (1) increase , so that the boolean lexicons contain more 1s (i.e. the lexicon is more ambiguous for a literal speaker and listener), and (2) increase . We suspect this is due to a faster “mixing time” of the RSA algorithm under these conditions, but this remains a conjecture.
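The simulation loop can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code in the repository: priors are uniform, each lexicon entry is drawn as an independent Bernoulli variable with parameter `p`, a pair of hypotheses is only compared when some utterance is consistent with both, and all function names are hypothetical. A pair counts as stable if it has a strict order at the final iteration, and as stable from the first iteration if that same order holds at every iteration from 1 onward.

```python
import random

def normalize(vec):
    """Scale a non-negative vector to sum to 1 (all-zero vectors stay zero)."""
    total = sum(vec)
    return [v / total if total > 0 else 0.0 for v in vec]

def rsa_step(listener):
    """One RSA iteration with uniform priors: listener -> speaker -> listener."""
    n_u, n_w = len(listener), len(listener[0])
    speaker = [normalize([listener[u][w] for u in range(n_u)]) for w in range(n_w)]
    return [normalize([speaker[w][u] for w in range(n_w)]) for u in range(n_u)]

def sample_valid_lexicon(n, p, rng, max_tries=10000):
    """Rejection-sample an n x n boolean lexicon with unique rows and unique columns."""
    for _ in range(max_tries):
        lex = [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(n)]
        rows_unique = len({tuple(row) for row in lex}) == n
        cols_unique = len(set(zip(*lex))) == n
        if rows_unique and cols_unique:
            return lex
    raise RuntimeError("no valid lexicon found; try a different n or p")

def pair_sign(listener, lexicon, a, b, tol=1e-12):
    """+1 if a ranks above b, -1 if below, 0 if tied, None if they never co-occur."""
    for u, row in enumerate(lexicon):
        if row[a] and row[b]:
            # Any shared utterance suffices when a global ranking exists.
            diff = listener[u][a] - listener[u][b]
            if abs(diff) <= tol:
                return 0
            return 1 if diff > 0 else -1
    return None

def fraction_stable_from_first_iteration(lexicon, iterations=100):
    """Fraction of stable pairwise orders that are already fixed at iteration 1."""
    n_w = len(lexicon[0])
    pairs = [(a, b) for a in range(n_w) for b in range(a + 1, n_w)]
    listener = [normalize(row) for row in lexicon]  # literal listener, no ranking yet
    history = []  # history[i - 1][pair] is the order at RSA iteration i
    for _ in range(iterations):
        listener = rsa_step(listener)
        history.append({pair: pair_sign(listener, lexicon, *pair) for pair in pairs})
    stable = fixed_at_first = 0
    for pair in pairs:
        final = history[-1][pair]
        if final in (None, 0):
            continue  # no strict order by the last iteration
        stable += 1
        if all(step[pair] == final for step in history):
            fixed_at_first += 1
    return fixed_at_first / stable if stable else None
```

For example, `fraction_stable_from_first_iteration(sample_valid_lexicon(5, 0.5, random.Random(0)))` returns the measured fraction for a single sampled lexicon; averaging over many samples per setting would give the population-level curve.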
Takeaway This study may provide an alternative explanation as to why humans do not perform RSA for more than a few iterations (Franke & Degen, 2016). In addition to being computationally expensive, further iterations are also unnecessary, as the majority of top-k orderings become available at , and remain stable for all subsequent iterations of the RSA algorithm. In other words, . The code is in simulation/single_utter.
Appendix C Animals domain
In the Animals domain, a program is a pattern on a grid formed from a set of objects. These objects may be a colourless pebble, or a chicken or pig that may be red, green or blue. An utterance reveals one square on the grid, and the speaker has to communicate the pattern by choosing which square to reveal. The pattern is formed according to rules specified in the domain-specific language in Figure 10. Examples of programs are shown in Figure 11. The description of the domain-specific language and the examples are due to Vaduguru et al. (2022).
Appendix D Human study interface
The interface for the human study on regular expression programs is shown in Figure 12.
Figure 10: Grammar of the domain-specific language for the Animals domain.
Box(Left, Right, Top, Bottom, Thickness, Outside, Inside)
Left → 0 | 1 | 2 | 3 | ... | 6
Right → 0 | 1 | 2 | 3 | ... | 6
Top → 0 | 1 | 2 | 3 | ... | 6
Bottom → 0 | 1 | 2 | 3 | ... | 6
Thickness → 1 | 2 | 3
Outside → chicken | pig
Inside → chicken | pig | pebble
x | y | x + y
Appendix E Neural model
The neural scoring model maps a program to a real number. The program is input as a vector encoding the productions of the grammar that produce the program. That is, we construct a vector containing the index of the production used to expand each non-terminal in the DSL grammar. We then convert this vector to a one-hot matrix. There are 12 rules, with any single rule having at most 7 possible expansions, resulting in an input of dimension 12 × 7 = 84. The input is then passed through 3 hidden layers of size 128, each of which has a ReLU activation, and then mapped to a scalar output with a linear layer.
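The encoding and the forward pass of such a scorer can be sketched as below. This is an illustrative sketch in plain Python, not the trained model: the weight initialization (a He-style Gaussian) is an assumption, the example production indices are made up, and the names `one_hot_encode`, `init_mlp` and `score` are hypothetical.

```python
import random

NUM_RULES = 12        # non-terminals in the DSL grammar
MAX_EXPANSIONS = 7    # maximum number of productions for any single rule

def one_hot_encode(production_indices):
    """Flatten the 12 x 7 one-hot matrix of chosen productions into 84 values."""
    if len(production_indices) != NUM_RULES:
        raise ValueError("expected one production index per rule")
    vec = [0.0] * (NUM_RULES * MAX_EXPANSIONS)
    for rule, choice in enumerate(production_indices):
        if not 0 <= choice < MAX_EXPANSIONS:
            raise ValueError("production index out of range")
        vec[rule * MAX_EXPANSIONS + choice] = 1.0
    return vec

def init_mlp(rng, sizes=(84, 128, 128, 128, 1)):
    """Three hidden layers of width 128 followed by a linear scalar output."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # assumed initialization, not from the paper
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def score(layers, x):
    """Forward pass: ReLU after every hidden layer, none after the output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]

# Usage with made-up production indices for the 12 rules.
example_program = [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4]
example_score = score(init_mlp(random.Random(0)), one_hot_encode(example_program))
```

Rules with fewer than 7 expansions simply leave the unused positions of their 7-wide slot at zero.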
The model is trained on a dataset of rankings of the form . For each program , we sample a pair of programs from the inferred ranking and use this pair to compute the loss function for this sample. We train the model for a maximum of 20 epochs, where one epoch of training corresponds to presenting the model with every element in once. We train with a batch size of 32 using the Adam optimizer. We use a validation set generated similarly to (on a disjoint set of programs) to perform validation, choosing the model that results in the highest synthesis accuracy on this validation dataset with synthetically produced examples (from the speaker model).
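The per-sample computation can be sketched as follows. This appendix does not restate the loss, so the sketch assumes a Bradley-Terry style logistic loss on the score difference (Bradley & Terry, 1952), and it assumes that a ranking is stored as a list of programs ordered from most to least preferred. Both are assumptions, and the function names are hypothetical.

```python
import math
import random

def sample_pair(ranking, rng):
    """Pick two distinct positions; the earlier one holds the higher-ranked program."""
    i, j = sorted(rng.sample(range(len(ranking)), 2))
    return ranking[i], ranking[j]

def pairwise_ranking_loss(score_preferred, score_other):
    """Negative log-likelihood that the preferred program wins the comparison.

    Equals -log(sigmoid(margin)), computed in a numerically stable way.
    """
    margin = score_preferred - score_other
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Usage: one training sample from a toy ranking and a toy scoring function.
rng = random.Random(0)
toy_ranking = ["prog_a", "prog_b", "prog_c", "prog_d"]   # best first
toy_scores = {"prog_a": 2.0, "prog_b": 1.0, "prog_c": 0.5, "prog_d": -1.0}
better, worse = sample_pair(toy_ranking, rng)
loss = pairwise_ranking_loss(toy_scores[better], toy_scores[worse])
```

The loss is small when the model scores the higher-ranked program well above the lower-ranked one and grows roughly linearly when the order is reversed.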
We train an ensemble of 10 models. For each model, we normalize the scores to be of zero mean and unit variance based on the empirical mean and standard deviation computed on the validation set. We then average the scores for the 10 models at inference time.
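The normalization and averaging step can be sketched as below. This is a minimal sketch in which each model is any callable from a program to a real-valued score; the use of the population standard deviation and the names `make_normalizer` and `ensemble_score` are assumptions made for illustration.

```python
import statistics

def make_normalizer(validation_scores):
    """Return a function that standardizes scores using validation statistics."""
    mean = statistics.fmean(validation_scores)
    std = statistics.pstdev(validation_scores)  # assumed: population std

    def normalize_score(raw):
        # Guard against a constant model, whose scores carry no ranking signal.
        return (raw - mean) / std if std > 0 else 0.0

    return normalize_score

def ensemble_score(program, models, normalizers):
    """Average the standardized scores of all ensemble members for one program."""
    return statistics.fmean(
        [norm(model(program)) for model, norm in zip(models, normalizers)]
    )

# Usage with two toy "models" and a toy validation set of numeric programs.
toy_models = [lambda p: 2.0 * p, lambda p: p + 10.0]
validation_programs = [0.0, 1.0, 2.0, 3.0]
toy_normalizers = [
    make_normalizer([model(p) for p in validation_programs]) for model in toy_models
]
combined = ensemble_score(3.0, toy_models, toy_normalizers)
```

Standardizing before averaging keeps a model whose raw scores happen to have a larger scale from dominating the ensemble.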