
A Comparative Study of DSL Code Generation: Fine-Tuning vs. Optimized Retrieval Augmentation

Nastaran Bassamzadeh  Chhaya Methani
Microsoft Corporation
Redmond, USA
Abstract.

Natural Language to Code Generation has made significant progress in recent years with the advent of Large Language Models (LLMs). While generation for general-purpose languages like C, C++, and Python has improved significantly, LLMs struggle with custom function names in Domain Specific Languages (DSLs). This leads to higher hallucination rates and syntax errors, especially for DSLs with a high number of custom function names. Additionally, constant updates to function names add to the challenge, as LLMs need to stay up to date. In this paper, we present optimizations for using Retrieval Augmented Generation (RAG) with LLMs for DSL generation, along with an ablation study comparing these strategies. We generated a training as well as a test dataset with a DSL to represent automation tasks across roughly 700 APIs in the public domain. We used the training dataset to fine-tune a Codex model for this DSL. Our results showed that the fine-tuned model scored the best on the code similarity metric. With our RAG optimizations, we achieved parity on the similarity metric. The compilation rate, however, showed that both models still got the syntax wrong many times, with the RAG-based method being 2 pts better. Conversely, the hallucination rate for the RAG model lagged by 1 pt for API names and by 2 pts for API parameter keys. We conclude that an optimized RAG model can match the quality of fine-tuned models and offer advantages for new, unseen APIs.

Code Generation, NL2CodeGen, DSL, NL2DSL, RAG, Fine-Tuning

1. Introduction

There has been significant progress made in improving and quantifying the quality of Natural Language to Code Generation or NL2Code ((Chen et al., 2021), (Nguyen and Nadi, 2022), (Feng et al., 2020), (Chen et al., 2021)). Recent improvements in models for general-purpose languages like Python, C++ and Java can be attributed to larger LLMs ((OpenAI, 2022), (OpenAI, 2023a)) and the availability of pre-trained open-source models ((Feng et al., 2020), (Meta, 2023), (Abdin et al., 2024), (Mistral AI, 2024)) advancing the state-of-the-art. However, there hasn’t been a focus on improving quality of Natural Language to Domain Specific Languages or NL2DSL which a lot of enterprise applications rely on.

Domain Specific Languages (or DSLs) are custom Computer Languages designed and optimized for specific applications. Examples of DSLs include SQL and industry-specific languages for formalizing API calls, often using formats like JSON or YAML to represent API sequences. In this paper, we focus on the task of generating a DSL used for authoring high-level automation workflows across thousands of web-scale APIs. These workflows support a variety of customer scenarios like invoice processing, sales lead integration with forms/emails etc. The automation DSL represents API names as functions and codifies a sequence of API calls along with conditional logic over the invocation of APIs. We constrained the length of sequence to 5 APIs and hope to explore longer sequences in future work. An example of the DSL is shown in Figure 1.

Existing code generation methods are hard to adapt for this scenario due to frequent hallucinations and syntax errors. This is largely due to the custom names, massive size and diversity of APIs in the public as well as private domain, along with the ever-changing API landscape. Current NL2Code methods mainly use fine-tuning and do not focus on strategies for grounding LLMs to include new APIs.

In this paper, we outline an end to end system architecture for NL2DSL generation with high response rate using selective improvements to RAG techniques ((Liu et al., 2023), (Poesia et al., 2022)) using OpenAI models. We fine-tuned a Codex model for NL2DSL and show a comparative analysis of the impact of the approaches used to optimize RAG.

Along with metaprompt tuning for RAG, we also included additional grounding context in the form of API Function Definitions, like the approach used for Tool Selection ((Schick et al., 2023),(Shen et al., 2023), (Liang et al., 2023), (Patil et al., 2023)). This is motivated by the similarities between the code generation and task orchestration scenarios discussed in more detail in Section 2.

The remainder of this study is structured as follows. In Section 2, we present the NL2DSL problem formulation along with literature review. The focus is on comparing differences between Tool Selection of APIs as a framework compared to Code Generation over a set of APIs. This will help define the scope of the experiments in this study. Section 3 lays out and describes the optimizations we made to RAG as discussed above along with the benchmark Fine-Tuned model. Section 4 discusses Data Generation, Metric definition and Section 5 shares our results and discussion followed by Conclusion and Future Work in Section 6.

2. Related Work

2.1. Code Generation or Program Synthesis

Program Synthesis is a hard research problem ((Jain et al., 2021), (Devlin et al., 2017), (Feng et al., 2020), (Li et al., 2022), (Xu et al., 2021)). It has gained significant interest with many open-source models focusing on general programming languages since the release of GitHub Copilot ((Chen et al., 2021)). These models include Code Llama (Meta, 2023), StarCoder (Li et al., 2023), Codestral (Mistral AI, 2024), Phi-3 (Abdin et al., 2024) and more. Many of these advancements have been achieved through pre-training language models for code generation with a focus on improving datasets ((Meta, 2023), (Abdin et al., 2024)). However, for domain adaptation, instruction fine-tuning on top of a base model remains a popular approach ((Chen et al., 2021), (Gao et al., 2023), (Lewkowycz et al., 2022), (Patil et al., 2023)).

Prompting LLMs is an alternative technique for code generation ((Liu et al., 2023), (White et al., 2023), (Wei et al., 2023), (Kojima et al., 2023)). Poesia et al. ((Poesia et al., 2022)) focused on improving response quality through grounding techniques. They fine-tuned a Sentence BERT model by changing the loss function to incorporate predicting similarity of the generated target programs. With this adapted similarity metric, better few shots are selected dynamically.

2.2. Reasoning and Tool Integration

When it comes to modeling the problem of selecting a sequence of API calls, we need to consider formulating it as a planning or reasoning task. LLMs show remarkable reasoning capability, however, they also have limitations when it comes to staying up-to-date with recent knowledge, performing mathematical calculations etc. A popular way to overcome this has been granting the LLMs access to external tools. This framework gained significant popularity with OpenAI Code Interpreter’s success ((OpenAI, 2023b)).

External Tool Integration has been studied since, with a focus on including specific tools such as web search ((Schick et al., 2023)), Python code interpreters ((Gao et al., 2023), (OpenAI, 2023b)), calculators ((Parisi et al., 2022), (Gao et al., 2023)) and so on. Expanding the tool set to a generic list of tools has been explored ((Schick et al., 2023), (Patil et al., 2023)), but it remains limited and often predicts single tools instead of the sequences needed for most enterprise scenarios. Tool Use has mostly been explored in the context of generating more accurate text outputs for Q&A tasks with the help of external tools ((Schick et al., 2023), (Parisi et al., 2022)).

There is an increasing focus on incorporating LLMs' code generation capabilities into reasoning and task orchestration, and this is an area of active research ((Gao et al., 2023), (Liang et al., 2023), (Patil et al., 2023)). However, most of the research either limits the tools to a small set of well-documented APIs ((Gao et al., 2023), (Liang et al., 2023)) or limits its scope to predicting a single output API ((Patil et al., 2023), (Schick et al., 2023)).

Posing the reasoning or orchestration task as a code generation problem is similar to the API sequence generation scenario highlighted in this paper. Representing a plan as a DSL, as discussed in Section 1, aligns with our goal of generating DSL for workflow automation. Improving the quality of Natural Language to DSL generation, is thus beneficial for both reasoning and plan generation.

2.3. Contributions

In the previous section, we discussed formulating Task Orchestration as a Code Generation problem since it can be represented as yet another DSL. NL2DSL generation suffers from the hallucination and quality issues we discussed in Section 1. Few studies address the challenges of end-to-end DSL generation, specifically over a large set of custom APIs.

This paper presents improvements to known RAG techniques, focusing on improving DSL generation quality for enterprise settings. Our DSL expands API or tool selection to sequences of 5-6 API calls, also referred to as a chain of tools, which is a first to the best of our knowledge. We also consider the real-world scenario of adding conditional logic to API calls, as shown with an example in Figure 1. Our contribution is outlining an end-to-end system as well as presenting an ablation study for NL2DSL generation.

We merged prompting and grounding approaches from code generation ((Poesia et al., 2022),(Liu et al., 2023),(White et al., 2023)) and added API metadata as used in task orchestration area ((Gao et al., 2023), (Liang et al., 2023)) and studied their impact on reducing hallucination rate. We created a test set having 1000 NL-DSL pairs spanning over a set of approx. 700 API calls or functions using principles of synthetic dataset generation (similar to (Honovich et al., 2022) and (Schick and Schütze, 2021)) and used manual approval to validate test set quality. Our fine-tuned DSL model is trained on a larger synthetic NL-DSL dataset (details in Section 3.1).

Figure 1. System architecture showing the end-to-end working of our DSL generation methodology using RAG. TST-based semantic mapping retrieves the relevant code snippet as shown. This helps get the right syntax. However, it gets the correct function name for approval from the API metadata.

System Architecture along with NL and DSL pairs from our dataset. Note, function names are indicative to show API functionality for this illustration.

3. Methodology

In this section, we first provide an overview of the approaches used in our experiments. In the following sub-sections, we delve deeper into the details of each approach.

Details of fine-tuning are shared in Section 3.1. Fine-tuning a base model, specifically instruction fine-tuning, is a preferred approach for domain adaptation. Its limitations include the inability to include newly added APIs on an ongoing basis, as well as the resource-intensive data collection process for infrequently used APIs, or the tail set.

We used RAG based approaches to overcome these limitations, and focused on improving grounding techniques for DSL generation (Details in Section 3.2). We used dynamically generated few-shot examples approach ((Poesia et al., 2022)), and augmented it with API function definitions similar to the way it is used for Tool Selection ((Patil et al., 2023), (Shen et al., 2023)). These few-shots were selected from an expansive pool of synthetic NL-DSL pairs, empirically having 100s of variations of usage for each API (Section 4.1).

For computing semantic similarity of the few-shots with the input user query (Section 3.2), we fine-tuned a BERT model as highlighted in (Poesia et al., 2022) with a modified loss function for predicting target DSL similarity. For selecting the API metadata for grounding (Section 3.3), we created an index over API Function Definitions. We also tried metaprompt tuning, but limit the focus of this study to improving grounding techniques with a combination of dynamically selected few-shot samples as well as API metadata or tool description.

We share the details of each approach and variation below.

3.1. Fine-Tuned NL2DSL Generation Model

We took the Codex base model from OpenAI due to its pre-training with code samples and used a LoRA-based fine-tuning approach. The training set consists of NL-DSL pairs, where NL refers to the user query and the DSL represents the workflow that the user is looking to automate. We used <START> and <END> tokens to delimit the code for the model, with <END> indicating the end of code generation. The training set consists of a pool of 67k samples in the form of (prompt, flow) tuples generated synthetically (details in Section 4.1; examples of NL-DSL pairs are shared in Figure 1 and Appendix A).

We ran many iterations on this model to improve performance on the test set, specifically for the body and tail connectors, and went through multiple rounds of data augmentation. We found that predicting the parameter keys was very challenging with the fine-tuned model due to limitation of data generation. Even with synthetic models, it was hard to scale the NL-DSL sample variety needed for improving quality of parameters.

3.2. Grounding with dynamically selected few-shots

We tried two types of grounding information for RAG based DSL generation as described below. There are some variations of each technique described in the paragraphs below as well. For each technique, we selected 5 and 20 shots dynamically, and saw performance impact driven by the approach used for sample selection.

3.2.1. Pre-Trained Model

The first approach uses a vanilla Pre-Trained model for determining the semantic similarity of NL-DSL samples based on the NL query. We computed the embeddings of NL queries using a Distil-RoBERTa Pre-Trained model. We created a Faiss index ((Douze et al., 2024)) for these embeddings to help with search over the dense embedding space.
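The retrieval step can be illustrated with a minimal sketch. It is pure Python and uses a brute-force cosine-similarity search over toy vectors as a stand-in for the Faiss index; the embedding values, the sample pool and the function names (`cosine`, `top_k_few_shots`) are assumptions made for illustration and not the system's actual code. In the setup described above, the vectors would come from the Distil-RoBERTa model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_few_shots(query_vec, pool, k):
    """Return the k pool entries whose NL embedding is closest to the query.

    pool is a list of (embedding, nl_prompt, dsl_flow) tuples.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return ranked[:k]

# Toy 3-dimensional embeddings and flow labels (hypothetical values).
pool = [
    ([1.0, 0.0, 0.0], "Email me when a form is submitted", "flow_a"),
    ([0.0, 1.0, 0.0], "Save invoice attachments to storage", "flow_b"),
    ([0.9, 0.1, 0.0], "Notify me on new form responses", "flow_c"),
]
shots = top_k_few_shots([1.0, 0.05, 0.0], pool, k=2)
```

A dedicated index matters once the pool grows to tens of thousands of samples; the brute-force scan here only shows the ranking logic.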

3.2.2. TST based BERT Fine-tuning

In this approach, we fine-tuned the pre-trained model to improve the retrieval accuracy of the few-shots. This is similar to the approach used by Poesia et al. in (Poesia et al., 2022). They show that if we fine-tune the pre-trained BERT model with a modified loss function to consider the similarity between the target DSL for each NL-DSL pair, the retrieved examples will have a higher quality and finally lead to better generation with LLM.

To get positive and negative samples for fine-tuning, we compared cosine similarity between all pairs of Natural Language queries in our dataset (dataset described in Section 4.1). We used a Pre-Trained Transformer model to generate embeddings for the purpose of similarity computation. A pair of tuples is considered a positive sample if the similarity between their corresponding NL prompts is greater than 0.7, and negative otherwise. We generated 100k pairs this way and leveraged them as training data for our fine-tuning experiment.
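The pair-labeling rule can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are toy 2-dimensional vectors rather than Transformer outputs, and `label_pairs` is a hypothetical helper name. Only the 0.7 threshold comes from the text.

```python
import math
from itertools import combinations

THRESHOLD = 0.7  # similarity cut-off stated in the text

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def label_pairs(embeddings, threshold=THRESHOLD):
    """Label every unordered pair of samples as positive (1) or negative (0)."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        similarity = cosine(embeddings[i], embeddings[j])
        pairs.append((i, j, 1 if similarity > threshold else 0))
    return pairs

# Toy NL-query embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
pairs = label_pairs(embeddings)
```

Comparing all pairs is quadratic in the dataset size, so in practice the pairs would likely be sampled rather than enumerated exhaustively.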

The loss function used by TST (Equation 1 from (Poesia et al., 2022)) minimizes the Mean-Squared Error between the vanilla loss functions comparing the utterances ($u_i, u_j$) and the target programs ($p_i, p_j$). Program similarity is denoted by $S$. They used AST to compute program similarity; however, we used a Jaccard score over lists of API function names to be consistent with our metrics definition (Section 4).

(1) $L_{TST}(\theta) := \mathbb{E}_{i,j \sim D}\left[f_{\theta}(u_i, u_j) - S(p_i, p_j)\right]^2$
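To make the objective concrete, the sketch below computes a Jaccard score between two flows and the resulting mean squared error for a small batch. It is an illustration only: the API names are hypothetical, the lists of API names are treated as sets, and the predicted similarities stand in for the output of $f_{\theta}(u_i, u_j)$, which in practice would come from the embedding model being fine-tuned.

```python
def jaccard(apis_a, apis_b):
    """Jaccard score between the sets of API function names of two flows."""
    set_a, set_b = set(apis_a), set(apis_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty flows are treated as identical
    return len(set_a & set_b) / len(union)

def tst_loss(batch):
    """Mean squared error between predicted utterance similarity and program similarity.

    batch is a list of (predicted_similarity, apis_i, apis_j) tuples, where
    predicted_similarity plays the role of f_theta(u_i, u_j).
    """
    errors = [(pred - jaccard(a, b)) ** 2 for pred, a, b in batch]
    return sum(errors) / len(errors)

# Hypothetical API names, for illustration only.
batch = [
    (0.5, ["GetForm", "SendEmail"], ["GetForm", "PostMessage"]),  # S = 1/3
    (1.0, ["SendEmail"], ["SendEmail"]),                          # S = 1
]
loss = tst_loss(batch)
```

Using API-name overlap as the target ties the retriever to the same notion of flow similarity that the evaluation metrics reward.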

3.3. Grounding with API Metadata

In addition to few-shots, we appended the API metadata in the metaprompt. This metadata includes Function Description along with the parameter keys and their description (See an example API Function Definition shared in Appendix A). We followed the below two approaches for selecting the metadata to be added.

3.3.1. API Function Definitions for Few Shots

For the few-shot samples selected using the methods described above, we extracted the metadata for each of the functions present in those samples. This means that for the $n$ few-shot samples dynamically added to the metaprompt, we iterated over all the API function names in each of these flows and added their function definitions to the metaprompt.
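A minimal sketch of this collection step is shown below. The catalog contents, the API names and the helper `collect_function_definitions` are hypothetical; the sketch also assumes that duplicate definitions are dropped and that APIs missing from the catalog are skipped, neither of which is specified in the text.

```python
def collect_function_definitions(few_shots, api_catalog):
    """Gather definitions for every API used in the few-shot flows, without duplicates.

    few_shots: list of (nl_prompt, api_names) pairs.
    api_catalog: dict mapping API name -> definition text.
    """
    seen = set()
    definitions = []
    for _nl_prompt, api_names in few_shots:
        for name in api_names:
            # Skip repeats and names with no metadata entry (assumption).
            if name in seen or name not in api_catalog:
                continue
            seen.add(name)
            definitions.append(f"{name}: {api_catalog[name]}")
    return definitions

# Hypothetical catalog and few-shot samples.
catalog = {
    "GetForm": "Reads a form response.",
    "SendEmail": "Sends an email.",
}
few_shots = [
    ("Email me each form response", ["GetForm", "SendEmail"]),
    ("Send a reminder email", ["SendEmail", "UnknownApi"]),
]
definitions = collect_function_definitions(few_shots, catalog)
```

The resulting list can then be appended to the metaprompt after the few-shot samples.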

We also modified the metaprompt to add instructions on how to use the Function Definitions. We want to explore how adding metadata that explains the purpose of each function in the few-shot examples affects the LLM's understanding of the task and its mapping to the user request.

3.3.2. Semantic Function Definitions

Another approach for selecting the function definitions to be added to the metaprompt is to retrieve the semantically similar functions from a vector database created with API metadata. This approach is similar to the one followed by LlamaIndex ((LlamaIndex, 2023)). We created an index of all API definitions and retrieved the semantically similar functions by using the input NL query to search the index. Please note that this is different from the Faiss index created for few-shot samples in Section 3.2.

We call this approach Semantic Function Definition (SFD) and will compare it with the Regular FDs described above. This approach can be specifically useful for tail-ish prompts where no few-shots might be retrieved. This helps us integrate the newly released web APIs in our DSL Generation framework making our approach scalable to the changing API landscape.

4. Experiment Design and Metrics Definition

In this section, we outline the process of Dataset Generation and introduce the metrics we used for estimating the code quality. We then describe the experiments. Results and Discussion follows in the next section. We have used Azure AML pipelines to run our experiments. The GPT-4 (with 16k token limit) model is used as the LLM model. The metaprompt is kept consistent between experiments for the purpose of the ablation study.

4.1. Dataset Generation

We generated a total of 67k samples in the form of (prompt, flow) pairs from workflows created by users. We had many samples of workflow automations created by users across a large set of APIs. We sampled the automations containing 700 publicly available APIs and synthetically generated the corresponding Natural Language prompts using GPT-4. For creating these NL descriptions for the workflows, we also provided API Function Definitions as metadata. This ensured the language of the description captured the functionality of the API.

A subset of these synthetic samples was validated by human judges. We used these checks to improve the metaprompt used for synthetic data generation. For creating a test set, we used the same process with most of the test set evaluated by human judges to ensure quality. We followed the same distribution of APIs from users, to ensure that our metrics are not biased. The test data set consists of 1000 samples that are verified by human judges.

4.2. DSL Generation Quality Metrics

We defined 3 key metrics to focus on code generation quality as well as syntactic accuracy and hallucination rate. We have a compiler to test the syntax and validate the functions against a database of API names as well as parameter keys.

4.2.1. Average Similarity

Average Similarity measures the aggregated similarity between the predicted flow and the ground truth flow. The similarity between two flows is defined using the Longest Common Subsequence match (LCSS) metric. Each flow is reduced to a list of API call sequences and then the LCSS is computed. The final metric is reported as an average over all test samples. Hallucination and parser failures lead to the sample being discarded, and the sample is assigned a similarity score of 0.

(2) $\mathrm{Similarity} = \frac{\mathrm{LCSS}(A, B)}{\max(|\mathrm{Actions}_{A}|, |\mathrm{Actions}_{B}|)}$

where $|\mathrm{Actions}_{A}|$ is the number of actions in flow $A$ and $|\mathrm{Actions}_{B}|$ is the number of actions in flow $B$.
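Equation 2 can be sketched in a few lines of Python. The example below uses a standard dynamic-programming longest common subsequence over lists of API names; the API names are hypothetical, and representing a discarded sample as `None` and scoring two empty flows as 0 are assumptions made for the illustration.

```python
def lcss_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]

def flow_similarity(predicted, truth):
    """LCSS length normalized by the longer flow, as in Equation 2."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 0.0  # assumption: two empty flows score 0
    return lcss_length(predicted, truth) / longest

def average_similarity(pairs):
    """Average over (predicted, truth) pairs; a predicted flow of None
    marks a discarded sample and contributes a score of 0."""
    scores = [0.0 if pred is None else flow_similarity(pred, truth)
              for pred, truth in pairs]
    return sum(scores) / len(scores)

# Hypothetical API names, for illustration only.
predicted = ["OnNewEmail", "GetAttachment", "SaveFile"]
truth = ["OnNewEmail", "ExtractText", "SaveFile", "PostMessage"]
score = flow_similarity(predicted, truth)
average = average_similarity([(predicted, truth), (None, truth)])
```

Here the common subsequence is `OnNewEmail, SaveFile`, so the similarity is 2/4, and the discarded second sample halves the average.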

Please note that we are not using the commonly used AST metric for computing code similarity. AST drills down to compare similarity performance for parameters as well. As we wanted to focus on the problem of improving function name retrieval as well as its sequence, we chose to define the metric in this manner.

Table 1. Impact of selecting 5 vs. 20 few-shot samples for both the TST and Pre-trained models, without adding API function definitions, using GPT-4. All results are shown as $\Delta$ improvements compared to the baseline. The baseline uses the Pre-Trained Transformer Model with 5 few-shot samples. For Avg. Similarity, higher is better; for the remaining metrics, which capture failure rates, lower is better.

| Model | Num of Few-Shots | Avg. Similarity | %non-parsed flows | %made-up API names | %made-up parameters |
|---|---|---|---|---|---|
| Pre-trained Model wo FD | 20 | **+0.03** | **-3.37** | -7.34 | **-15.17** |
| TST wo FD | 5 | +0.02 | -0.61 | -3.53 | -1.04 |
| TST wo FD | 20 | **+0.03** | -2.85 | **-8.49** | -14.58 |

Table 2. Impact of selecting 5 few-shot samples using TST vs. the Pre-trained Model, with and without API Function Definitions, using the GPT-4 model. All results are shown as $\Delta$ improvements compared to the baseline. The baseline uses the Pre-Trained Transformer Model without API Function Definitions. For Avg. Similarity, higher is better; for the remaining metrics, which capture failure rates, lower is better.

| Model | Avg. Similarity | %Unparsed flows | %made-up API names | %made-up API parameters |
|---|---|---|---|---|
| Pre-trained Model + FD | 0 | +2.75 | -4.3 | **-20.16** |
| TST wo FD | **+0.02** | **-0.61** | -3.53 | -1.04 |
| TST + FD | **+0.02** | +0.68 | **-6.29** | -19.99 |

Table 3. Impact of selecting 20 few-shot samples using TST vs. the Pre-trained Model, with and without function definitions, using the GPT-4 model. All results are shown as $\Delta$ improvements compared to the baseline. The baseline uses the Pre-Trained Transformer Model without API Function Definitions. For Avg. Similarity, higher is better; for the remaining metrics, which capture failure rates, lower is better.

| Model | Avg. Similarity | %Unparsed flows | %made-up API names | %made-up API parameters |
|---|---|---|---|---|
| Pre-trained Model + FD | -0.01 | +2.29 | -2.17 | -6.93 |
| TST wo FD | 0 | **+0.52** | -1.15 | +0.52 |
| TST + FD | **+0.02** | +0.83 | **-2.7** | **-7.06** |

4.2.2. Unparsed rate

This metric captures the rate of syntactic errors. A flow that cannot be parsed by the parser is considered not usable for the purpose of similarity metric computation. The unparsed rate is computed as follows:

(3) $\%\,\text{unparsed flows} = \frac{|\mathrm{Flows}_{\mathrm{unparsed}}|}{|\mathrm{Flows}_{\mathrm{total}}|}$

where $|\mathrm{Flows}_{\mathrm{unparsed}}|$ is the number of flows that were not parsed and $|\mathrm{Flows}_{\mathrm{total}}|$ is the total number of flows in the sample set.

4.2.3. Hallucination rate

This metric captures the rate of made-up APIs (or function names) and made-up parameter keys in the generated code. Predicting a flow with a hallucinated API name is counted as a failure and leads to the code being considered invalid.

We compute this by counting the number of flows that have at least one hallucinated function name and divide it by the total number of flows in the sample set.

(4) $\%\,\text{made-up APIs} = \frac{|\mathrm{Flows}_{h}|}{|\mathrm{Flows}_{\mathrm{parsed}}|} \times 100$

where $|\mathrm{Flows}_{h}|$ is the number of flows with hallucinated API names and $|\mathrm{Flows}_{\mathrm{parsed}}|$ is the number of flows that were parsed correctly.

Similarly, we compute the rate at which parameters were not parsed. Failure to parse parameters does not result in the flow being discounted from average similarity computation. However, it shows up as run-time errors. Fixing these run-time errors is beyond the scope of this paper.

(5) $\%\,\text{made-up parameters} = \frac{|\mathrm{Flows}_{hp}|}{|\mathrm{Flows}_{\mathrm{parsed}}|} \times 100$

where $|\mathrm{Flows}_{hp}|$ is the number of flows with hallucinated parameter key names and $|\mathrm{Flows}_{\mathrm{parsed}}|$ is the number of flows that were parsed correctly.
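The three failure-rate metrics in Equations 3 to 5 can be computed together, as in the sketch below. The record layout (a dict of boolean flags per generated flow) and the function name `failure_rates` are assumptions made for illustration; the unparsed rate is also scaled to a percentage here so that all three values share the same unit.

```python
def failure_rates(results):
    """Compute the unparsed and hallucination rates as percentages.

    results: list of dicts with boolean keys 'parsed', 'made_up_api'
    and 'made_up_param' (hypothetical record layout).
    """
    total = len(results)
    parsed = [r for r in results if r["parsed"]]
    # Equation 3: unparsed flows over all flows (scaled to a percentage).
    unparsed_pct = 100.0 * (total - len(parsed)) / total
    if not parsed:
        return unparsed_pct, 0.0, 0.0  # assumption: no parsed flows, no rate
    # Equations 4 and 5: flows with at least one hallucination over parsed flows.
    api_pct = 100.0 * sum(r["made_up_api"] for r in parsed) / len(parsed)
    param_pct = 100.0 * sum(r["made_up_param"] for r in parsed) / len(parsed)
    return unparsed_pct, api_pct, param_pct

# Toy evaluation records (hypothetical values).
results = [
    {"parsed": False, "made_up_api": False, "made_up_param": False},
    {"parsed": True, "made_up_api": True, "made_up_param": True},
    {"parsed": True, "made_up_api": False, "made_up_param": True},
    {"parsed": True, "made_up_api": False, "made_up_param": False},
]
unparsed_pct, api_pct, param_pct = failure_rates(results)
```

Because the hallucination rates are normalized by parsed flows only, a run with many unparsed flows can show a low hallucination rate while still failing often overall, so the metrics are best read together.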

5. Results

In this section, we present the results of the above approaches on a test set of 1000 NL-DSL pairs. These samples, while generated synthetically, have been evaluated by human judges for quality. They were also sampled to represent the distribution of APIs in actual product usage.

We compare the impact of each ablation in sections below.

5.1. Impact of number of few-shots on RAG performance

We compare the impact of the number of code samples added to the metaprompt with two different settings, i.e. 5 few-shots vs. 20 few-shots. We measured the results for both the Pre-Trained model and the TST model. Results are shared in Table 1 and show the $\Delta$ change compared to the baseline model. The baseline setting here is the Pre-Trained model with 5 few-shots.

Looking at row 1 and comparing rows 2 and 3 with respect to the baseline, we can see that adding more few-shots improves the performance of both the Pre-Trained and the TST model on all metrics. The gain is particularly pronounced for reducing the number of made-up API names as well as the number of made-up API parameter keys. We saw the gain plateau beyond this, and we intend to run more experiments in the future to study this effect better.

Table 4. Impact of adding API or tool related metadata on performance (with the GPT-4 model and 20 few-shots). FD refers to including only metadata for APIs present in the few-shots. SFD refers to extracting APIs similar to the input query (refer to Section 3 for details). The baseline uses the fine-tuned Codex model. For Avg. Similarity, higher is better; for the remaining metrics, which capture failure rates, lower is better.

| Model | Avg. Similarity | %Unparsed flows | %made-up API names | %made-up API parameters |
|---|---|---|---|---|
| TST + FD | **0** | **-5.3** | +1.7 | **+1.11** |
| TST + SFD | -0.01 | -1.43 | +1.21 | +6.76 |
| TST + FD + SFD | **0** | -2.74 | **+0.94** | +2.03 |

5.2. TST vs Pre-trained Model

Comparing the rows in Table 1, both Pre-Trained and TST with 20 samples look comparable on Average Similarity but have slight variations in the Unparsed flow rate as well as Hallucination rates. The TST model performs better in reducing the % made-up API names, while the Pre-trained model has a slight edge in the other two metrics.

So, we additionally look at the impact of including API Function Definitions for both models (see Table 2). Here, we have used the GPT-4 model with 5 few-shots. The results are represented as $\Delta$ changes compared to the baseline setting, i.e. using the Pre-Trained model to choose 5 few-shot NL-DSL code samples. The TST with FD setting performs better overall than all other options, with values close to the best in every metric.

We see a similar trend in Table 3, where we captured the results for 20 few-shots. This leads us to conclude that the presence of few-shot examples is supported by adding the API function definitions of these functions (as described in Section 3). The addition predominantly helps reduce the hallucination rate for API names and parameters, which improves the overall response rate of NL2DSL generation.

5.3. Function Definition vs Semantic Function Definitions

As the next step, we compare the impact of Semantic Function Definitions (SFD) vs. adding the API Function Definitions for selected examples only. We used a Fine-Tuned model as the baseline for this experiment. Based on the insights from the previous step, we used 20 few-shots for TST along with including FDs. All results in Table 4 are shown as $\Delta$ improvements compared to the baseline.

Looking at the columns for % made-up API names and % made-up parameter keys, we see that the hallucination rate generally increases for the RAG-based approach. However, we need to keep in mind that a model fine-tuned on the function names is hard to beat, as it has been trained on 67,000 samples compared to only 20 few-shots that have been added to the RAG model.

Within the RAG approaches, comparing rows 1 and 2 ("TST + FD" vs. "TST + SFD"), SFD in general results in a slight drop in average similarity and an increase in the Unparsed rate as well as the hallucination rate for parameter keys. This indicates that simply adding semantically similar API metadata for a query is not useful for DSL generation. We get better similarity, as well as a reduced Hallucination Rate, when we include the API Function Definitions for the samples selected by TST (as shown in row 1).

The addition of semantically matching Function Definitions tends to reduce the hallucination of API names, indicating that it has the potential to add FDs that are not part of the code sample set. This could have implications for improving performance on newly added APIs in the public cloud, which would help keep the system up to date. We will explore this topic in a future study.

6. Conclusion

Concluding from the ablation study in Section 5, we see that dynamically selected few-shot samples play a very important role in making RAG useful for syntactically correct DSL generation as well as in improving code similarity (Table 4).

Fine-Tuning still outperforms the RAG-based model in terms of lower hallucinations (see Table 4, where the fine-tuned model is the baseline). However, parsing errors are more common in the fine-tuned model. This could be because the few-shot examples successfully teach the correct syntax to the LLM. It is nevertheless surprising that the syntax correctness of RAG is better than that of the fine-tuned model, which was trained on a much larger sample set.

It is also interesting to note that this benefit does not transfer to hallucinated API names and their parameter keys, where the fine-tuned model holds the advantage. The increase of 6.76 pts in the hallucination rate for parameters due to adding Semantic Function Definitions indicates that adding too many API descriptions can confuse rather than help the LLM (Table 4). It also signifies the higher impact of few-shot samples for DSL generation or API selection compared to simply providing the API description. This learning can inform the Tool Selection or orchestration scenario, where providing high-quality examples of sample orchestration is likely to reduce the failure rate more than adding API descriptions alone.

Overall, we were able to significantly improve the performance of RAG for DSL generation, with the hallucination rate for API names dropping by 6.29 pts and that for parameter keys dropping by approximately 20 pts (see Table 2). The performance of RAG is now comparable to that of the fine-tuned model (see Avg. Similarity in Table 4), with the potential to bootstrap quickly. Optimized RAG can also allow extending the benefits of metaprompt tuning to include unseen APIs, reducing the need to fine-tune the model frequently. This will be the focus of our future work.

References

  • Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav, Fan Yang, Jianwei Yang, Ziyi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, and Xiren Zhou. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374
  • Devlin et al. (2017) Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel rahman Mohamed, and Pushmeet Kohli. 2017. RobustFill: Neural Program Learning under Noisy I/O. arXiv:1703.07469
  • Douze et al. (2024) Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. arXiv:2401.08281 [cs.LG] https://arxiv.org/abs/2401.08281
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
  • Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided Language Models. arXiv:2211.10435
  • Honovich et al. (2022) Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022. Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. arXiv:2212.09689 [cs.CL]
  • Jain et al. (2021) Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. 2021. Jigsaw: Large Language Models meet Program Synthesis. arXiv:2112.02969 [cs.SE] https://arxiv.org/abs/2112.02969
  • Kojima et al. (2023) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916
  • Lewkowycz et al. (2022) Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. 2022. Solving Quantitative Reasoning Problems with Language Models. arXiv:2206.14858
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! arXiv:2305.06161
  • Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (Dec. 2022), 1092–1097. https://doi.org/10.1126/science.abq1158
  • Liang et al. (2023) Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. 2023. TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs. arXiv:2303.16434 [cs.AI]
  • Liu et al. (2023) Chao Liu, Xuanlin Bao, Hongyu Zhang, Neng Zhang, Haibo Hu, Xiaohong Zhang, and Meng Yan. 2023. Improving ChatGPT Prompt for Code Generation. arXiv:2305.08360 [cs.SE] https://arxiv.org/abs/2305.08360
  • LlamaIndex (2023) LlamaIndex 2023. LlamaIndex. https://llama.meta.com/docs/integration-guides/llamaindex/
  • Meta (2023) Meta 2023. Code Llama: Open Foundation Models for Code. https://doi.org/10.48550/arXiv.2308.12950
  • Mistral AI (2024) Mistral AI 2024. Codestral. https://mistral.ai/news/codestral/
  • Nguyen and Nadi (2022) Nhan Nguyen and Sarah Nadi. 2022. An Empirical Evaluation of GitHub Copilot’s Code Suggestions. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). 1–5. https://doi.org/10.1145/3524842.3528470
  • OpenAI (2022) OpenAI 2022. ChatGPT. https://openai.com/blog/chatgpt
  • OpenAI (2023a) OpenAI 2023a. GPT-4 Technical Report. https://cdn.openai.com/papers/gpt-4.pdf
  • OpenAI (2023b) OpenAI 2023b. OpenAI Code Interpreter. https://platform.openai.com/docs/assistants/tools/code-interpreter
  • Parisi et al. (2022) Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022. TALM: Tool Augmented Language Models. arXiv:2205.12255
  • Patil et al. (2023) Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large Language Model Connected with Massive APIs. arXiv:2305.15334 [cs.CL] https://arxiv.org/abs/2305.15334
  • Poesia et al. (2022) Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable code generation from pre-trained language models. arXiv:2201.11227 [cs.LG] https://arxiv.org/abs/2201.11227
  • Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761
  • Schick and Schütze (2021) Timo Schick and Hinrich Schütze. 2021. Generating Datasets with Pretrained Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 6943–6951. https://doi.org/10.18653/v1/2021.emnlp-main.555
  • Shen et al. (2023) Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv:2303.17580 [cs.CL] https://arxiv.org/abs/2303.17580
  • Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
  • White et al. (2023) Jules White, Sam Hays, Quchen Fu, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design. arXiv:2303.07839 [cs.SE] https://arxiv.org/abs/2303.07839
  • Xu et al. (2021) Frank F. Xu, Bogdan Vasilescu, and Graham Neubig. 2021. In-IDE Code Generation from Natural Language: Promise and Challenges. arXiv:2101.11149 [cs.SE]

Appendix A

A.1. Sample with computed Average similarity

Sample showing how flow similarity is computed for two flows Flow A and Flow B.


Query = "Post a message in the channel of teams,
when a new form is created in the forms"

Ground Truth = "triggerOutputs=
await shared_microsoftforms.CreateFormWebhook({});
outputs_shared_teams_PostMessageToConversation=
shared_teams.PostMessageToConversation(
{ \"poster\": \"User\"});"

Prediction = "triggerOutputs=
await shared_microsoftforms.CreateFormWebhook({});
outputs_Get_my_profile_V2 = shared_office365users.
MyProfile_V2({}); outputs_shared_teams_PostMessage
= shared_teams.PostMessageToConversation(
{\"poster\": \"User\",\"location\": \"Channel\"});"

API Functions list in ground_truth =
[shared_microsoftforms.CreateFormWebhook,
shared_teams.PostMessageToConversation]


API Functions list in model generation =
[shared_microsoftforms.CreateFormWebhook,
shared_office365users.MyProfile_V2,
shared_teams.PostMessageToConversation]

Similarity Score = 2/3 ≈ 0.667

since two of the three functions in the model generation
(shared_microsoftforms.CreateFormWebhook and
shared_teams.PostMessageToConversation) are also found
in the ground truth.
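The computation above can be reproduced with a short sketch. It assumes that the score is the length of the longest common subsequence of API names divided by the length of the longer list, which yields the 2/3 shown here; the metric definition earlier in the paper remains the authoritative one, and the function names below are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def flow_similarity(ground_truth, prediction):
    """Assumed definition: LCS length over the longer API list."""
    if not ground_truth and not prediction:
        return 1.0  # two empty flows are treated as identical
    return lcs_length(ground_truth, prediction) / max(len(ground_truth), len(prediction))


ground_truth = [
    "shared_microsoftforms.CreateFormWebhook",
    "shared_teams.PostMessageToConversation",
]
prediction = [
    "shared_microsoftforms.CreateFormWebhook",
    "shared_office365users.MyProfile_V2",
    "shared_teams.PostMessageToConversation",
]
score = flow_similarity(ground_truth, prediction)
```

Parameters are ignored in this sketch, so the extra "location" key in the prediction does not affect the score; only the extra MyProfile_V2 call does.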

A.2. An example of API metadata

We share a sample of API metadata to highlight the details included in the API description provided to the metaprompt.


"shared_outlook.SendEmailV2": {
    "FunctionName": "shared_outlook.SendEmailV2",
    "Description": "This operation sends an email message.",
    "IsInTrainingSet": false,
    "DisplayName": "Send an email (V2)",
        "ParametersInfo": [
            {
                "Key": "emailMessage/To",
                "Type": "String",
                "Summary": "To",
                "Format": "email",
                "Description": "Specify email addresses
                    separated by semicolons like
                    someone@contoso.com"
            }, ...
        ],
        "ResponseSchema": [],
        "IsTrigger": false
    }
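Metadata in this shape is also enough to flag made-up API names and parameter keys in a generated flow. The following is a rough sketch under stated assumptions: `find_hallucinations` and the `(api_name, keys)` call format are hypothetical, the extraction of calls from DSL text is not shown, and the names `shared_outlook.SendFax` and `emailMessage/Priority` are invented purely to trigger the checks. Since the example above is truncated, only the one parameter key it shows is treated as known.

```python
def find_hallucinations(calls, api_metadata):
    """Flag API names and parameter keys that are absent from the metadata.

    calls: list of (api_name, [parameter keys]) extracted from a generated flow.
    """
    made_up_apis, made_up_keys = [], []
    for name, keys in calls:
        entry = api_metadata.get(name)
        if entry is None:
            made_up_apis.append(name)
            continue  # keys of an unknown API cannot be checked
        known = {param["Key"] for param in entry["ParametersInfo"]}
        made_up_keys.extend((name, key) for key in keys if key not in known)
    return made_up_apis, made_up_keys


# Reduced copy of the metadata entry above (only the fields used here).
api_metadata = {
    "shared_outlook.SendEmailV2": {
        "FunctionName": "shared_outlook.SendEmailV2",
        "ParametersInfo": [{"Key": "emailMessage/To", "Type": "String"}],
    }
}
calls = [
    ("shared_outlook.SendEmailV2", ["emailMessage/To", "emailMessage/Priority"]),
    ("shared_outlook.SendFax", ["to"]),
]
bad_apis, bad_keys = find_hallucinations(calls, api_metadata)
```

This only identifies individual offending names; how such flags are aggregated into the reported hallucination percentages follows the metric definitions given earlier in the paper.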