research-article

Open access

Revealing the Unseen: AI Chain on LLMs for Predicting Implicit Dataflows to Generate Dataflow Graphs in Dynamically Typed Code

Authors:

Yong ChenAuthors Info & Claims

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 7

Article No.: 183, Pages 1 - 29

https://doi.org/10.1145/3672458

Published: 27 September 2024 Publication History

PDF eReader

Abstract

Dataflow graphs (DFGs) capture definitions (defs) and uses across program blocks, which is a fundamental program representation for program analysis, testing and maintenance. However, dynamically typed programming languages like Python present implicit dataflow issues that make it challenging to determine def-use flow information at compile time. Static analysis methods like Soot and WALA are inadequate for handling these issues, and manually enumerating comprehensive heuristic rules is impractical. Large pre-trained language models (LLMs) offer a potential solution, as they have powerful language understanding and pattern matching abilities, allowing them to predict implicit dataflow by analyzing code context and relationships between variables, functions, and statements in code. We propose leveraging LLMs’ in-context learning ability to learn implicit rules and patterns from code representation and contextual information to solve implicit dataflow problems. To further enhance the accuracy of LLMs, we design a five-step chain of thought (CoT) and break it down into an Artificial Intelligence (AI) chain, with each step corresponding to a separate AI unit to generate accurate DFGs for Python code. Our approach’s performance is thoroughly assessed, demonstrating the effectiveness of each AI unit in the AI Chain. Compared to static analysis, our method achieves 82% higher def coverage and 58% higher use coverage in DFG generation on implicit dataflow. We also prove the indispensability of each unit in the AI Chain. Overall, our approach offers a promising direction for building software engineering tools by utilizing foundation models, eliminating significant engineering and maintenance effort, but focusing on identifying problems for AI to solve.

1 Introduction

A dataflow graph (DFG), also known as a def-use graph, captures the flow of definitions (defs) and uses across basic blocks in a program [1]. It provides a more detailed view of a program’s dataflow than the control flow graph (CFG) [2] and offers significant benefits in program testing [3], analysis [4], variable tracking [5], and code maintenance [6–8]. However, dynamically typed programming languages like Python have many dynamic features that make it difficult to determine def-use flow information of variables at compile time, resulting in implicit dataflow (IDF) problems [9–11]. These dataflows are computed at runtime, passed implicitly to other parts of the program [12], and are difficult to track or detect in the source code. In Python, IDF problems often manifest in operations, such as variable assignment, comprehension evaluation, and function or method calls [13].

Variable assignment uses a reference semantics mechanism that assigns a reference to an object to a variable name. When multiple variables refer to the same mutable object, object sharing occurs, and modifying one variable affects other variables using the same object [14]. For instance, in Figure 1(a1), variables “a” and “b” share the same list object, and updating “a[3]” to 5 also affects “b.” The corresponding DFG is shown in Figure 1(a2). However, static analysis methods struggle to identify which variables share the same object during compilation, resulting in an IDF problem, namely the loss of “def: $\{b\}$ ” information in orange.

Fig. 1.

Comprehension is a concise syntax for generating new data structures like list comprehensions and generator expressions [12]. However, this syntax has a complex calculation process that relies on dynamic expression evaluation characteristics, making it difficult to track dataflow and analyze expression results. For instance, in Figure 1(b1), lines 2–4 show a list comprehension where “x” and “a” are implicitly defined and used at runtime. Such defs and uses become explicit in the corresponding non-idiomatic Python code shown in Figure 1(b3). The corresponding DFG is shown in Figure 1(b2). However, static analysis methods cannot track the information of “def: $\{x\}$ ” and “use: $\{a\}$ ” in orange for the list comprehension during compilation, resulting in an IDF problem, namely the loss of this information.

Parameter passing in function or method calls follows the call-by-reference mechanism, where a reference to the object is passed instead of the object itself. This approach can lead to callable object side effects [15] during execution, as callable objects can modify the reference objects passed to them and affect other parts of the program. For example, in Figure 1(c1), the “pop()” method is called on the variable “a,” and the resulting value is assigned to variable “result.” The corresponding DFG is illustrated in Figure 1(c2). However, static analysis methods are unable to determine the internal semantics of the called function or method during compilation, resulting in an IDF problem, namely the loss of “def: $\{a\}$ ” information in orange when analyzing dataflow.

Static analysis methods like Soot [16] and WALA [17] are inadequate in handling IDF problems, as they fail to capture and analyze def-use flow information of variables at compile time. These methods only consider static structural information [18] and ignore the dynamic information generated during runtime. As a result, static analysis cannot predict or understand runtime behavior, which prevents it from determining which dataflows are affected by dynamic expression evaluation and how these effects propagate and affect program behavior.

To address this issue, a potential solution is to gain a deep understanding of dynamic expression evaluation characteristics, such as short-circuiting [19] and lazy evaluation [20], and create heuristic rules [21, 22] based on code context to automatically infer variable def-use flow information. These rules can then be integrated into static analysis methods to improve their accuracy. However, manually enumerating all possible rules is impractical.

Fortunately, large pre-trained language models (LLMs) have powerful language understanding and pattern matching abilities [23–27]. By learning various language structures and programming styles from a large amount of programming language data, LLMs acquire a deep understanding and representation of the language [27–29], including contextual information, language patterns, and dynamic behavior. This allows them to search for key features in programs, infer relationships between variables, functions, and statements in code by understanding the context of the program, discover hidden patterns and regularities, and infer the IDF in the code. For example, LLMs can analyze variable assignments and function calls in code, determining which dataflows are affected, and how these effects propagate and affect program behavior, thereby predicting whether a variable’s value may be affected by modifications to other variables or whether a function call may modify the passed-in parameter objects.

In this article, we propose to leverage the in-context learning ability of LLMs to discover and solve IDF problems [24, 30, 31]. This method learns implicit rules and patterns from the code’s representation and contextual information, rather than relying solely on explicit syntax and semantic information. By leveraging code context to understand code semantics, LLMs, without actually executing the code, can better predict IDFs that would be generated during program execution, identify the causes and directions of IDFs in code, capture the def-use flow information of variables, and generate more complete and accurate DFGs.

However, leveraging LLMs to directly predict accurate IDFs is challenging due to LLM’s uncertainty, errors, and hallucination problems [32–34]. These issues can cause LLMs to miss def-use flow information of variables, especially in expressions with IDF problems. To address this issue, we design a chain of thought (CoT) [35, 36], which involves five steps.

—

Import Statements Completion, which assists LLMs in predicting the API semantics of the called function or method to eliminate callable object side effects.

—

Comprehension Transformation, which expands a comprehension structure to a series of simple algebraic operations and simplification steps to track IDF in the original comprehension structure.

—

Program Slicing, which captures the code related to each variable, including code that updates variable values due to object sharing.

—

Def-use Extraction, which obtains def-use flow information for each variable separately.

—

Def-use Flow Fusion, which combines the def-use flow information of all variables into a complete DFG.

This CoT still has limitations due to its use of a single prompt to implement all the step duties, which can lead to error accumulation and the creation of an “epic” prompt with too many step duties that are difficult to optimize and control [37]. To overcome these limitations, we adopt the principle of single responsibility from software engineering [38] and break down the CoT in to an Artificial Intelligence (AI) chain [39, 40], with each step corresponding to a separate AI unit. We develop an effective prompt for each AI unit which performs separate LLM calls. This AI chain can interact with LLMs step by step to generate DFG for Python code.

We conduct three experiments to evaluate the performance of our DFG generation approach. First, we verify the usefulness of each unit in our AI chain, achieving an accuracy rate of 98%, 84.1%, and 84.8% for Import Statements Completion, Comprehension Transformation, and Program Slicing, respectively. The def coverage and use coverage for Def-Use Extraction and Def-use Flow Fusion are found to be higher than 82%. Second, we compare our approach with other baseline methods and find that our approach achieves 82% higher def coverage and 58% higher use coverage than baselines. Third, we conduct an ablation experiment that confirmed the effectiveness of our AI chain design. Finally, we apply our approach to various LLMs to demonstrate its generalizability.

The main contributions of this article are as follows:

—

Instead of integrating manual heuristic rules into static analysis methods, we leverage LLMs’ language understanding and pattern matching abilities to capture def-use flow information of variables and predict IDF occurring at runtime.

—

Our informative CoT approach with five steps addresses challenges, such as object sharing, tracking dataflow in comprehension evaluation, and side effects of callable objects.

—

We implement our CoT in a well-designed modular AI chain with separate AI units to improve the reasoning reliability.

—

Our experiments demonstrate that our approach has strong def-use flow information prediction abilities and can generate DFGs with more complete and accurate def-use flow information.

2 Problem Definition

In contrast to conventional static analysis-based methods for generating DFGs, our approach leverages the comprehensive understanding and pattern recognition capabilities of LLMs. This enables us to deduce IDF within the code, resulting in the generation of DFGs with enhanced accuracy. Our research centers on the identification of three specific types of IDF: object sharing, IDF in comprehension and callable object side effects. These forms of IDF are fully observable within the confines of a single function and do not necessitate spanning across multiple functions or involving class-level. By confining our analysis to the scope of a single function, we can focus our efforts on meticulously tracking and analyzing variable assignments and their passing within the IDF. Conversely, broadening the analysis to encompass multiple functions or class levels introduces complexity in verifying the effectiveness of our method in capturing these IDFs. This complexity arises due to the need to account for not only variable assignments and passing but also intricate factors, such as inter-function interactions, class inheritance, and polymorphism. Such complexities diminish the efficiency and precision of verification processes. To ensure the efficiency and accuracy of verification, we impose a scope limitation on the input code: the input code only involves a single function, without considering multiple functions or class-level scenarios.

3 Approach

The main challenge in generating accurate DFGs for dynamically typed programming languages is yielding IDF information, which can only be computed at runtime and thus cannot be determined at compile time using static analysis methods. To address this issue, we propose a new approach called DFG-Chain, which utilizes LLMs’ powerful context-understanding and pattern matching abilities to predict runtime behavior and obtain IDF. To better design this approach, we simulate the human thought process and break down the task into single-responsibility sub-problems, designing functional units linked in a serial, parallel, or split-merge structure to create a multi-round interaction with the LLM to solve problems step by step. Within our approach, we interact with generative pre-trained transformer (GPT)-3.5¹ using the API corresponding to the gpt-3.5-turbo-16k² model. It should be pointed out that our approach mainly focuses on the concerned problem itself including task characteristics, data properties, and information flow, rather than the selection of LLMs. Thus, besides GPT-3.5, other LLMs should be also applicable to our approach. Moreover, unlike the way of fine-tuning LLMs, which requires significant effort in data gathering, preprocessing, annotation, and model training, our approach does not involve such an intensive process.

3.1 Hierarchical Task Breakdown

Given a Python code illustrated in Figure 2(a1), its corresponding accurate DFG is shown in Figure 2(b). However, due to IDFs in the code, directly inputting it into the LLM will result in missing def-use flow information, as seen in the orange part of Figure 2(b).

Fig. 2.

First, line 4 contains callable object side effects, and the LLM struggles to predict the internal semantics of “shuffle()” to eliminate side effects on variable “a.” This leads to an IDF problem, with the loss of “def: $\{a\}$ ” information in Figure 2(b)’s orange part.

Second, line 3 involves object sharing between variables “a” and “b,” with updates to “a” affecting “b.” The LLM cannot directly identify shared objects, causing another IDF problem, namely the loss of “def: $\{b\}$ ” information in Figure 2(b)’s orange part.

Third, the list comprehension in lines 6–8 of the code presents difficulty in tracking dataflow as LLM cannot always predict the internal semantics of the “pop()” method used within the list comprehension, resulting in challenges on capturing the def-use flow information for all relevant variables. Consequently, an IDF problem arises, causing the loss of “def: $\{x\}$ , def: $\{b\}$ , and use: $\{c\}$ ” information in the orange part of Figure 2(b).

Furthermore, when the code contains three different types of IDFs simultaneously, it may cause compounded errors. This means that directly using LLM to generate DFGs could result in the omission of def-use flow information, except for the orange part in Figure 2(b). Therefore, predicting IDFs and generating DFGs in a single step is infeasible. Instead, it is necessary to decompose this task into sub-problems, with each having a single responsibility, and solve them step by step.

To tackle these single-responsibility sub-problems, we design AI and non-AI units and create an AI chain, as depicted in Figure 3. The first AI unit, Import Statement Completion, is designed to complete import statements for given code, aiding LLMs in predicting internal semantics of the called function or method to eliminate side effects in subsequent AI units. For instance, with the “shuffle()” method, the LLM can determine from the code context that it belongs to the “random” library and complete the appropriate import statements. The second AI unit in our approach, Comprehension Transformation, expands complex comprehension to a series of simple algebraic operations and simplification (such as loop structure) to track IDF within the comprehension. In addition, we use a non-AI unit called Examples Selection to provide appropriate prompt examples for the Comprehension Transformation unit. Then, the code without comprehensions is fed to the Program Slicing AI unit, which extracts relevant slice code containing def-use flow information for each variable, including code that updates variables values due to object sharing. With each variable’s corresponding slice code, the Def-Use Extraction AI unit generates def-use flow information. The Def-use Flow Fusion AI unit takes the code without comprehension, slice code, and the nodes and edges of the CFG as input, producing a DFG with complete def-use flow information. It shoulde be noted that a non-AI unit called Extraction of CFG Nodes and Edges is employed to acquire the CFG’s nodes and edges.

Fig. 3.

3.2 Prompt Design for AI Units

In this section, we discuss the process of designing natural language prompts for invoking LLMs to perform various factorized AI unit functions.

According to empirical studies [29, 41], prompt design involving task descriptions and examples is crucial. To ensure standardization, we have developed a generic template including a task description and a few input–output examples. We use the example of Import Statements Completion unit (as shown in Figure 4) to illustrate how the template is constructed, which completes the import statements in a Python code.

Fig. 4.

The top of the template contains a description (e.g., “Complete missing import statements in the Python code”) in green, followed by five input–output examples in the middle (e.g., Input: “original code: def test():…,” Output: “from numpy.random import shuffle…”), and an input (Python code) and an output (e.g., code with import statements) at the bottom. Although we show the input and output side by side, they are presented sequentially in the actual prompt.

It is worth noting that we have pre-selected five examples for all AI units in this work. While adaptability generally improves with more examples [29], studies have shown that the accuracy gains beyond five examples are limited [42]. Given this, we manually prepare five examples for each of the four AI units, excluding the Comprehension Transformation unit. In doing so, we carefully consider the representativeness and diversity of the examples. For instance, in the Import Statements Completion unit, we ensure that each example contains different import statements, such as numpy and random. For the Comprehension Transformation unit, as shown in Table 1, we prepare seven sets of examples. These sets are used based on the type of comprehension present in the input code. For instance, if the input code contains both list comprehensions and set comprehensions, we select the LC and SC set from Table 1, which includes two LC2LP (List Comprehension to Loop Structure) and two SC2LP examples.

In the subsequent sections, we will provide a detailed description of the prompt design for each AI units.

Table 1.

Example Set	LC2LP	SC2LP	DC2LP
Example Set	EN	EN	EN
LC Set	5	0	0
SC Set	0	5	0
DC Set	0	0	5
LC and SC Set	2	2	0
LC and DC Set	2	0	2
SC and DC Set	0	2	2
LC and SC and DC Set	2	2	2

Table 1. Seven Sets of Examples

LC, SC, and DC denote list, set, and dict comprehensions, respectively. LC2LP, SC2LP, and DC2LP correspond to examples of transforming these comprehensions to loop structures. EN stands for the number of examples in the example set.

3.2.1 Import Statements Completion Unit.

To address callable object side effects, we use the Import Statements Completion unit to fill in any missing import statements in code. This enriches the input for subsequent units, such as the Def-Use Extraction unit, which is explained in Section 3.2.4. With the completed import statements, the LLM can make more accurate predictions regarding the internal semantics of the called function or method, which is the logic and operations within them. For example, in the Def-Use Extraction unit, after completing the import statement for “shuffle(a1),” the LLM can predict whether the value of “a1” is updated or only used by analyzing the internal semantics of “shuffle(a1).”

To prompt the LLM for the import statements completion task, we utilize a generic template illustrated in Figure 4. The template includes a task description stating “Complete the import statements in the Python code.,” five examples, and a space for the code to complete its import statements. To provide representative examples, we select the five code examples related to the numpy and random libraries. We choose these libraries because numpy³ is widely utilized for scientific computing, while random⁴ is commonly employed for generating pseudo-random numbers. Both libraries find extensive application in fields, such as data science, machine learning, and numerical computing, where code often involves the definition and utilization of numerous variables [43].

3.2.2 Comprehension Transformation Unit.

Tracking the def-use flow information of variables in code containing comprehensions (e.g., list, set, and dictionary comprehensions) can be challenging due to IDF issues. To address this, we utilize the Comprehension Transformation unit to expand comprehensions into simple algebraic operations and simplify them. The prompt for this task, shown in Figure 5, includes a task description (“Rewrite the following Python code by expanding…”) and five examples, with input as code with import statements and output as code without comprehension structure. To assist the LLM in completing this task, we utilize the non-AI unit called Examples Selection (see Figure 3) to determine the comprehension type present in the input code dynamically. This non-AI unit then selects the appropriate set of examples from seven example sets (see Table 1) by traversing the abstract syntax tree (AST) nodes. Specifically, we first use the AST⁵ library to convert the input code (i.e., code with import statements) into the corresponding AST. Then, we traverse each node in the AST to identify the types of comprehensions (list, dict, and set comprehensions) present in the input code. Finally, based on the identified comprehension types, we decide which set of examples from seven sets to select. For instance, if we detect both list comprehensions and dict comprehensions in the input code, to ensure the representativeness of selected examples, we ensure that the selected examples include cases where list comprehensions and dict comprehensions are transformed into loop structures. Specifically, the selected set of examples (i.e., the LC and DC set in Table 1) comprises two examples converting list comprehensions into loop structures and two examples converting dictionary comprehensions into loop structures.

Fig. 5.

3.2.3 Program Slicing Unit.

To handle the IDF problem caused by object sharing, we introduce the Program Slicing unit, which extracts code for each variable containing its def-use flow information, including code that updates the variable’s value due to object sharing. The prompt for this AI unit, shown in Figure 6, includes a task description (“Perform program slicing…”) and five examples. The input is code without comprehensions, and the output is slice codes for all variables. To ensure the representativeness of our selected five examples, we adhere to the following three criteria: (1) We include import statements in the chosen examples, as the input code already contains them. (2) We exclude comprehensions from the selected examples, considering that the input code lacks them. (3) We focus on examples where two or more variables share the same object in the code, given the utilization of the Program Slicing unit to address object sharing issues. In the example prompt in Figure 6, a1 and a2 share an object, and the “shuffle()” method updates a1’s value, affecting a2. As a result, a2’s slice code includes “shuffle(a1).”

Fig. 6.

3.2.4 Def-Use Extraction Unit.

The Def-Use Extraction AI unit is responsible for extracting def-use flow information from each variable’s slice code. The prompt (shown in Figure 7) features a task description “Extract the def-use flow information…” and five examples. The input is the slice code for each variable, while the output presents the slice code with def-use flow information for each variable. Since the Import Statements Completion unit already completed the import statements, the LLM has access to the source library for the callable objects, enabling it to more precisely predict their internal semantics and extract def-use flow information.

Fig. 7.

3.2.5 Def-Use Flow Fusion Unit.

The final AI unit, Def-use Flow Fusion, generates a DFG with complete def-use flow information by integrating prior information. Figure 8 shows the prompt with a task description of “Fuse the given Python code…” and five examples. Each example includes code without comprehension structure, slice codes with def-use flow information, and nodes and edges of the CFG. This AI unit combines these inputs to produce a DFG with complete def-use flow information, while retaining the nodes and edges of the CFG, as mentioned in Section 1. A non-AI unit, Extraction of CFG Nodes and Edges (Figure 3), extracts nodes and edges from the Python code using the CFG Generator tool [44]. We utilize this tool for extracting nodes and edges from CFGs because it is derived from staticfg.⁶ staticfg is a library designed to generate CFGs for Python 3 programs, with 168 stars on GitHub.⁷ In contrast to staticfg, the CFG Generator derived from it has been enhanced to handle cases involving exception handling, lambda expressions, generator expressions, as well as list/set/dict comprehensions. This allows it to accurately generate CFGs for code containing these constructs.

Fig. 8.

3.3 Running Example

We provide an example to illustrate how AI units work together and how data are transformed among them. The example involves a Python code with three types of IDF, as shown in Figure 9(a).

Fig. 9.

First, the Python code is input into the Import Statements Completion unit, which fills in missing import statements for the code, shown in Figure 9(b). This unit completes import statements for the “shuffle()” and “choices()” method (i.e., from random import shuffle and from random import choices).

Next, we use the Comprehension Transformation unit to expand the list comprehension in lines 8–10 of Figure 9(b) to a loop structure. This results in the code shown in Figure 9(c).

Then, we address object sharing using the Program Slicing unit to slice the code without comprehensions for each variable. The result of this process is shown in Figure 9(d). Since variables “a” and “b” share the same list object, “shuffle(a)” updates the values of both “a” and “b,” resulting in the slice code for variable “b” containing “shuffle(a).”

Next, the slice codes of variables “a,” “c,” “x,” and “b” in Figure 9(d) are input into the Def-Use Extraction unit to extract def-use flow information. The AI unit processes them to extract def-use flow information for all slice codes, as shown in Figure 9(e). Note that the process is executed in parallel using multiple Def-Use Extraction units.

Finally, the slice codes of all variables are fed into the Def-use Flow Fusion unit to produce a DFG that includes full def-use flow information. The Def-use Flow Fusion unit integrates prior information from the previous AI units and produces the final DFG, as shown in Figure 9(f).

4 Experimental Setup

This section presents research question (RQ) to evaluate the performance of approach, along with experimental setup, which includes data preparation, baselines, evaluation metrics and LLMs introduction.

4.1 Rqs

We formulate four research questions to assess the performance of DFG-Chain in generating DFGs:

—

RQ1: What is the individual effectiveness of each AI unit within DFG-Chain?

—

RQ2: How accurate is the DFG generation by DFG-Chain compared to baselines?

—

RQ3: What is the impact of removing individual units and zero-shot prompts on the performance of the DFG-Chain in DFG generation?

—

RQ4: How does the performance of DFG-Chain generalize across different LLMs?

Model Parameter Configuration. In our experiments, our primary LLM used is GPT-3.5. In RQ4, to demonstrate the generalizability of our approach, we apply our approach to other LLMs (GPT-4 [26] and Code Llama [45]). The main parameter configurations for the other LLMs are shown in Table 2. The values of parameters not shown are set to their default values.

Table 2.

Parameters	max_tokens	Temperature	frequency_penalty
Value	2,049	0.0	0.0

Table 2. LLM Parameter Settings

First, max_tokens refers to the maximum number of tokens that LLMs can generate in response to a given input prompt. As shown in Table 2, we set it to a relatively high value of 2,049 to prevent the output from being truncated due to an excessive number of tokens. Second, temperature is commonly used to adjust the diversity of generated text. It is a parameter ranging between 0 and 2, utilized to control the level of randomness in text generated by a LLM. Higher temperatures lead to more random and diverse text outputs, whereas lower temperatures result in more deterministic and conservative outputs. As shown in Table 2, we set the temperature to 0 to ensure the stability of the LLM’s output. Finally, frequency_penalty is used to regulate the generation of text. Its typical range is between $-$ 2 and 2. Higher values discourage token repetition, fostering text diversity, while lower values permit more repetition, potentially resulting in more predictable outputs. As indicated in Table 2, we set it to 0 to allow for token repetition in the generated text. This choice ensures that certain words, such as “def” and “use,” appear multiple times as needed within the output text.

4.2 Data Preparation

To evaluate the proposed DFG-Chain, we begin with a large set of 240,000 complete method Python code samples from CodeNet [46]. We specifically chose complete methods to ensure the accuracy of the CFG generated by the CFG generator tool [44]. This allows us to concentrate on DFG generation without being impacted by the quality of the CFG. However, not all of the code samples contained the three types of IDFs that our approach aims to address (i.e., object sharing, IDF in comprehension, and callable object side effects). Therefore, we filtered the samples down to 8,671 that contained at least one of these flows. We chose code samples containing these three types of IDF for evaluation because they are primary obstacles preventing traditional static analysis methods from generating complete DFGs in dynamic languages like Python. Our aim is to assess our approach’s performance in capturing these dataflows and generating complete DFGs.

Similar to previous studies [47–49], we employ a sampling method [50] to ensure that the evaluation metrics observed in the sample can be generalized to the population within a certain confidence interval at a specific confidence level. For a error margin of 5 at a 95% confidence level, we randomly select 384 code samples from the filtered set, noting that a single code sample may have multiple IDFs. This selection consists of 287 samples with callable object side effects, 147 with IDF in comprehension, and 130 with object sharing. It’s worth noting that we remove the import statements originally included in the code samples to evaluate our approach’s ability to handle code lacking import statements. We refer to this dataset of 384 samples as the IDF dataset.

To comprehensively evaluate the effectiveness of DFG-Chain, we also construct a dataset without IDFs. In particular, we also chose 384 code samples from the pool of remaining code samples, ensuring they were devoid of IDFs. This selection excluded the 8,671 code samples that did contain IDFs. The resulting dataset, termed the Non-IDF dataset, serves as a comparative counterpart for evaluating our approach’s performance against code samples without IDFs.

Additionally, we evaluate the complexity of the two datasets from three aspects. First, the average length of code: the IDF dataset has an average length of 13.673 lines, while the non-IDF dataset has an average length of 11.446 lines. Second, the average number of if statements and for loops in the code: the IDF dataset has an average of 3.437 occurrences, while the non-IDF dataset has an average of 3.158 occurrences. Finally, we compute the average number of defs and uses in the DFG corresponding to the code: for the IDF dataset, the average number of def and use occurrences is 9.236 and 13.453, respectively, whereas for the non-IDF dataset, the average number of def and use occurrences is 8.886 and 11.275, respectively.

4.3 Baselines

Traditional static analysis tools, such as AST library⁸, Soot [16], and tree-sitter⁹ are widely used for generating DFGs, and we consider them as our baselines. Among these tools, AST library and tree-sitter are utilized to transform source code into ASTs and concrete syntax trees, respectively. Then, by traversing these tree nodes, we extract the def-use flow information of each variable. Soot, on the other hand, is a bytecode analysis tool designed for Java, but we can apply its principles to Python. This involves utilizing Python’s dis library¹⁰ to convert the code into bytecode, and then extracting the def-use flow information of each variable based on the bytecode. Subsequently, utilizing the CFG-Generator,¹¹ we construct the CFG of the code and map the def-use information of each variable to the code blocks of the CFG, ultimately generating the DFG.

However, both syntax trees and bytecode are generated at compile time and therefore cannot fully capture the IDF in dynamic programming languages. In dynamic programming languages, some variables’ def-use flow information can only be manifested during code execution. Thus, traditional static analysis methods cannot completely extract the def-use flow information of every variable in the code, leading to incomplete DFG generation.

We also design two variants of DFG-Chain to explore the effectiveness of our approach. On one hand, we compare DFG-CoT with DFG-Directly (DFG-D) to verify the effectiveness of CoT design. On the other hand, we compare DFG-Chain with DFG-CoT to verify the effectiveness of AI chain design. The details of the two baseline are as follows:

—

DFG-D (see Figure 10), which directly calls the LLM to generate the DFG of the Python code.

—

DFG-CoT (see Figure 11), a single-prompting approach that describes all steps in one chunk of prompt text and completes a single generative pass.

Fig. 10.

Fig. 11.

In addition, we conduct an ablation study of DFG-Chain to explain why it works. First, in order to demonstrate the necessity of each AI unit and non-AI unit, we design six variants as follows:

—

DFG- $Chain_{w/oISC}$ , which does not involve Import Statements Completion unit.

—

DFG- $Chain_{w/oES}$ , which does not involve Examples Selection unit.

—

DFG- $Chain_{w/oCT}$ , which does not involve Comprehension Transformation unit.

—

DFG- $Chain_{w/oPS}$ , which does not involve Program Slicing unit.

—

DFG- $Chain_{w/oDUE}$ , which does not involve Def-Use Extraction unit.

—

DFG- $Chain_{w/oECNE}$ , which does not involve Extraction of CFG Nodes and Edges unit.

We do not consider designing a variant without the Def-Use Flow Fusion unit because it is essential for DFG generation; without it, generating DFGs would be impossible.

Second, to demonstrate the effectiveness of the examples in each AI unit, we have also designed the following five variants:

—

DFG- $Chain_{ZS-ISC}$ represents a variant method operating in a zero-shot scenario, where the prompt for the Import Statements Completion unit solely comprises instructions, devoid of any accompanying examples.

—

DFG- $Chain_{ZS-CT}$ is a variant in which the prompt for the Comprehension Transformation unit solely comprises instructions, devoid of any accompanying examples.

—

DFG- $Chain_{ZS-PS}$ is a variant in which the prompt for the Program Slicing unit solely comprises instructions, devoid of any accompanying examples.

—

DFG- $Chain_{ZS-DUE}$ is a variant in which the prompt for the Def-Use Extraction unit solely comprises instructions, devoid of any accompanying examples.

—

DFG- $Chain_{ZS-DUFF}$ is a variant in which the prompt for the Def-Use Flow Fusion unit solely comprises instructions, devoid of any accompanying examples.

4.4 Evaluation Metrics

In RQ1, we use accuracy as evaluation metric for three tasks: Import Statements Completion, Comprehension Transformation,Program Slicing. Accuracy is a binary metric that indicates whether the output of each unit is correct or not, with a value of 1 indicating correct output and 0 indicating incorrect output. For both Def-Use Extraction and Def-use Flow Fusion, we employ def coverage and use coverage as evaluation metrics. This is because the output of Def-Use Extraction consists of slice codes with def-use flow information, and during Def-use Flow Fusion, we assess the DFG’s quality by evaluating the completeness of def-use flow information within the DFG.

In RQ2 and RQ3, we utilize def coverage, use coverage, and $DFG_{accuracy}$ to assess the performance of various techniques in generating the DFG from the Python code. Def coverage refers to the proportion of variable definitions correctly captured in the generated DFG, while use coverage indicates the proportion of variable uses correctly captured.Their respective calculation formulas are shown in Equations (1) and (2). In Equations (1) and (2), $def_{correct}$ and $use_{correct}$ refer to the number of correct defs and uses, respectively, in the generated DFG. $def_{all}$ and $use_{all}$ denote the total number of defs and uses in the ground truth. DFG’s def coverage and use coverage, respectively reflect the DFG’s ability to express variable definitions and uses in the program. Specifically, a high def coverage indicates that the DFG accurately reflects the points where variables are defined in the program, i.e., the locations where variables are defined; whereas a high use coverage implies that the DFG accurately reflects the points where variables are used in the program, i.e., the locations where variables are utilized. These two metrics gauge whether the DFG accurately represents the generation and propagation of data in the program. In addition to def coverage and use coverage, We utilize $DFG_{accuracy}$ to evaluate the performance of various techniques in generating DFGs. $DFG_{accuracy}$ represents the final ratio of examples in the dataset, for which the obtained DFG is entirely correct, as demonstrated in Equation (3). Unlike def coverage and use coverage, $DFG_{accuracy}$ primarily evaluates the actual practical usefulness of diverse techniques

\[def_{coverage}=\frac{def_{correct}}{def_{all}}\]

(1)

\[use_{coverage}=\frac{use_{correct}}{use_{all}}\]

(2)

\[DFG_{accuracy}=\frac{num\_DFG_{correct}}{num _{samples}}.\]

(3)

To establish the ground truth, we first utilize the CFG-Generator tool to extract the nodes and edges of the CFGs for all code samples in two datasets (see Section 4.2). Subsequently, a blend of dynamic analysis and manual inspection is applied. Specifically, we debug each code sample using PyCharm’s debugging feature, observing the console output to document the definition and usage of every variable. This process enables us to gather the def-use flow information. Finally, we map this def-use flow information to each node of the CFG to generate the DFGs. Throughout this procedure, we enlist the aid of three students: two master’s students and one doctoral student, each possessing more than 3 years of Python development experience. They independently debug code samples from both the IDF and non-IDF datasets using PyCharm’s debugging feature, while simultaneously documenting the definition and usage of variables. Should any discrepancies arise, the final adjudication falls to the doctoral student, ensuring the formulation of conclusive DFGs. The calculated Kappa coefficients [51] of 0.88 and 0.85 indicates almost perfect agreement. Consequently, we deem the DFGs derived from both the IDF and non-IDF datasets to be precise and comprehensive, thus serving as our ground truth.

In RQ4, we extend our method to various LLMs. In addition to assessing the effectiveness using metrics like def coverage, use coverage, and $DFG_{accuracy}$ , we also conduct an evaluation of the time and monetary expenses associated with our approach. This evaluation is carried out utilizing the metrics of average time and average cost.

4.5 Large Language Models

4.5.1 GPT-3.5.

GPT-3.5 is a large language model developed by OpenAI, built upon the architecture of GPT series [52]. It is pre-trained on a vast corpus of text data including open-source project code, programming language specifications, and so on. [53], and utilizes a transformer-based architecture [54]. GPT-3.5 has a massive scale with 175 billion parameters.

4.5.2 GPT-4.

GPT-4, like its predecessors in the GPT series, is built upon the GPT architecture. However, unlike GPT-3.5, GPT-4 is a large-scale multimodal model capable of accepting both image and text inputs and generating text outputs [26]. Initially, GPT-4.0 underwent fine-tuning using a combination of data obtained from ScaleAI and text data from OpenAI. Following this, it underwent additional fine-tuning using a reward model based on Reinforcement Learning from Human Feedback and the Proximal Policy Optimization algorithm [26, 55]. Estimates suggest that the model comprises approximately 1.8 trillion parameters [55, 56].

4.5.3 Code Llama.

Code Llama [45] builds upon the llama2 model introduced by meta AI, undergoing training and fine-tuning using code data. It offers multiple variations tailored to different applications: foundational models (Code Llama), Python-specific models (Code Llama—Python), and instruction-following models (Code Llama—Instruct), each with 7B, 13B, 34B, and 70B parameters, respectively. Here, we select Code Llama—Instruct because it undergoes instruction fine-tuning and alignment, resulting in better performance when understanding human natural language prompts.

5 Experimental Results

This section delves three research question to evaluate and discuss the performance of our approach.

5.1 RQ1: What Is the Individual Effectiveness of Each AI Unit within DFG-Chain?

5.1.1 Motivation.

The CoT approach inspires us to break down complex tasks into simple steps. However, the use of a single “epic” prompt in CoT-based methods limits its effectiveness and can lead to error accumulation. To address this, we develop an AI chain with explicit sub-steps, where each step corresponds to a separate AI unit. In this RQ, we investigate whether each AI unit in our approach can effectively ensure the accuracy of DFG generation.

5.1.2 Methodology.

We apply DFG-Chain to IDF dataset and collect intermediate results produced by each AI unit. After obtaining the interim results, we invite two students (one PhD and one MS student, both with over 5 years of Python experience) to assist in obtaining the ground truth for the three units: Comprehension Transformation, Program Slicing, and Def-Use Extraction, excluding Import Statements Completion and Def-Use Flow Fusion. This is because, for the Import Statements Completion unit, its ground truth can be sourced from the Codenet dataset [46], whereas for the Def-Use Flow Fusion, its ground truth (i.e., correct DFGs) has already been acquired in Section 4.4. We provide the inputs of the Comprehension Transformation, Program Slicing, and Def-Use Extraction units to two students for obtaining the ground truth of these three units. The calculated Kappa coefficient [51] of 0.83 indicates almost perfect agreement. Based on the ground truth of the five units, we calculate the efficiency of each unit. The results are presented in Table 3, and more metric information can be found in Section 4.4.

Table 3.

AI Unit	Accuracy	Def Coverage	Use Coverage
Import Statements Completion	0.980	-	-
Comprehension Transformation	0.841	-	-
Program Slicing	0.848	-	-
Def-Use Extraction	-	0.829	0.893
Def-Use Flow Fusion	-	0.922	0.886

Table 3. The Performance of Each AI Unit

5.1.3 Result Analysis.

Table 3 presents the experimental results of running DFG-Chain on the IDF dataset. The first unit, Import Statements Completion, achieves an impressive 98% accuracy on the IDF dataset, indicating its effectiveness in completing missing import statements for Python code, aiding LLM in predicting callable object side effects.

The second AI unit, Comprehension Transformation, attains 84.1% accuracy, suggesting its ability to expand comprehensions into simpler algebraic operations, addressing the IDF in comprehension.

With 84.8% accuracy in the third AI unit, Program Slicing, LLM can effectively slices code related to each variable, including updates variable values due to object sharing.

For the fourth AI unit, Def-Use Extraction, we observe strong def-use flow extraction with def and use coverage of 82.9% and 89.3%, respectively. This is attributed to filling missing import statements in Import Statements Completion, expanding comprehensions in Comprehension Transformation, and program slicing for each variable in Program Slicing unit, enabling efficient def-use extraction.

The final AI unit, Def-use Flow Fusion, attains a high def coverage of 92.2% and use coverage of 88.6%, indicating the LLM’s ability to accurately predict IDFs and generate DFGs with complete def-use flow information. Notably, the final AI unit produces the same results as our DFG-Chain, which are presented in Table 4. This is because the final AI unit’s output serves as our ultimate generated DFG.

Table 4.

Methods	IDF			Non-IDF
Methods	Def Coverage	Use Coverage	$DFG_{accuracy}$	Def Coverage	Use Coverage	$DFG_{accuracy}$
AST-Based	0.507	0.560	0.063	1	1	1
tree-sitter	0.494	0.536	0.060	1	1	1
$Soot_{python}$	0.532	0.588	0.070	1	1	1
DFG-CoT	0.625	0.615	0.344	0.776	0.743	0.440
DFG-D	0.599	0.576	0.135	0.633	0.621	0.245
DFG-Chain	0.922	0.866	0.813	0.976	0.933	0911

Table 4. The Results of Baselines vs. Our Approach

Numbers in bold indicate the highest metrics values.

The high metrics per AI unit confirm the effectiveness of prompt design and composition in connecting AI units for achieving higher-layer tasks effectively.

5.2 RQ2: How Accurate Is the DFG Generation by DFG-Chain Compared to Baselines?

5.2.1 Motivation.

First, We want to investigate if our approach can outperform baseline methods in generating complete def-use information flow, especially in the presence of IDF issues. Second, We want to evaluate the capability of our approach in generating completely correct DFGs, which is an assessment of the practicality of our approach. Finally, we aim to assess the performance of our approach on the dataset that do not contain IDF (i.e., non-IDF dataset). This evaluation will serve to demonstrate the comprehensiveness of our approach.

5.2.2 Methodology.

In this RQ, we implement six different DFG generation approaches: AST-based (i.e., AST library), tree-sitter, $Soot_{python}$ , DFG-Chain, DFG-D, and DFG-CoT (see Section 4.3). These approaches are applied to the IDF dataset and non-IDF dataset, and the resulting data are collected. Since the ground truth for both the IDF and non-IDF datasets is established in Section 4.4, we proceed by directly computing the metrics values for these datasets. Specifically, we calculate the def coverage, use coverage, and $DFG_{accuracy}$ using Equations (1)–(3), respectively. The metric values for def coverage, use coverage and $DFG_{accuracy}$ are presented in Table 4.

5.2.3 Result Analysis.

As shown in Table 4, our approach outperforms all baselines on the IDF dataset. In terms of capturing IDF in code, our approach achieves a def coverage of 92% and a use coverage of 88%. However, the def coverage achieved by the three types of static analysis methods stands at only 50.7%, 49.4%, and 53.2%, respectively. Similarly, the use coverage for these methods is only 56%, 53.6%, and 58.8%. This demonstrates that the three types of static analysis methods lack the ability to address IDF problems. For example, given a code with object sharing as shown in Figure 1(a1), static analysis methods struggle to identify which variables share the same object during compilation, resulting in an IDF problem, namely the loss of “def: $\{b\}$ ” information in orange, as shown in Figure 1(b).

In our evaluation of practicality, our approach achieves a $DFG_{accuracy}$ of 81.3%, surpassing the three static analysis methods (6.3%, 6%, and 7%). This highlights our approach’s ability to accurately generate DFGs for code with IDF. Notably, the $DFG_{accuracy}$ of the static analysis methods is notably low (6.3%, 6% and 7%) due to IDF present in each code within the IDF dataset, hindering their ability to produce accurate DFGs. Despite this, these methods still demonstrate some capability in generating accurate DFGs for select code samples with IDF. We observed that these code samples with entirely correct DFGs only exhibit the phenomenon of object sharing (see Section 1), and the values of several variables sharing the same object remain unchanged since their definition. For example, consider the following code: “a = [1, 2, 3, 4, 5] b = a…,” where “a” and “b” share the same list object. If the values of both “a” and “b” remain unchanged in subsequent code, static analysis methods can capture the def-use flow information of “a” and “b,” thus generating the correct DFG. However, if the value of either “a” or “b” changes, the value of the other variable also changes because they share the same object. The change in the value of the other variable can only be reflected at runtime, so static analysis methods cannot generate entirely correct DFGs.

On the non-IDF dataset, our approach achieves 97.6% def coverage, 93.3% use coverage, and 91.1% $DFG_{accuracy}$ , respectively. These metrics, combined with our performance on the IDF dataset, demonstrate the comprehensive nature of our approach. Regardless of whether the code samples contain IDF, our approach consistently attains high metric values. In contrast, all three static methods achieve 100% def coverage, use coverage, and $DFG_{accuracy}$ on the non-IDF dataset. This perfect performance is attributed to the absence of IDF in the code samples. Consequently, static methods can accurately extract the def-use flow information of variables, enabling them to generate complete DFGs.

In both the IDF and non-IDF datasets, DFG-CoT exhibits def coverage, use coverage, and $DFG_{accuracy}$ lower than those of DFG-Chain but higher than those of DFG-D. This suggests that our AI chain design outperforms CoT’s single-prompting approach, which completes all generative steps in a single pass using an “epic” prompt with hard-to-control behavior and error accumulation. In contrast, DFG-Chain breaks down CoT into an AI chain, with each step corresponding to a separate AI unit that performs separate LLM calls. This enables DFG-Chain to interact with LLMs step by step and generate DFGs for source code effectively.

Furthermore, we summarize failure patterns and plausible causes of our current approach to shed the light on improving our DFG Generation methods. First, One primary concern is the potential for error propagation. For example, if our Comprehension Transformation unit incorrectly expands a comprehension found within a code sample, it can result in the generation of an inaccurate DFG. However, such occurrences are infrequent, largely due to the robust code understanding abilities of the LLM. This is evident from the high efficiency demonstrated by each AI unit in RQ1. Another factor contributing to the failure of our approach is the presence of certain special syntax. This syntax hampers the LLM’s ability to accurately extract def-use flow information for relevant variables. Consequently, it impedes the generation of completely accurate DFGs. One prominent example of this is lambda expressions, exemplified by the code snippet: “result = filter(lambda x: x % 2 == 1, lst).” Due to the absence of AI units specifically tailored to handle such specialized syntax in our approach, the LLM struggles to appropriately extract the def-use flow information associated with variables “x.” We further analyze this failure in Section 6.3.

Standing on the shoulder of LLM for DFG generation, DFG-Chain has a strong ability to solve the implicit data flow issue in dynamic languages, resulting in completely correct DFGs. Each AI unit in the AI Chain follows the principle of single responsibility and can interact with LLMs separately to generate more complete DFGs.

5.3 RQ3: What Is the Impact of Removing Individual Units and Zero-Shot Prompts on the Performance of the AI Chain in DFG Generation?

5.3.1 Motivation.

First, we aim to investigate the impact of removing each unit from the AI Chain on the performance of DFG-Chain, thereby verifying the indispensability of each unit within DFG-Chain. Second, we seek to examine the effect of changing the number of examples in the prompts corresponding to AI units to zero on the performance of DFG-Chain, thereby validating the effectiveness of each example within the prompts.

5.3.2 Methodology.

First, we establish six approach variants (DFG- $Chain_{w/oISC}$ , DFG- $Chain_{w/oES}$ , DFG- $Chain_{w/oCT}$ , DFG- $Chain_{w/oPS}$ , DFG- $Chain_{w/oDUE}$ , and DFG- $Chain_{w/oECNE}$ ) and test their performance on the IDF dataset. We compare these results with the performance of DFG-Chain to validate the indispensability of each unit. Second, we set up another five approach variants (DFG- $Chain_{ZS-ISC}$ , DFG- $Chain_{ZS-CT}$ , DFG- $Chain_{ZS-PS}$ , DFG- $Chain_{ZS-DUE}$ , and DFG- $Chain_{ZS-DUFF}$ ) and conduct performance tests on the same IDF dataset to validate the effectiveness of examples in each prompt. The same method as RQ2 is employed to test these variants and calculate metric values.

5.3.3 Result Analysis.

The experimental results are presented in Table 5. From Table 5, it is evident that removing individual units leads to varying degrees of decrease in three metrics value (def coverage, use coverage, and $DFG_{accuracy}$ ). This underscores the indispensability of each unit within DFG-Chain. Among them, DFG- $Chain_{w/oISC}$ exhibits the smallest performance decrease, achieving 77.8% def coverage, 71.7% use coverage, and 63.6% $DFG_{accuracy}$ . This is because, even without import statements, there are occasions where LLM can still utilize its robust code comprehension and reasoning capabilities to determine whether a callable object (i.e., functions and methods) utilizes referenced variables or changes their values based solely on its name. For example, based on the name “mean,” LLM can infer that “np.mean(a)” is used to calculate the mean of array “a,” thereby analyzing that “np.mean” only utilizes “a” without changing its value. Conversely, the most significant performance decrease is observed in DFG- $Chain_{w/oECNE}$ . This is because if the Extraction of CFG Nodes and Edges non-AI unit is removed, the LLM loses access to the CFG as external knowledge. Consequently, the LLM is tasked not only with fuzing the def-use flow within the code but also with generating nodes and edges for the CFG. This violates the principle of single responsibility in software engineering, resulting in def coverage, use coverage, and $DFG_{accuracy}$ of only 54.7%, 48.7%, and 38.8%, respectively. Additionally, the DFG produced by DFG- $Chain_{w/oECNE}$ displays missing edges and erroneous connections.

Table 5.

Methods	Def Coverage	Use Coverage	$DFG_{accuracy}$
DFG-Chain	0.922	0.886	0.813
DFG- $Chain_{w/oISC}$	0.778	0.717	0.633
DFG- $Chain_{w/oES}$	0.767	0.741	0.612
DFG- $Chain_{w/oCT}$	0.657	0.559	0.487
DFG- $Chain_{w/oPS}$	0.710	0.793	0.620
DFG- $Chain_{w/oDUE}$	0.771	0.741	0.643
DFG- $Chain_{w/oECNE}$	0.547	0.487	0.388
DFG- $Chain_{ZS-ISC}$	0.812	0.794	0.742
DFG- $Chain_{ZS-CT}$	0.716	0.649	0.604
DFG- $Chain_{ZS-PS}$	0.734	0.797	0.677
DFG- $Chain_{ZS-DUE}$	0.792	0.759	0.693
DFG- $Chain_{ZS-DUFF}$	0.628	0.580	0.542

Table 5. Ablation Results of DFG-Chain Variants

Numbers in bold indicate the highest metrics values.

Analysis of Table 5 also reveals a notable trend: when the example count for any AI unit is set to zero, there is a discernible decline in performance of DFG-Chain. This emphasizes the effectiveness of each example within the prompt for DFG-Chain. The inclusion of examples in the prompt aids the LLMs in comprehending their responsibilities more effectively. While natural language instructions guide LLMs in understanding their responsibilities, research [41, 57, 58] indicates that incorporating examples provides additional contextual information, thereby enhancing its task comprehension and performance. Among the five approach variants, DFG- $Chain_{ZS-ISC}$ demonstrates the least performance decline, with def coverage, use coverage, and $DFG_{accuracy}$ at 81.2%, 79.4%, and 74.3%, respectively. Conversely, DFG- $Chain_{ZS-DUFF}$ exhibits the most significant performance decline, with def coverage, use coverage, and $DFG_{accuracy}$ only at 62.8%, 58.0%, and 54.2%, respectively. This discrepancy can be explained by the fact that different AI units are given the responsibilities with varying levels of difficulty. DFG- $Chain_{ZS-ISC}$ task involves completing missing import statements in the code, which is relatively simple, relying solely on the natural language description provided in the prompt. LLM is also capable of understanding its responsibilities to a significant extent. Conversely, DFG- $Chain_{ZS-DUFF}$ ’s responsibility of fuzing def-use flow within the code for DFG generation presents a higher level of difficulty. Without examples, the LLM encounters greater difficulty in understanding the definition of DFG and generating it solely based on natural language instructions.

(1) All units in our DFG-Chain are essential for generating a DFG with complete def-use information, and the absence of any one of them would result in an incomplete DFG. Moreover, among all the units, Extraction of CFG Nodes and Edges non-AI unit is the most crucial for DFG generation. (2) The inclusion of examples in DFG-Chain’s prompts enhances the performance of DFG-Chain.

5.4 RQ4: How Does the Performance of DFG-Chain Generalize across Different LLMs?

5.4.1 Motivation.

First, in RQ1 to RQ3, we solely utilize GPT-3.5 as the only LLM. In this RQ, our investigation focuses on assessing the performance of our approach across different LLMs, with a particular emphasis on examining the generalizability of our approach. Second, we aim to investigate the time and monetary expenses associated with our approach across various LLMs.

5.4.2 Methodology.

First, based on the IDF dataset we constructed, we apply our approach to different LLMs (i.e., GPT-3.5, GPT-4, and Code Llama). The basic parameter configurations for these LLMs are detailed in Table 2. We compare the performance of our approach across these three LLMs and discuss its generalizability. Second, we conduct an evaluation of the average time and monetary costs incurred by our approach across different LLMs, thereby delving into its implications in terms of time and financial resources. The metrics values for def coverage, use coverage, $DFG_{accuracy}$ , average time and average cost are presented in Table 6.

Table 6.

LLMs	Def Coverage	Use Coverage	$DFG_{accuracy}$	Average Time (seconds)	Average Cost ($)
GPT-3.5	0.922	0.866	0.813	148	0.0007
GPT-4	0.956	0.921	0.862	102	0.0810
Code Llama-7B	0.739	0.678	0.638	132	0
Code Llama-13B	0.767	0.744	0.672	199	0
Code Llama-34B	0.838	0.807	0.745	268	0
Code Llama-70B	0.916	0.859	0.805	316	0

Table 6. The Performance of DFG-Chain across Different LLMs

Numbers in bold indicate the highest metrics values.

5.4.3 Result Analysis.

As depicted in Table 6, our approach exhibits robust performance across three types of LLMs, showcasing its generalizability. Among these LLMs, GPT-4 stands out as the top performer, boasting def coverage, use coverage, and $DFG_{accuracy}$ of 95.6%, 92.1%, and 86.2%, respectively. The remarkable performance of GPT-4 can be largely attributed to its extensive parameter count of 1.8 trillion, which is nearly ten times that of GPT-3.5 (175 billion parameters). Within the Code Llama series, Code Llama-7B demonstrates the weakest performance, achieving a $DFG_{accuracy}$ of only 0.617. However, performance gradually improves with the increase in parameters, as evidenced by Code Llama-70B, which achieves a $DFG_{accuracy}$ of 0.805. Furthermore, it is noteworthy that despite GPT-3.5’s parameter count of 175 billion, its $DFG_{accuracy}$ only slightly surpasses that of Code Llama-70B, which possesses a parameter count of 70 billion (0.805 vs. 0.813). This discrepancy arises because GPT-3.5 lacks specialized training on code data, whereas the Code Llama series is trained predominantly on a near-deduplicated dataset of publicly available code, with a foundation established by Llama2.

An analysis of Table 6 also indicates that our approach remains within acceptable ranges in terms of both time and monetary expenditure. Out of the three types of LLMs, GPT-4 stands out with the highest monetary cost per code sample, averaging $0.0810. Nevertheless, it also boasts the best performance, achieving a $DFG_{accuracy}$ of 0.862. It’s important to mention that the Code Llama series carries no monetary expense, as it is freely accessible for both research and commercial purposes. When considering time expenditure, the time costs associated with our approach across the three types of LLMs also remain within an acceptable range. Among them, GPT-4 has the lowest time expenditure, averaging 102 seconds to process a code sample. The longest time is spent on Code Llama-70B, with an average of 316 seconds per code sample. This is because deploying the Code Llama series models locally involves loading and initializing the model and its associated components. In contrast, using GPT-3.5 and GPT-4 only requires sending requests and receiving responses, without the need for model loading and initialization steps.

(1) The outstanding performance across different LLMs demonstrates the generalizability of our approach. Furthermore, our approach exhibits the best performance on GPT-4. (2) Our approach’s expenditure, whether in terms of money or time, remains within acceptable bounds.

6 Discussion

6.1 Internal Threat to the Validity

The main internal threat with our approach is error propagation. Specifically, if an AI unit encounters an error or provides inaccurate results, it can propagate through subsequent units, ultimately leading to incorrect DFG. Despite the high accuracy of each individual AI unit in our experiments (see Section 5.1), the issue of error propagation persists. To address the issue of error propagation, we could draw inspiration from Fu et al. [59] and intend to incorporate the scoring and optimization mechanism [59] into our DFG-Chain. This mechanism aims to score and optimize the output of each AI unit, minimizing the potential for error propagation and enhancing the effectiveness of our approach. Nevertheless, achieving a comprehensive solution to tackle error propagation effectively requires further research and exploration.

6.2 External Threats to the Validity

DFG generation involves determining the programming language, its characteristics, and the LLM used to generate DFGs. However, there are external threats in all three aspects.

For the first threat, our current study focuses only on Python, but we plan to investigate other dynamically typed languages like Ruby, R, and PHP to evaluate the generalizability of our approach. This mostly requires changing prompt examples rather than requiring significant engineering and maintenance effort for different languages and versions.

For the second threat, we plan to consider language structures, such as iterators, besides comprehensions. For example, the LLM cannot directly predict the def-use flow information for “x” in the code “result = filter(lambda x: x % 2 == 1, lst).” To address this, we need to expand iterators into simple algebraic operations, such as loops, by including examples of this in the prompt.

The final threat is the potential for occasional output instability in LLMs [60, 61]. To minimize this impact, akin to prior research [62], we set the temperature parameter in the LLM configuration to zero, significantly enhancing the stability of our approach. Moreover, numerous studies indicate [63, 64] that structured prompts contribute to stable LLM outputs. Consequently, we plan to transform the natural language prompts in DFG-Chain into structured prompts in the future to further ensure the stability of LLM outputs.

6.3 Failure Case Study

As depicted in Figure 12(a1), this code snippet is sourced from Stack Overflow.¹² In the second line, a lambda expression is present. The primary objective of this code is to filter integers from a provided list, retaining only those divisible by five, and then storing them as strings in a separate list. To aid comprehension, we offer an alternative code representation using loop structures, depicted in Figure 12(a2). Because there is no specialized AI unit to capture the pattern of lambda expressions (i.e., lambda parameter: expression) and convert them into loops, our approach misses extracting the “def: $\{a\}$ ” shown in the orange section of Figure 12 while generating the corresponding DFG for this code.

Fig. 12.

However, addressing this issue is straightforward. We can accomplish it by creating a new AI unit that utilizes the robust pattern-learning ability of LLM. This AI unit is designed to identify the pattern present in lambda expressions (i.e., lambda parameter: expression) and subsequently transform these expressions into loops. On the other hand, incorporating heuristic rules into static analysis methods poses certain challenges, primarily due to the impossibility of exhaustively considering all potential rules. For example, consider the following lambda expressions: “add = lambda a, b: a + b” and “reverse_string = lambda s: s[::-1].” If heuristic rules are adopted, separate rules need to be devised for these two expressions because they involve different computational logic and processing methods. In contrast, by leveraging the robust pattern-learning capabilities of LLM, we can capture the common pattern shared by these two expressions: lambda parameter: expression. In essence, the expression patterns found in programming languages are finite [65], and our approach is adept at learning and capturing these finite patterns. However, the rules derived from patterns are infinite. Therefore, relying on manual heuristic rules in static analysis for DFG generation is impractical.

7 Related Work

7.1 Generating DFGs for Dynamically Typed Languages

Generating DFGs for dynamically typed languages like Python is more challenging than statically-typed languages like Java, mainly due to IDF resulting from dynamic features, such as short-circuit, lazy and delayed evaluation. Currently, two solutions could be used potentially to address IDF issues.

The first potential solution involves utilizing static analysis methods [44, 66] that analyze the code before running to identify issues based on static information within the code, such as variable types, function calls, and syntax structures. However, these methods fall short in detecting IDF issues resulting from dynamically generated expression calculations during runtime and cannot analyze the def-use flow information of variables. Although heuristic rules can be developed to infer the relationship between variable definitions and usages automatically, it requires significant human effort and cannot cover all possible scenarios.

The other potential solution involves leveraging LLMs, such as GPT-3 [24], CodeX [67], ChatGPT [68], which could be potentially used to identify and resolve IDF issues. Previous studies [69–73] demonstrate that LLMs can learn implicit patterns and rules from the code’s expressions, allowing them to comprehend the code’s semantics. However, it is worth noting that the primary focus of these studies is not on the generation of DFGs, but rather on other tasks, such as code representation and API relation extraction.

In contrast to previous studies [69–73], our focus lies on DFG generation, marking the first instance of such emphasis. Additionally, our approach differs from potential static analysis methods as it does not rely on explicit syntax and semantic information in the source code. Instead, it leverages deep features learned from a vast amount of program data, resulting in a more comprehensive understanding of the underlying causes and flow paths of IDF in the code.

7.2 Transferring LLMs to Downstream Tasks

Two approaches transfer LLMs to downstream tasks: supervised fine-tuning [73–75] and in-context learning [24, 30, 31]. Supervised fine-tuning aligns pre-training with downstream tasks using prompts, enabling strong few-shot learning. However, it struggles with complex tasks that require substantial labeling, such as DFG generation.

In-context learning conditions LLMs on task descriptions and demonstrations to generate answers, which has been used in various tasks, such as software testing [76] and code generation [77]. Previous works use direct-inquiry style prompts, which limit their ability to handle complex reasoning tasks. CoT is proposed to break down complex tasks into simple instructions, but it struggles with intricate tasks [35, 36]. Our approach, based on AI chain [39, 40, 78], involves interacting with LLMs in explicit steps to generate domain-specific flow graphs for complex tasks (e.g., DFG generation), providing a more thorough analysis than existing CoT works.

8 Conclusion and Future Work

In this article, we find that IDF problems underlie many issues stemming from the dynamic features of dynamically typed programming languages. To overcome this fundamental problem, we propose utilizing the in-context learning, language understanding, and pattern matching capabilities of LLMs to predict IDF during runtime and capture def-use flow information for variables. Our approach involves an informative CoT consisting of five steps and breaking down the CoT into an AI chain with multiple separate AI units to enhance the robustness and reliability of LLM outputs.

Our approach provides a novel alternative for developing software engineering tools, eliminating the need for significant engineering and maintenance effort. By leveraging foundation models, we can focus on identifying problems for AI to solve rather than spending time on data collection, labeling, model training, or program analysis. Overall, our approach offers a promising direction for the development of efficient software engineering tools.

In the future, we plan to explore our approach’s potential and broaden its application in software engineering domains like program repair and test cases generation. Additionally, we aim to apply our approach to partial dynamically typed code, further enhancing its versatility and applicability. Our data can be found here.¹³

Footnotes

https://openai.com/blog/introducing-chatgpt-and-whisper-apis

https://platform.openai.com/docs/models/gpt-3-5-turbo

https://numpy.org/doc/stable

⁴

https://docs.python.org/3/library/random.html

⁵

https://docs.python.org/3/library/ast.html

⁶

https://pypi.org/project/staticfg

⁷

https://github.com/coetaur0/staticfg

⁸

https://docs.python.org/3/library/ast.html

⁹

https://github.com/tree-sitter/py-tree-sitter

¹⁰

https://docs.python.org/3/library/dis.html

¹¹

https://github.com/Tiankai-Jiang/CFG-Generator

¹²

https://stackoverflow.com/questions/77124967

¹³

https://anonymous.4open.science/r/AI-Chain-on-LLMs-for-Predicting-Implicit-Data-Flows-to-Generate-DFGs-in-Dynamically-Typed-Code-C545/README.md

References

[1]

Hemant D. Pande and William Landi. 1991. Interprocedural def-use associations in C programs. In Proceedings of the Symposium on Testing, Analysis, and Verification. 139–153.

Abstract

1 Introduction

2 Problem Definition

3 Approach

3.1 Hierarchical Task Breakdown

3.2 Prompt Design for AI Units

3.2.1 Import Statements Completion Unit.

3.2.2 Comprehension Transformation Unit.

3.2.3 Program Slicing Unit.

3.2.4 Def-Use Extraction Unit.

3.2.5 Def-Use Flow Fusion Unit.

3.3 Running Example

4 Experimental Setup

4.1 Rqs

4.2 Data Preparation

4.3 Baselines

4.4 Evaluation Metrics

4.5 Large Language Models

4.5.1 GPT-3.5.

4.5.2 GPT-4.

4.5.3 Code Llama.

5 Experimental Results

5.1 RQ1: What Is the Individual Effectiveness of Each AI Unit within DFG-Chain?

5.1.1 Motivation.

5.1.2 Methodology.

5.1.3 Result Analysis.

5.2 RQ2: How Accurate Is the DFG Generation by DFG-Chain Compared to Baselines?

5.2.1 Motivation.

5.2.2 Methodology.

5.2.3 Result Analysis.

5.3 RQ3: What Is the Impact of Removing Individual Units and Zero-Shot Prompts on the Performance of the AI Chain in DFG Generation?

5.3.1 Motivation.

5.3.2 Methodology.

5.3.3 Result Analysis.

5.4 RQ4: How Does the Performance of DFG-Chain Generalize across Different LLMs?

5.4.1 Motivation.

5.4.2 Methodology.

5.4.3 Result Analysis.

6 Discussion

6.1 Internal Threat to the Validity

6.2 External Threats to the Validity

6.3 Failure Case Study

7 Related Work

7.1 Generating DFGs for Dynamically Typed Languages

7.2 Transferring LLMs to Downstream Tasks

8 Conclusion and Future Work

Footnotes

References

Index Terms

Recommendations

Interprocedural constant propagation: an empirical study

Interprocedural dataflow analysis in an executable optimizer

Dataflow analysis for concurrent programs using data-race detection

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations