Software Engineering (cs.SE)

Interoperability From Kieker to OpenTelemetry: Demonstrated as Export to ExplorViz
David Georg Reichelt, Malte Hansen, Shinhyung Yang, Wilhelm Hasselbring
Nov 13 2024 cs.SE arXiv:2411.07982v1

@misc{2411.07982, author = {David Georg Reichelt and Malte Hansen and Shinhyung Yang and Wilhelm Hasselbring}, title = {{I}nteroperability {F}rom {K}ieker to {O}pen{T}elemetry: {D}emonstrated as {E}xport to {E}xplor{V}iz}, year = {2024}, eprint = {2411.07982}, note = {arXiv:2411.07982v1} }
While the observability framework Kieker has a low overhead for tracing, its results currently cannot be used in most analysis tools due to lack of interoperability of the data formats. The OpenTelemetry standard aims to standardize observability data. In this work, we describe how to export Kieker distributed tracing data to OpenTelemetry. This is done using the pipe-and-filter framework TeeTime. For TeeTime, a stage was defined that uses Kieker execution data, which can be created from most record types. We demonstrate the usability of our approach by visualizing trace data of TeaStore in the ExplorViz visualization tool.
RedCode: Risky Code Execution and Generation Benchmark for Code Agents
Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, Bo Li
Nov 13 2024 cs.SE cs.AI arXiv:2411.07781v1

@misc{2411.07781, author = {Chengquan Guo and Xun Liu and Chulin Xie and Andy Zhou and Yi Zeng and Zinan Lin and Dawn Song and Bo Li}, title = {{R}ed{C}ode: {R}isky {C}ode {E}xecution and {G}eneration {B}enchmark for {C}ode {A}gents}, year = {2024}, eprint = {2411.07781}, note = {arXiv:2411.07781v1} }
With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at https://github.com/AI-secure/RedCode.
SoliDiffy: AST Differencing for Solidity Smart Contracts
Mojtaba Eshghie, Viktor Åryd, Martin Monperrus, Cyrille Artho
Nov 13 2024 cs.SE cs.PL arXiv:2411.07718v1

@misc{2411.07718, author = {Mojtaba Eshghie and Viktor Åryd and Martin Monperrus and Cyrille Artho}, title = {{S}oli{D}iffy: {AST} {D}ifferencing for {S}olidity {S}mart {C}ontracts}, year = {2024}, eprint = {2411.07718}, note = {arXiv:2411.07718v1} }
Smart contracts, primarily written in Solidity, are integral to blockchain software applications, yet precise analysis and maintenance are hindered by the limitations of existing differencing tools. We introduce SoliDiffy, a novel Abstract Syntax Tree (AST) differencing tool specifically designed for Solidity. SoliDiffy enables fine-grained analysis by generating accurate and concise edit scripts of smart contracts, making it ideal for downstream tasks such as vulnerability detection, automated code repair, and code reviews. Our comprehensive evaluation on a large dataset of real-world Solidity contracts demonstrates that SoliDiffy delivers shorter and more precise edit scripts compared to state-of-the-art tools, while performing consistently in complex contract modifications. SoliDiffy is made publicly available at https://github.com/mojtaba-eshghie/SoliDiffy.
Towards Evaluation Guidelines for Empirical Studies involving LLMs
Stefan Wagner, Marvin Muñoz Barón, Davide Falessi, Sebastian Baltes
Nov 13 2024 cs.SE arXiv:2411.07668v1

@misc{2411.07668, author = {Stefan Wagner and Marvin Muñoz Barón and Davide Falessi and Sebastian Baltes}, title = {{T}owards {E}valuation {G}uidelines for {E}mpirical {S}tudies involving {LLM}s}, year = {2024}, eprint = {2411.07668}, note = {arXiv:2411.07668v1} }
In the short period since the release of ChatGPT in November 2022, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process (e.g., for data annotation) or studies that evaluate existing or new tools that are based on LLMs. This paper contributes the first set of guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of what our community standards are for high-quality empirical studies involving LLMs.
Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis
Minda Li, Bhaskar Krishnamachari
Nov 13 2024 cs.SE cs.AI arXiv:2411.07529v1

@misc{2411.07529, author = {Minda Li and Bhaskar Krishnamachari}, title = {{E}valuating {C}hat{GPT}-3.5 {E}fficiency in {S}olving {C}oding {P}roblems of {D}ifferent {C}omplexity {L}evels: {A}n {E}mpirical {A}nalysis}, year = {2024}, eprint = {2411.07529}, note = {arXiv:2411.07529v1} }
ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform with algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we conduct automated experiments using Python scripts to generate prompts that instruct ChatGPT to create Python solutions. These solutions are stored and manually submitted on LeetCode to check their correctness. For Hypothesis 1, results show the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought Prompting, 38-60% by providing failed test cases in a second feedback prompt, and 33-58% by switching to GPT-4. From a random subset of problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.
Discovery of Timeline and Crowd Reaction of Software Vulnerability Disclosures
Yi Wen Heng, Zeyang Ma, Haoxiang Zhang, Zhenhao Li, Tse-Hsun Chen
Nov 13 2024 cs.SE arXiv:2411.07480v1

@misc{2411.07480, author = {Yi Wen Heng and Zeyang Ma and Haoxiang Zhang and Zhenhao Li and Tse-Hsun Chen}, title = {{D}iscovery of {T}imeline and {C}rowd {R}eaction of {S}oftware {V}ulnerability {D}isclosures}, year = {2024}, eprint = {2411.07480}, note = {arXiv:2411.07480v1} }
Reusing third-party libraries increases productivity and saves time and costs for developers. However, the downside is the presence of vulnerabilities in those libraries, which can lead to catastrophic outcomes. For instance, Apache Log4J was found to be vulnerable to remote code execution attacks. A total of more than 35,000 packages were forced to update their Log4J libraries to the latest version. Although several studies have been conducted to predict software vulnerabilities, the prediction does not cover the vulnerabilities found in third-party libraries. Even if the developers are aware of the forthcoming issue, replicating a function similar to the libraries would be time-consuming and labour-intensive. Nevertheless, it is practically reasonable for software developers to update their third-party libraries (and dependencies) whenever the software vendors have released a vulnerability-free version. In this work, our manual study focuses on the real-world practices (crowd reaction) adopted by software vendors and developer communities when a vulnerability is disclosed. We manually investigated 312 CVEs and identified that the primary trend of vulnerability handling is to provide a fix before publishing an announcement. Otherwise, developers wait an average of 10 days for a fix if it is unavailable upon the announcement. Additionally, the crowd reaction is oblivious to the vulnerability severity. In particular, we identified Oracle as the most vibrant community diligent in releasing fixes. Their software developers also actively participate in the associated vulnerability announcements.
Developers Are Victims Too : A Comprehensive Analysis of The VS Code Extension Ecosystem
Shehan Edirimannage, Charitha Elvitigala, Asitha Kottahachchi Kankanamge Don, Wathsara Daluwatta, Primal Wijesekara, Ibrahim Khalil
Nov 13 2024 cs.CR cs.SE arXiv:2411.07479v1

@misc{2411.07479, author = {Shehan Edirimannage and Charitha Elvitigala and Asitha Kottahachchi Kankanamge Don and Wathsara Daluwatta and Primal Wijesekara and Ibrahim Khalil}, title = {{D}evelopers {A}re {V}ictims {T}oo : {A} {C}omprehensive {A}nalysis of {T}he {VS} {C}ode {E}xtension {E}cosystem}, year = {2024}, eprint = {2411.07479}, note = {arXiv:2411.07479v1} }
With the wave of high-profile supply chain attacks targeting development and client organizations, supply chain security has recently become a focal point. As a result, there is an elevated discussion on securing the development environment and increasing the transparency of the third-party code that runs in software products to minimize any negative impact from third-party code in a software product. However, the literature on secure software development lacks insight into how the third-party development tools used by every developer affect the security posture of the developer, the development organization, and, eventually, the end product. To that end, we have analyzed 52,880 third-party VS Code extensions to understand their threat to the developer, the code, and the development organizations. We found that ~5.6% of the analyzed extensions have suspicious behavior, jeopardizing the integrity of the development environment and potentially leaking sensitive information on the developer's product. We also found that the VS Code hosting the third-party extensions lacks practical security controls and lets untrusted third-party code run unchecked and with questionable capabilities. We offer recommendations on possible avenues for fixing some of the issues uncovered during the analysis.
Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews
Aakash Sorathiya, Gouri Ginde
Nov 13 2024 cs.CL cs.AI cs.SE arXiv:2411.07398v1

@misc{2411.07398, author = {Aakash Sorathiya and Gouri Ginde}, title = {{B}eyond {K}eywords: {A} {C}ontext-based {H}ybrid {A}pproach to {M}ining {E}thical {C}oncern-related {A}pp {R}eviews}, year = {2024}, eprint = {2411.07398}, note = {arXiv:2411.07398v1} }
With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary, making automated ethical concern-related app review extraction a challenging and time-consuming effort. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) evaluates four LLMs for classifying app reviews to privacy concerns; and 3) uses the best NLI and LLM models further to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Then, using NLI+LLM, an additional 1,008 new privacy-related reviews were extracted that were not identified through the keyword-based approach in previous research, thus demonstrating the effectiveness of the proposed approach.
ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?
Salma Begum Tamanna, Gias Uddin, Song Wang, Lan Xia, Longyu Zhang
Nov 13 2024 cs.SE arXiv:2411.07360v1

@misc{2411.07360, author = {Salma Begum Tamanna and Gias Uddin and Song Wang and Lan Xia and Longyu Zhang}, title = {{C}hat{GPT} {I}naccuracy {M}itigation during {T}echnical {R}eport {U}nderstanding: {A}re {W}e {T}here {Y}et?}, year = {2024}, eprint = {2411.07360}, howpublished = {47th IEEE/ACM International Conference on Software Engineering (ICSE 2025)}, note = {arXiv:2411.07360v1} }
Hallucinations, the tendency to produce irrelevant/incorrect responses, are prevalent concerns in generative AI-based tools like ChatGPT. Although hallucinations in ChatGPT are studied for textual responses, it is unknown how ChatGPT hallucinates for technical texts that contain both textual and technical terms. We surveyed 47 software engineers and produced a benchmark of 412 Q&A pairs from the bug reports of two OSS projects. We find that a RAG-based ChatGPT (i.e., ChatGPT tuned with the benchmark issue reports) is 36.4% correct when producing answers to the questions, for two reasons: 1) limitations in understanding complex technical content in code snippets like stack traces, and 2) limitations in integrating contexts denoted in the technical terms and texts. We present CHIME (ChatGPT Inaccuracy Mitigation Engine) whose underlying principle is that if we can preprocess the technical reports better and guide the query validation process in ChatGPT, we can address the observed limitations. CHIME uses context-free grammar (CFG) to parse stack traces in technical reports. CHIME then verifies and fixes ChatGPT responses by applying metamorphic testing and query transformation. In our benchmark, CHIME shows 30.3% more correction over ChatGPT responses. In a user study, we find that the improved responses with CHIME are considered more useful than those generated from ChatGPT without CHIME.
ASTD Patterns for Integrated Continuous Anomaly Detection In Data Logs
Chaymae El Jabri, Marc Frappier, Pierre-Martin Tardif
Nov 13 2024 cs.SE cs.LG arXiv:2411.07272v1

@misc{2411.07272, author = {Chaymae El Jabri and Marc Frappier and Pierre-Martin Tardif}, title = {{ASTD} {P}atterns for {I}ntegrated {C}ontinuous {A}nomaly {D}etection {I}n {D}ata {L}ogs}, year = {2024}, eprint = {2411.07272}, note = {arXiv:2411.07272v1} }
This paper investigates the use of the ASTD language for ensemble anomaly detection in data logs. It uses a sliding window technique for continuous learning in data streams, coupled with updating learning models upon the completion of each window to maintain accurate detection and align with current data trends. It proposes ASTD patterns for combining learning models, especially in the context of unsupervised learning, which is commonly used for data streams. To facilitate this, a new ASTD operator is proposed, the Quantified Flow, which enables the seamless combination of learning models while ensuring that the specification remains concise. Our contribution is a specification pattern, highlighting the capacity of ASTDs to abstract and modularize anomaly detection systems. The ASTD language provides a unique approach to develop data flow anomaly detection systems, grounded in the combination of processes through the graphical representation of the language operators. This simplifies the design task for developers, who can focus primarily on defining the functional operations that constitute the system.
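The windowed retraining idea in this abstract can be illustrated with a minimal Python sketch. It is not ASTD and not the authors' specification: the two detectors, their thresholds, the voting rule, and every name below (`ZScoreDetector`, `MedianDeviationDetector`, `detect`) are assumptions made for illustration. For simplicity the sketch uses consecutive, non-overlapping windows and refits every model when a window completes.

```python
from statistics import mean, median, pstdev


class ZScoreDetector:
    """Hypothetical detector: flags values far from the last window's mean."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.mu = None
        self.sigma = None

    def fit(self, window):
        self.mu = mean(window)
        self.sigma = pstdev(window)

    def is_anomaly(self, value):
        if self.mu is None:  # no completed window yet, so no model
            return False
        if self.sigma == 0:
            return value != self.mu
        return abs(value - self.mu) / self.sigma > self.threshold


class MedianDeviationDetector:
    """Hypothetical detector: flags values far from the last window's median."""

    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.med = None
        self.mad = None

    def fit(self, window):
        self.med = median(window)
        self.mad = median([abs(v - self.med) for v in window])

    def is_anomaly(self, value):
        if self.med is None:  # no completed window yet, so no model
            return False
        if self.mad == 0:
            return value != self.med
        return abs(value - self.med) / self.mad > self.threshold


def detect(stream, detectors, window_size, min_votes=1):
    """Return indices flagged by at least `min_votes` detectors.

    Each value is scored with the models fitted on the previous window;
    all models are refitted as soon as the current window is complete.
    """
    window = []
    flagged = []
    for index, value in enumerate(stream):
        votes = sum(d.is_anomaly(value) for d in detectors)
        if votes >= min_votes:
            flagged.append(index)
        window.append(value)
        if len(window) == window_size:
            for d in detectors:
                d.fit(window)
            window = []
    return flagged


if __name__ == "__main__":
    stream = [10, 11, 9, 10, 12, 10, 11, 9, 100, 10, 11, 10]
    detectors = [ZScoreDetector(), MedianDeviationDetector()]
    print(detect(stream, detectors, window_size=4))  # prints [8]
```

Combining detectors through a vote count keeps the ensemble open to additional models without changing the loop, which is loosely analogous to the concise model combination the abstract attributes to the Quantified Flow operator.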
Teaching Requirements Engineering for AI: A Goal-Oriented Approach in Software Engineering Courses
Beatriz Batista, Márcia Lima, Tayana Conte
Nov 13 2024 cs.CY cs.SE arXiv:2411.07250v1

@misc{2411.07250, author = {Beatriz Batista and Márcia Lima and Tayana Conte}, title = {{T}eaching {R}equirements {E}ngineering for {AI}: {A} {G}oal-{O}riented {A}pproach in {S}oftware {E}ngineering {C}ourses}, year = {2024}, eprint = {2411.07250}, doi = {10.1145/3701625.3701686}, note = {arXiv:2411.07250v1} }
Context: Requirements Engineering for AI-based systems (RE4AI) presents unique challenges due to the inherent volatility and complexity of AI technologies, necessitating the development of specialized methodologies. It is crucial to prepare upcoming software engineers with the abilities to specify high-quality requirements for AI-based systems. Goal: This research aims to evaluate the effectiveness and applicability of Goal-Oriented Requirements Engineering (GORE), specifically the KAOS method, in facilitating requirements elicitation for AI-based systems within an educational context. Method: We conducted an empirical study in an introductory software engineering class, combining presentations, practical exercises, and a survey to assess students' experience using GORE. Results: The analysis revealed that GORE is particularly effective in capturing high-level requirements, such as user expectations and system necessity. However, it is less effective for detailed planning, such as ensuring privacy and handling errors. The majority of students were able to apply the KAOS methodology correctly or with minor inadequacies, indicating its usability and effectiveness in educational settings. Students identified several benefits of GORE, including its goal-oriented nature and structured approach, which facilitated the management of complex requirements. However, challenges such as determining goal refinement stopping criteria and managing diagram complexity were also noted. Conclusion: GORE shows significant potential for enhancing requirements elicitation in AI-based systems. While generally effective, the approach could benefit from additional support and resources to address identified challenges. These findings suggest that GORE can be a valuable tool in both educational and practical contexts, provided that enhancements are made to facilitate its application.
VIEWER: an extensible visual analytics framework for enhancing mental healthcare
Tao Wang, David Codling, Yamiko Msosa, Matthew Broadbent, Daisy Kornblum, Catherine Polling, Thomas Searle, Claire Delaney-Pope, Barbara Arroyo, Stuart MacLellan, Zoe Keddie, Mary Docherty, Angus Roberts, Robert Stewart, Richard Dobson, Robert Harland
Nov 13 2024 cs.HC cs.SE arXiv:2411.07247v1

@misc{2411.07247, author = {Tao Wang and David Codling and Yamiko Msosa and Matthew Broadbent and Daisy Kornblum and Catherine Polling and Thomas Searle and Claire Delaney-Pope and Barbara Arroyo and Stuart MacLellan and Zoe Keddie and Mary Docherty and Angus Roberts and Robert Stewart and Richard Dobson and Robert Harland}, title = {{VIEWER}: an extensible visual analytics framework for enhancing mental healthcare}, year = {2024}, eprint = {2411.07247}, note = {arXiv:2411.07247v1} }
Objective: To design and implement VIEWER, a versatile toolkit for visual analytics of clinical data, and to systematically evaluate its effectiveness across various clinical applications while gathering feedback for iterative improvements. Materials and Methods: VIEWER is an open-source and extensible toolkit that employs distributed natural language processing and interactive visualisation techniques to facilitate the rapid design, development, and deployment of clinical information retrieval, analysis, and visualisation at the point of care. Through an iterative and collaborative participatory design approach, VIEWER was designed and implemented in a large mental health institution, where its clinical utility and effectiveness were assessed using both quantitative and qualitative methods. Results: VIEWER provides interactive, problem-focused, and comprehensive views of longitudinal patient data from a combination of structured clinical data and unstructured clinical notes. Despite a relatively short adoption period and users' initial unfamiliarity, VIEWER significantly improved performance and task completion speed compared to the standard clinical information system. Users and stakeholders reported high satisfaction and expressed strong interest in incorporating VIEWER into their daily practice. Discussion: VIEWER provides a cost-effective enhancement to the functionalities of standard clinical information systems, with evaluation offering valuable feedback for future improvements. Conclusion: VIEWER was developed to improve data accessibility and representation across various aspects of healthcare delivery, including population health management and patient monitoring. The deployment of VIEWER highlights the benefits of collaborative refinement in optimizing health informatics solutions for enhanced patient care.
