Software Engineering (cs.SE)
- Nov 13 2024 cs.SE arXiv:2411.07982v1 While the observability framework Kieker has a low overhead for tracing, its results currently cannot be used in most analysis tools due to lack of interoperability of the data formats. The OpenTelemetry standard aims for standardizing observability data. In this work, we describe how to export Kieker distributed tracing data to OpenTelemetry. This is done using the pipe-and-filter framework TeeTime. For TeeTime, a stage was defined that uses Kieker execution data, which can be created from most record types. We demonstrate the usability of our approach by visualizing trace data of TeaStore in the ExplorViz visualization tool.
- With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at https://github.com/AI-secure/RedCode.
- Smart contracts, primarily written in Solidity, are integral to blockchain software applications, yet precise analysis and maintenance are hindered by the limitations of existing differencing tools. We introduce SoliDiffy, a novel Abstract Syntax Tree (AST) differencing tool specifically designed for Solidity. SoliDiffy enables fine-grained analysis by generating accurate and concise edit scripts of smart contracts, making it ideal for downstream tasks such as vulnerability detection, automated code repair, and code reviews. Our comprehensive evaluation on a large dataset of real-world Solidity contracts demonstrates that SoliDiffy delivers shorter and more precise edit scripts compared to state-of-the-art tools, while performing consistently in complex contract modifications. SoliDiffy is made publicly available at https://github.com/mojtaba-eshghie/SoliDiffy.
- Nov 13 2024 cs.SE arXiv:2411.07668v1 In the short period since the release of ChatGPT in November 2022, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process (e.g., for data annotation) or studies that evaluate existing or new tools that are based on LLMs. This paper contributes the first set of guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of what our community standards are for high-quality empirical studies involving LLMs.
- ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform with algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we conduct automated experiments using Python scripts to generate prompts that instruct ChatGPT to create Python solutions. These solutions are stored and manually submitted on LeetCode to check their correctness. For Hypothesis 1, results show the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought Prompting, 38-60% by providing failed test cases in a second feedback prompt, and 33-58% by switching to GPT-4. From a random subset of problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.
- Nov 13 2024 cs.SE arXiv:2411.07480v1 Reusing third-party libraries increases productivity and saves time and costs for developers. However, the downside is the presence of vulnerabilities in those libraries, which can lead to catastrophic outcomes. For instance, Apache Log4J was found to be vulnerable to remote code execution attacks. A total of more than 35,000 packages were forced to update their Log4J libraries with the latest version. Although several studies have been conducted to predict software vulnerabilities, the prediction does not cover the vulnerabilities found in third-party libraries. Even if the developers are aware of the forthcoming issue, replicating a function similar to the libraries would be time-consuming and labour-intensive. Nevertheless, it is practically reasonable for software developers to update their third-party libraries (and dependencies) whenever the software vendors have released a vulnerability-free version. In this work, our manual study focuses on the real-world practices (crowd reaction) adopted by software vendors and developer communities when a vulnerability is disclosed. We manually investigated 312 CVEs and identified that the primary trend of vulnerability handling is to provide a fix before publishing an announcement. Otherwise, developers wait an average of 10 days for a fix if it is unavailable upon the announcement. Additionally, the crowd reaction is oblivious to the vulnerability severity. In particular, we identified Oracle as the most vibrant community diligent in releasing fixes. Their software developers also actively participate in the associated vulnerability announcements.
- With the wave of high-profile supply chain attacks targeting development and client organizations, supply chain security has recently become a focal point. As a result, there is an elevated discussion on securing the development environment and increasing the transparency of the third-party code that runs in software products to minimize any negative impact from third-party code in a software product. However, the literature on secure software development lacks insight into how the third-party development tools used by every developer affect the security posture of the developer, the development organization, and, eventually, the end product. To that end, we have analyzed 52,880 third-party VS Code extensions to understand their threat to the developer, the code, and the development organizations. We found that ~5.6% of the analyzed extensions have suspicious behavior, jeopardizing the integrity of the development environment and potentially leaking sensitive information on the developer's product. We also found that the VS Code hosting the third-party extensions lacks practical security controls and lets untrusted third-party code run unchecked and with questionable capabilities. We offer recommendations on possible avenues for fixing some of the issues uncovered during the analysis.
- With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary, which makes automated extraction of ethical concern-related app reviews a challenging and time-consuming effort. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) Evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) Evaluates four LLMs for classifying app reviews to privacy concerns; and 3) Uses the best NLI and LLM models further to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Then, using NLI+LLM, an additional 1,008 new privacy-related reviews were extracted that were not identified through the keyword-based approach in previous research, thus demonstrating the effectiveness of the proposed approach.
- Nov 13 2024 cs.SE arXiv:2411.07360v1 Hallucinations, the tendency to produce irrelevant/incorrect responses, are prevalent concerns in generative AI-based tools like ChatGPT. Although hallucinations in ChatGPT are studied for textual responses, it is unknown how ChatGPT hallucinates for technical texts that contain both textual and technical terms. We surveyed 47 software engineers and produced a benchmark of 412 Q&A pairs from the bug reports of two OSS projects. We find that a RAG-based ChatGPT (i.e., ChatGPT tuned with the benchmark issue reports) is 36.4% correct when producing answers to the questions, for two reasons: 1) limitations in understanding complex technical contents in code snippets like stack traces, and 2) limitations in integrating contexts denoted in the technical terms and texts. We present CHIME (ChatGPT Inaccuracy Mitigation Engine) whose underlying principle is that if we can preprocess the technical reports better and guide the query validation process in ChatGPT, we can address the observed limitations. CHIME uses context-free grammar (CFG) to parse stack traces in technical reports. CHIME then verifies and fixes ChatGPT responses by applying metamorphic testing and query transformation. In our benchmark, CHIME shows 30.3% more correction over ChatGPT responses. In a user study, we find that the improved responses with CHIME are considered more useful than those generated from ChatGPT without CHIME.
- This paper investigates the use of the ASTD language for ensemble anomaly detection in data logs. It uses a sliding window technique for continuous learning in data streams, coupled with updating learning models upon the completion of each window to maintain accurate detection and align with current data trends. It proposes ASTD patterns for combining learning models, especially in the context of unsupervised learning, which is commonly used for data streams. To facilitate this, a new ASTD operator is proposed, the Quantified Flow, which enables the seamless combination of learning models while ensuring that the specification remains concise. Our contribution is a specification pattern, highlighting the capacity of ASTDs to abstract and modularize anomaly detection systems. The ASTD language provides a unique approach to develop data flow anomaly detection systems, grounded in the combination of processes through the graphical representation of the language operators. This simplifies the design task for developers, who can focus primarily on defining the functional operations that constitute the system.
- Context: Requirements Engineering for AI-based systems (RE4AI) presents unique challenges due to the inherent volatility and complexity of AI technologies, necessitating the development of specialized methodologies. It is crucial to prepare upcoming software engineers with the abilities to specify high-quality requirements for AI-based systems. Goal: This research aims to evaluate the effectiveness and applicability of Goal-Oriented Requirements Engineering (GORE), specifically the KAOS method, in facilitating requirements elicitation for AI-based systems within an educational context. Method: We conducted an empirical study in an introductory software engineering class, combining presentations, practical exercises, and a survey to assess students' experience using GORE. Results: The analysis revealed that GORE is particularly effective in capturing high-level requirements, such as user expectations and system necessity. However, it is less effective for detailed planning, such as ensuring privacy and handling errors. The majority of students were able to apply the KAOS methodology correctly or with minor inadequacies, indicating its usability and effectiveness in educational settings. Students identified several benefits of GORE, including its goal-oriented nature and structured approach, which facilitated the management of complex requirements. However, challenges such as determining goal refinement stopping criteria and managing diagram complexity were also noted. Conclusion: GORE shows significant potential for enhancing requirements elicitation in AI-based systems. While generally effective, the approach could benefit from additional support and resources to address identified challenges. These findings suggest that GORE can be a valuable tool in both educational and practical contexts, provided that enhancements are made to facilitate its application.
- Objective: To design and implement VIEWER, a versatile toolkit for visual analytics of clinical data, and to systematically evaluate its effectiveness across various clinical applications while gathering feedback for iterative improvements. Materials and Methods: VIEWER is an open-source and extensible toolkit that employs distributed natural language processing and interactive visualisation techniques to facilitate the rapid design, development, and deployment of clinical information retrieval, analysis, and visualisation at the point of care. Through an iterative and collaborative participatory design approach, VIEWER was designed and implemented in a large mental health institution, where its clinical utility and effectiveness were assessed using both quantitative and qualitative methods. Results: VIEWER provides interactive, problem-focused, and comprehensive views of longitudinal patient data from a combination of structured clinical data and unstructured clinical notes. Despite a relatively short adoption period and users' initial unfamiliarity, VIEWER significantly improved performance and task completion speed compared to the standard clinical information system. Users and stakeholders reported high satisfaction and expressed strong interest in incorporating VIEWER into their daily practice. Discussion: VIEWER provides a cost-effective enhancement to the functionalities of standard clinical information systems, with evaluation offering valuable feedback for future improvements. Conclusion: VIEWER was developed to improve data accessibility and representation across various aspects of healthcare delivery, including population health management and patient monitoring. The deployment of VIEWER highlights the benefits of collaborative refinement in optimizing health informatics solutions for enhanced patient care.