-
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Authors:
Daoguang Zan,
Zhirong Huang,
Ailun Yu,
Shaoxin Lin,
Yifan Shi,
Wei Liu,
Dong Chen,
Zongshuai Qi,
Hao Yu,
Lei Yu,
Dezhi Ran,
Muhan Zeng,
Bo Shen,
Pan Bian,
Guangtai Liang,
Bei Guan,
Pengjie Huang,
Tao Xie,
Yongji Wang,
Qianxiang Wang
Abstract:
GitHub issue resolving is a critical task in software engineering and has recently gained significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate the issue-resolving capabilities of large language models (LLMs), but it has so far focused only on Python. However, supporting more programming languages is also important, as there is strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. As is well known, developing a high-quality multilingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.
Submitted 26 August, 2024;
originally announced August 2024.
-
CodeR: Issue Resolving with Multi-Agent and Task Graphs
Authors:
Dong Chen,
Shaoxin Lin,
Muhan Zeng,
Daoguang Zan,
Jian-Gang Wang,
Anton Cheshkov,
Jun Sun,
Hao Yu,
Guoliang Dong,
Artem Aliev,
Jie Wang,
Xiao Cheng,
Guangtai Liang,
Yuchi Ma,
Pan Bian,
Tao Xie,
Qianxiang Wang
Abstract:
GitHub issue resolving has recently attracted significant attention from academia and industry. SWE-bench was proposed to measure performance in resolving issues. In this paper, we propose CodeR, which adopts a multi-agent framework and pre-defined task graphs to Repair & Resolve reported bugs and add new features within a code Repository. On SWE-bench lite, CodeR is able to solve 28.33% of issues when submitting only once for each issue. We examine the performance impact of each design choice in CodeR and offer insights to advance this research direction.
Submitted 10 June, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Deep Text Classification Can be Fooled
Authors:
Bin Liang,
Hongcheng Li,
Miaoqiang Su,
Pan Bian,
Xirong Li,
Wenchang Shi
Abstract:
In this paper, we present an effective method to craft text adversarial samples, revealing one important yet underestimated fact: DNN-based text classifiers are also prone to adversarial sample attacks. Specifically, depending on the adversarial scenario, the text items that are important for classification are identified either by computing the cost gradients of the input (white-box attack) or by generating a series of occluded test samples (black-box attack). Based on these items, we design three perturbation strategies, namely insertion, modification, and removal, to generate adversarial samples. The experimental results show that the adversarial samples generated by our method can successfully fool both state-of-the-art character-level and word-level DNN-based text classifiers. The adversarial samples can be perturbed to any desired class without compromising their utility. At the same time, the introduced perturbation is difficult to perceive.
Submitted 7 January, 2019; v1 submitted 26 April, 2017;
originally announced April 2017.