DOI: 10.1145/3540250.3558925
research-article
Open access

PolyFax: a toolkit for characterizing multi-language software

Published: 09 November 2022 Publication History

Abstract

Today’s software systems are mostly developed in multiple languages (i.e., multi-language software), yet tool support for understanding and assuring these systems is rare. To facilitate future research on multi-language software engineering, this paper presents PolyFax, a toolkit that offers automated dataset collection from GitHub and two analysis utilities: a vulnerability-fixing commit categorization tool (VCC) and a language interfacing mechanism identification/categorization tool (LIC). The VCC tool directly supports assessing the vulnerability proneness of a given multi-language project based on its version history, while the LIC tool enables dissection of the most important aspect of how multi-language systems are constructed. Applying PolyFax to 7,113 multi-language projects with 12.6 million commits showed its practical usefulness, with promising efficiency and accuracy for studying multi-language software.
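
As a rough illustration of what categorizing vulnerability-fixing commits from a version history could involve, the sketch below assigns a commit message to a category by fuzzy keyword matching. It is not PolyFax's implementation: the category names, keyword lists, the 0.85 threshold, and the function names are all hypothetical, and only Python's standard library is used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical categories and key phrases, chosen for illustration only.
CATEGORY_KEYWORDS = {
    "memory": ["buffer overflow", "use after free", "out of bounds", "memory leak"],
    "injection": ["sql injection", "command injection", "xss"],
    "crypto": ["weak cipher", "insecure random", "certificate validation"],
}


def normalize(message):
    """Lowercase the message and collapse non-alphanumeric runs into spaces."""
    return re.sub(r"[^a-z0-9]+", " ", message.lower()).strip()


def best_window_score(phrase, tokens):
    """Best similarity between a phrase and any token window of the same word count."""
    n = len(phrase.split())
    last_start = max(1, len(tokens) - n + 1)
    windows = [" ".join(tokens[i:i + n]) for i in range(last_start)]
    return max(SequenceMatcher(None, phrase, w).ratio() for w in windows)


def categorize(message, threshold=0.85):
    """Return the best-matching category, or None if no phrase reaches the threshold."""
    tokens = normalize(message).split()
    if not tokens:
        return None
    best_category, best_score = None, threshold
    for category, phrases in CATEGORY_KEYWORDS.items():
        for phrase in phrases:
            score = best_window_score(phrase, tokens)
            if score >= best_score:
                best_category, best_score = category, score
    return best_category
```

Fuzzy matching lets a misspelled message such as "Fix buffer overflw in parser" still land in the hypothetical "memory" category, while an unrelated message such as "Update README" matches nothing and yields `None`.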

Information & Contributors

Information

Published In

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2022
1822 pages
ISBN:9781450394130
DOI:10.1145/3540250
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-language vulnerability
  2. language interfacing
  3. multi-language software
  4. multilingual code
  5. regression analysis
  6. software security

Qualifiers

  • Research-article

Conference

ESEC/FSE '22
Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Article Metrics

  • Downloads (last 12 months): 370
  • Downloads (last 6 weeks): 35
Reflects downloads up to 05 Mar 2025

Cited By

  • (2024) Learning to Detect and Localize Multilingual Bugs. Proceedings of the ACM on Software Engineering 1:FSE (2190-2213). DOI: 10.1145/3660804. Online publication date: 12-Jul-2024
  • (2024) How Are Multilingual Systems Constructed: Characterizing Language Use and Selection in Open-Source Multilingual Software. ACM Transactions on Software Engineering and Methodology 33:3 (1-46). DOI: 10.1145/3631967. Online publication date: 14-Mar-2024
  • (2024) VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639116. Online publication date: 20-May-2024
  • (2024) Multi-Language Software Development: Issues, Challenges, and Solutions. IEEE Transactions on Software Engineering 50:3 (512-533). DOI: 10.1109/TSE.2024.3358258. Online publication date: Mar-2024
  • (2023) PyRTFuzz: Detecting Bugs in Python Runtimes via Two-Level Collaborative Fuzzing. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (1645-1659). DOI: 10.1145/3576915.3623166. Online publication date: 15-Nov-2023
  • (2023) VULGEN: Realistic Vulnerability Generation Via Pattern Mining and Deep Learning. Proceedings of the 45th International Conference on Software Engineering (2527-2539). DOI: 10.1109/ICSE48619.2023.00211. Online publication date: 14-May-2023
  • (2023) Demystifying Issues, Challenges, and Solutions for Multilingual Software Development. Proceedings of the 45th International Conference on Software Engineering (1840-1852). DOI: 10.1109/ICSE48619.2023.00157. Online publication date: 14-May-2023
