CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code

Published: 01 March 2006

Abstract

Recent studies have shown that large software suites contain significant amounts of replicated code. It is assumed that some of this replication is due to copy-and-paste activity and that a significant proportion of bugs in operating systems is due to copy-paste errors. Existing static code analyzers either do not scale to large software suites or do not perform robustly when replicated code is modified with insertions and deletions. Furthermore, existing tools do not detect copy-paste-related bugs. In this paper, we propose a tool, CP-Miner, that uses data mining techniques to efficiently identify copy-pasted code in large software suites and detect copy-paste bugs. Specifically, CP-Miner takes less than 20 minutes to identify 190,000 copy-pasted segments in Linux and 150,000 in FreeBSD. Moreover, CP-Miner has detected many new bugs in popular operating systems, 49 in Linux and 31 in FreeBSD, most of which have since been confirmed by the corresponding developers and rectified in subsequent releases. In addition, we have found some interesting characteristics of copy-paste in operating system code. In particular, we analyze the distribution of copy-pasted code by size (number of lines of code), granularity (basic blocks and functions), and modification within copy-pasted code. We also analyze copy-paste across different modules and various software versions.
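The abstract does not spell out the mining algorithm, so the following is only a rough illustration of the two ideas it names: finding copy-pasted segments and flagging a likely copy-paste bug. It is not CP-Miner's method. As a stand-in for data mining, it matches fixed-size windows of statements after replacing identifiers with a placeholder, so unlike the tool described above it does not tolerate inserted or deleted statements. All function names, the keyword set, and the sample input are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a tiny keyword set, enough for the sample input below.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}
IDENT = re.compile(r"\b[A-Za-z_]\w*")


def identifiers(stmt):
    """Non-keyword identifiers of a statement, in order of appearance."""
    return [t for t in IDENT.findall(stmt) if t not in KEYWORDS]


def normalize(stmt):
    """Replace each non-keyword identifier with 'ID' and drop whitespace."""
    def repl(m):
        return m.group(0) if m.group(0) in KEYWORDS else "ID"
    return re.sub(r"\s+", "", IDENT.sub(repl, stmt))


def find_clone_pairs(stmts, window=3):
    """Start indices (i, j) of non-overlapping windows that match after normalization."""
    norm = [normalize(s) for s in stmts]
    seen = defaultdict(list)
    for i in range(len(norm) - window + 1):
        seen[tuple(norm[i:i + window])].append(i)
    pairs = []
    for starts in seen.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                if starts[b] - starts[a] >= window:  # skip overlapping windows
                    pairs.append((starts[a], starts[b]))
    return sorted(pairs)


def inconsistent_renames(stmts, i, j, window=3):
    """Identifiers in the copy at i that map to more than one identifier in the copy at j."""
    mapping = defaultdict(set)
    for k in range(window):
        for a, b in zip(identifiers(stmts[i + k]), identifiers(stmts[j + k])):
            mapping[a].add(b)
    return {a: sorted(bs) for a, bs in mapping.items() if len(bs) > 1}


# Hypothetical input: the second loop was pasted from the first, and one
# occurrence of sum_a was not renamed to sum_b.
code = [
    "for (i = 0; i < n; i++)",
    "sum_a += a[i];",
    "total = sum_a;",
    "for (j = 0; j < m; j++)",
    "sum_b += b[j];",
    "total = sum_a;",
]

for i, j in find_clone_pairs(code):
    print(i, j, inconsistent_renames(code, i, j))
# prints: 0 3 {'sum_a': ['sum_a', 'sum_b']}
```

An identifier that is renamed in some places but left unchanged in others is only a hint that the programmer forgot a rename. A real tool would need ranking or pruning to keep false positives manageable.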

Published In

IEEE Transactions on Software Engineering, Volume 32, Issue 3
March 2006
75 pages

Publisher

IEEE Press

Author Tags

  1. Software analysis
  2. code duplication
  3. code reuse
  4. data mining
  5. debugging aids

Qualifiers

  • Research-article

Cited By

  • (2024) Dataset: Copy-based Reuse in Open Source Software. Proceedings of the 21st International Conference on Mining Software Repositories, pp. 42-47. DOI: 10.1145/3643991.3644868. Online publication date: 15-Apr-2024
  • (2024) DSFM: Enhancing Functional Code Clone Detection with Deep Subtree Interactions. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-12. DOI: 10.1145/3597503.3639215. Online publication date: 20-May-2024
  • (2024) CNEPS: A Precise Approach for Examining Dependencies among Third-Party C/C++ Open-Source Components. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-12. DOI: 10.1145/3597503.3639209. Online publication date: 20-May-2024
  • (2024) RecurScan: Detecting Recurring Vulnerabilities in PHP Web Applications. Proceedings of the ACM Web Conference 2024, pp. 1746-1755. DOI: 10.1145/3589334.3645530. Online publication date: 13-May-2024
  • (2024) Enhancing vulnerability detection via AST decomposition and neural sub-tree encoding. Expert Systems with Applications: An International Journal, 238:PB. DOI: 10.1016/j.eswa.2023.121865. Online publication date: 27-Feb-2024
  • (2023) A Survey of Tool Support for Working with Design Decisions in Code. ACM Computing Surveys, 56:2, pp. 1-37. DOI: 10.1145/3607868. Online publication date: 10-Jul-2023
  • (2023) Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints. IEEE Transactions on Software Engineering, 49:8, pp. 3972-3988. DOI: 10.1109/TSE.2023.3276780. Online publication date: 1-Aug-2023
  • (2023) Challenging Machine Learning-Based Clone Detectors via Semantic-Preserving Code Transformations. IEEE Transactions on Software Engineering, 49:5, pp. 3052-3070. DOI: 10.1109/TSE.2023.3240118. Online publication date: 1-May-2023
  • (2023) DiffSearch: A Scalable and Precise Search Engine for Code Changes. IEEE Transactions on Software Engineering, 49:4, pp. 2366-2380. DOI: 10.1109/TSE.2022.3218859. Online publication date: 19-Apr-2023
  • (2023) Almost Rerere: Learning to Resolve Conflicts in Distributed Projects. IEEE Transactions on Software Engineering, 49:4, pp. 2255-2271. DOI: 10.1109/TSE.2022.3215289. Online publication date: 1-Apr-2023