research-article

Learning to detect table clones in spreadsheets

Authors:

Bo Yang

ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pages 528 - 540

https://doi.org/10.1145/3395363.3397384

Published: 18 July 2020

Abstract

To speed up spreadsheet development, end users can create a spreadsheet table by copying and modifying an existing one. The two tables share similar computational semantics and form a table clone. End users may later modify the tables in a table clone, e.g., by adding new rows or deleting columns, thus introducing structure changes into the table clone. Our empirical study on real-world spreadsheets shows that about 58.5% of table clones involve structure changes. However, existing table clone detection approaches for spreadsheets can only detect table clones whose tables have the same structure. Therefore, many table clones with structure changes cannot be detected.

We observe that, although the tables in a table clone may be modified, they usually share similar structures and formats, e.g., headers, formulas, and background colors. Based on this observation, we propose LTC (Learning to detect Table Clones) to automatically detect table clones with or without structure changes. LTC uses the structure and format information from labeled table clones and non-table-clones to train a binary classifier. LTC first identifies tables in spreadsheets, and then uses the trained binary classifier to judge whether each pair of tables forms a table clone. Our experiments on real-world spreadsheets from the EUSES and Enron corpora show that LTC achieves a precision of 97.8% and a recall of 92.1% in table clone detection, significantly outperforming the state-of-the-art technique (a precision of 37.5% and a recall of 11.1%).
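
The pairwise detection step described above can be illustrated with a minimal sketch. It is not LTC's implementation: the abstract does not list the actual features or the classifier, so the `Table` fields, the Jaccard-based features, and the fixed-cutoff `threshold_classifier` (a stand-in for a trained model) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Table:
    """Hypothetical summary of an identified table (field names are assumptions)."""
    name: str
    headers: frozenset
    formulas: frozenset  # formulas in a relative (R1C1-style) notation
    colors: frozenset    # background colors used in the table


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pair_features(t1, t2):
    """Feature vector for a candidate pair: header, formula, and color overlap."""
    return [
        jaccard(t1.headers, t2.headers),
        jaccard(t1.formulas, t2.formulas),
        jaccard(t1.colors, t2.colors),
    ]


def threshold_classifier(features, cutoff=0.5):
    """Stand-in for a trained binary classifier: mean similarity vs. a cutoff."""
    return sum(features) / len(features) >= cutoff


def detect_clones(tables, classify):
    """Apply the binary classifier to every unordered pair of tables."""
    return [
        (a.name, b.name)
        for a, b in combinations(tables, 2)
        if classify(pair_features(a, b))
    ]


# budget_2020 is a copy of budget_2019 with one added column (a structure change).
budget_2019 = Table(
    "budget_2019",
    frozenset({"Item", "Q1", "Q2", "Total"}),
    frozenset({"=SUM(RC[-2]:RC[-1])", "=RC[-1]*0.1"}),
    frozenset({"#FFFF00"}),
)
budget_2020 = Table(
    "budget_2020",
    frozenset({"Item", "Q1", "Q2", "Q3", "Total"}),
    frozenset({"=SUM(RC[-3]:RC[-1])", "=RC[-1]*0.1"}),
    frozenset({"#FFFF00"}),
)
contacts = Table(
    "contacts",
    frozenset({"Name", "Phone", "Email"}),
    frozenset(),
    frozenset({"#FFFFFF"}),
)

print(detect_clones([budget_2019, budget_2020, contacts], threshold_classifier))
# → [('budget_2019', 'budget_2020')]
```

In this toy input the added column changes one formula and one header, yet the remaining overlap in headers, formulas, and colors still marks the pair as a clone. A learned classifier would replace the fixed cutoff with a decision boundary fitted to labeled table clones and non-table-clones.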

Index Terms

1. Applied computing
  1. Computers in other domains
    1. Personal computers and PC applications
      1. Spreadsheets
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging

Published In

ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis

July 2020

591 pages

ISBN: 9781450380089

DOI: 10.1145/3395363

General Chair: Sarfraz Khurshid, University of Texas at Austin, USA

Program Chair: Corina S. Păsăreanu, Carnegie Mellon University Silicon Valley / NASA Ames Research Center, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ISSTA '20

Sponsor:

SIGSOFT

ISSTA '20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis

July 18 - 22, 2020

Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%
