DOI: 10.1145/3395363.3397384

Learning to detect table clones in spreadsheets

Published: 18 July 2020

Abstract

To speed up spreadsheet development, end users often create a new table by copying and modifying an existing one. The two tables share similar computational semantics and form a table clone. End users may later modify the tables in a table clone, e.g., by adding rows or deleting columns, which introduces structure changes into the clone. Our empirical study on real-world spreadsheets shows that about 58.5% of table clones involve structure changes. However, existing table clone detection approaches for spreadsheets can only detect table clones whose tables have the same structure, so many table clones with structure changes go undetected.
We observe that, although the tables in a table clone may be modified, they usually retain similar structures and formats, e.g., headers, formulas, and background colors. Based on this observation, we propose LTC (Learning to detect Table Clones) to automatically detect table clones with or without structure changes. LTC uses the structure and format information of labeled table clones and non-clones to train a binary classifier. It first identifies the tables in spreadsheets and then uses the trained classifier to judge whether each pair of tables forms a table clone. Our experiments on real-world spreadsheets from the EUSES and Enron corpora show that LTC achieves a precision of 97.8% and a recall of 92.1% in table clone detection, significantly outperforming the state-of-the-art technique (a precision of 37.5% and a recall of 11.1%).
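
The pairwise workflow described above (summarize each identified table, compare every two tables, and let a trained binary classifier decide) can be sketched as follows. This is only an illustrative sketch, not LTC's implementation: the abstract does not specify the feature set or the learning algorithm, so the `Table` fields, the Jaccard-overlap features, the 1-nearest-neighbor stand-in classifier, and all sample data are assumptions made here for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass(frozen=True)
class Table:
    """Hypothetical summary of one identified table (not LTC's real model)."""
    name: str
    headers: frozenset   # header labels
    formulas: frozenset  # formulas, assumed normalized to relative references
    colors: frozenset    # background colors used in the table


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pair_features(t1, t2):
    """Assumed features: overlap of headers, formulas, and colors."""
    return (
        jaccard(t1.headers, t2.headers),
        jaccard(t1.formulas, t2.formulas),
        jaccard(t1.colors, t2.colors),
    )


class NearestNeighborClassifier:
    """Stand-in binary classifier: returns the label of the closest sample."""

    def fit(self, vectors, labels):
        self.samples = list(zip(vectors, labels))
        return self

    def predict(self, vector):
        _, label = min(self.samples, key=lambda s: math.dist(s[0], vector))
        return label


def detect_table_clones(tables, classifier):
    """Judge every pair of tables and keep the pairs labeled 1 (clone)."""
    return [
        (a.name, b.name)
        for a, b in combinations(tables, 2)
        if classifier.predict(pair_features(a, b)) == 1
    ]


# Made-up labeled pairs: 1 = table clone, 0 = not a table clone.
train_vectors = [(1.0, 1.0, 1.0), (0.8, 0.5, 1.0), (0.1, 0.0, 0.5), (0.0, 0.0, 0.0)]
train_labels = [1, 1, 0, 0]
classifier = NearestNeighborClassifier().fit(train_vectors, train_labels)

# Made-up tables: budget_2020 is a copy of budget_2019 with an added column.
budget_2019 = Table(
    "budget_2019",
    frozenset({"Item", "Q1", "Q2", "Total"}),
    frozenset({"SUM(RC[-2]:RC[-1])"}),
    frozenset({"gray", "white"}),
)
budget_2020 = Table(
    "budget_2020",
    frozenset({"Item", "Q1", "Q2", "Q3", "Total"}),
    frozenset({"SUM(RC[-2]:RC[-1])", "SUM(RC[-3]:RC[-1])"}),
    frozenset({"gray", "white"}),
)
contacts = Table(
    "contacts",
    frozenset({"Name", "Phone"}),
    frozenset(),
    frozenset({"white"}),
)

clones = detect_table_clones([budget_2019, budget_2020, contacts], classifier)
print(clones)
```

In this toy data the two budget tables differ in structure (one has an extra column) yet still share most headers, one formula, and both colors, which is the kind of residual similarity the abstract says a learned classifier can exploit. Any real system would need richer features and a classifier validated on labeled spreadsheet data.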

Published In

ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2020
591 pages
ISBN: 9781450380089
DOI: 10.1145/3395363

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Spreadsheet
  2. format
  3. structure
  4. table clone

Qualifiers

  • Research-article

Conference

ISSTA '20

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Cited By

  • (2024) SimClone: Detecting Tabular Data Clones using Value Similarity. ACM Transactions on Software Engineering and Methodology. DOI: 10.1145/3676961. Online publication date: 16-Jul-2024.
  • (2024) Spreadsheet quality assurance: a literature review. Frontiers of Computer Science 18:2. DOI: 10.1007/s11704-023-2384-6. Online publication date: 22-Jan-2024.
  • (2023) Ferret: Reviewing Tabular Datasets for Manipulation. Computer Graphics Forum 42:3 (187-198). DOI: 10.1111/cgf.14822. Online publication date: 27-Jun-2023.
  • (2022) Facilitating the co-evolution of semantic descriptions in standards and models. Information and Software Technology 143:C. DOI: 10.1016/j.infsof.2021.106763. Online publication date: 1-Mar-2022.
  • (2021) Semantic table structure identification in spreadsheets. Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (283-295). DOI: 10.1145/3460319.3464812. Online publication date: 11-Jul-2021.
  • (2020) Facilitating the Co-Evolution of Semantic Descriptions in Standards and Models. Proceedings of the 12th System Analysis and Modelling Conference (75-84). DOI: 10.1145/3419804.3421449. Online publication date: 19-Oct-2020.
