DOI: 10.1109/ICSE.2019.00035
research-article

Automatically generating precise oracles from structured natural language specifications

Published: 25 May 2019

Abstract

Software specifications often use natural language to describe the desired behavior, but such specifications are difficult to verify automatically. We present Swami, an automated technique that extracts test oracles and generates executable tests from structured natural language specifications. Swami focuses on exceptional behavior and boundary conditions, which often cause field failures but for which developers often fail to write tests manually. In an evaluation on the official JavaScript specification (ECMA-262), 98.4% of the tests Swami generated were precise to the specification. Using Swami to augment developer-written test suites improved coverage and identified 1 previously unknown defect and 15 missing JavaScript features in Rhino, 1 previously unknown defect in Node.js, and 18 semantic ambiguities in the ECMA-262 specification.
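
To make the idea of deriving an oracle from structured specification text concrete, here is a minimal Python sketch. It is not Swami's implementation, and the abstract does not describe the authors' extraction rules. The regular expression, the `Oracle` record, and the helpers `extract_oracles` and `render_js_test` are hypothetical names introduced only for illustration, and the sample sentences are modeled on the "If ..., throw a RangeError exception." phrasing that ECMA-262 uses in its algorithm steps (the step numbers are illustrative).

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for one common phrasing of a specification step:
# "If <condition>, throw a <ErrorType> exception."
THROW_STEP = re.compile(
    r"If (?P<condition>.+?), throw an? (?P<error>\w+) exception\."
)


@dataclass(frozen=True)
class Oracle:
    condition: str  # condition text copied from the specification step
    error: str      # exception type the implementation is expected to throw


def extract_oracles(spec_text: str) -> list[Oracle]:
    """Collect (condition, error) pairs from steps that match THROW_STEP."""
    return [
        Oracle(m.group("condition"), m.group("error"))
        for m in THROW_STEP.finditer(spec_text)
    ]


def render_js_test(call: str, oracle: Oracle) -> str:
    """Render a JavaScript test that fails unless `call` throws oracle.error.

    Assumes the condition text contains no single quotes.
    """
    return (
        "try {\n"
        f"  {call};\n"
        f"  throw new Error('expected {oracle.error}: {oracle.condition}');\n"
        "} catch (e) {\n"
        f"  if (!(e instanceof {oracle.error})) throw e;\n"
        "}\n"
    )


if __name__ == "__main__":
    spec = (
        "5. If n < 0, throw a RangeError exception.\n"
        "6. If n is +\u221e, throw a RangeError exception.\n"
    )
    oracles = extract_oracles(spec)
    # The test input 'abc'.repeat(-1) is chosen by hand to satisfy "n < 0".
    print(render_js_test("'abc'.repeat(-1)", oracles[0]))
```

The sketch covers only the oracle side: it records which exception a condition should trigger. Choosing inputs that satisfy each condition is done by hand here, whereas a full technique would also need to generate those inputs.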

Published In

ICSE '19: Proceedings of the 41st International Conference on Software Engineering
May 2019
1318 pages

Publisher

IEEE Press

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Cited By

  • (2024) DAInfer: Inferring API Aliasing Specifications from Library Documentation via Neurosymbolic Optimization. Proceedings of the ACM on Software Engineering 1:FSE (2469-2492). DOI: 10.1145/3660816. Online publication date: 12-Jul-2024
  • (2024) Automated Program Repair, What Is It Good For? Not Absolutely Nothing! Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI: 10.1145/3597503.3639095. Online publication date: 20-May-2024
  • (2023) Enhancing REST API Testing with NLP Techniques. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (1232-1243). DOI: 10.1145/3597926.3598131. Online publication date: 12-Jul-2023
  • (2023) Better Automatic Program Repair by Using Bug Reports and Tests Together. Proceedings of the 45th International Conference on Software Engineering (1225-1237). DOI: 10.1109/ICSE48619.2023.00109. Online publication date: 14-May-2023
  • (2023) Understanding Why and Predicting When Developers Adhere to Code-Quality Standards. Proceedings of the 45th International Conference on Software Engineering: Software Engineering in Practice (432-444). DOI: 10.1109/ICSE-SEIP58684.2023.00045. Online publication date: 17-May-2023
  • (2023) Proofster: Automated Formal Verification. Proceedings of the 45th International Conference on Software Engineering: Companion Proceedings (26-30). DOI: 10.1109/ICSE-Companion58688.2023.00018. Online publication date: 14-May-2023
  • (2023) A decade of code comment quality assessment. Journal of Systems and Software 195:C. DOI: 10.1016/j.jss.2022.111515. Online publication date: 1-Jan-2023
  • (2022) The promise and perils of using machine learning when engineering software (keynote paper). Proceedings of the 6th International Workshop on Machine Learning Techniques for Software Quality Evaluation (1-4). DOI: 10.1145/3549034.3570200. Online publication date: 7-Nov-2022
  • (2022) DocTer: documentation-guided fuzzing for testing deep learning API functions. Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (176-188). DOI: 10.1145/3533767.3534220. Online publication date: 18-Jul-2022
  • (2022) Nalin. Proceedings of the 44th International Conference on Software Engineering (1469-1481). DOI: 10.1145/3510003.3510144. Online publication date: 21-May-2022