research-article

Query-based configuration of text retrieval solutions for software engineering tasks

Authors: Gabriele Bavota, Massimiliano Di Penta, Andrian Marcus

ESEC/FSE 2015: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering

Pages 567 - 578

https://doi.org/10.1145/2786805.2786859

Published: 30 August 2015

Abstract

Text Retrieval (TR) approaches have been used to leverage the textual information contained in software artifacts to address a multitude of software engineering (SE) tasks. However, TR approaches need to be configured properly in order to lead to good results. Current approaches for automatic TR configuration in SE configure a single TR approach and then use it for all possible queries. In this paper, we show that such a configuration strategy leads to suboptimal results, and propose QUEST, the first approach bringing TR configuration selection to the query level. QUEST recommends the best TR configuration for a given query, based on a supervised learning approach that determines the TR configuration that performs the best for each query according to its properties. We evaluated QUEST in the context of feature and bug localization, using a data set with more than 1,000 queries. We found that QUEST is able to recommend one of the top three TR configurations for a query with a 69% accuracy, on average. We compared the results obtained with the configurations recommended by QUEST for every query with those obtained using a single TR configuration for all queries in a system and in the entire data set. We found that using QUEST we obtain better results than with any of the considered TR configurations.



Index Terms

Query-based configuration of text retrieval solutions for software engineering tasks
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. Software management
        Software maintenance
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues



Published In


ESEC/FSE 2015: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering

August 2015

1068 pages

ISBN:9781450336758

DOI:10.1145/2786805

General Chair: Elisabetta Di Nitto (Politecnico di Milano, Italy)

Program Chairs: Mark Harman (University College London, UK), Patrick Heymans (University of Namur, Belgium)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States



Conference

ESEC/FSE'15


ESEC/FSE'15: Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering

August 30 - September 4, 2015

Bergamo, Italy

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Bibliometrics

Total Citations: 34
Total Downloads: 468
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 1

Reflects downloads up to 17 Oct 2024

Cited By

Chakraborty P, Arumugam V, Nagappan M, Izadi M, Di Sorbo A, Panichella S (2024). Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization. Proceedings of the Third ACM/IEEE International Workshop on NL-based Software Engineering, pages 1-8. DOI: 10.1145/3643787.3648028. Online publication date: 20-Apr-2024.
https://dl.acm.org/doi/10.1145/3643787.3648028

Chakraborty P, Alfadel M, Nagappan M (2024). RLocator: Reinforcement Learning for Bug Localization. IEEE Transactions on Software Engineering, 50:10, pages 2695-2708. DOI: 10.1109/TSE.2024.3452595. Online publication date: Oct-2024.
https://doi.org/10.1109/TSE.2024.3452595

Rahman M, Roy C (2023). A Systematic Review of Automated Query Reformulations in Source Code Search. ACM Transactions on Software Engineering and Methodology, 32:6, pages 1-79. DOI: 10.1145/3607179. Online publication date: 4-Jul-2023.
https://dl.acm.org/doi/10.1145/3607179

Khomh F, Masudur Rahman M, Barbez A (2023). Intelligent Software Maintenance. Optimising the Software Development Process with Artificial Intelligence, pages 241-275. DOI: 10.1007/978-981-19-9948-2_9. Online publication date: 20-Jul-2023.
https://doi.org/10.1007/978-981-19-9948-2_9

Ali S, Hafeez Y, Humayun M, Jhanjhi N, Ghoniem R (2022). An Aspects Framework for Component-Based Requirements Prediction and Regression Testing. Sustainability, 14:21, article 14563. DOI: 10.3390/su142114563. Online publication date: 5-Nov-2022.
https://doi.org/10.3390/su142114563

Yang Y, Xia X, Lo D, Bi T, Grundy J, Yang X (2022). Predictive Models in Software Engineering: Challenges and Opportunities. ACM Transactions on Software Engineering and Methodology, 31:3, pages 1-72. DOI: 10.1145/3503509. Online publication date: 9-Apr-2022.
https://dl.acm.org/doi/10.1145/3503509

Fejzer M, Narebski J, Przymus P, Stencel K (2022). Tracking Buggy Files: New Efficient Adaptive Bug Localization Algorithm. IEEE Transactions on Software Engineering, 48:7, pages 2557-2569. DOI: 10.1109/TSE.2021.3064447. Online publication date: 1-Jul-2022.
https://doi.org/10.1109/TSE.2021.3064447

Razzaq A, Ventresque A, Koschke R, De Lucia A, Buckley J (2022). The Effect of Feature Characteristics on the Performance of Feature Location Techniques. IEEE Transactions on Software Engineering, 48:6, pages 2066-2085. DOI: 10.1109/TSE.2021.3049735. Online publication date: 1-Jun-2022.
https://doi.org/10.1109/TSE.2021.3049735

Kim K, Ghatpande S, Liu K, Koyuncu A, Kim D, Bissyandé T, Klein J, Traon Y (2022). DigBug—Pre/post-processing operator selection for accurate bug localization. Journal of Systems and Software, 189, article 111300. DOI: 10.1016/j.jss.2022.111300. Online publication date: Jul-2022.
https://doi.org/10.1016/j.jss.2022.111300

Razzaq A, Buckley J, Patten J, Chochlov M, Sai A (2021). BoostNSift: A Query Boosting and Code Sifting Technique for Method Level Bug Localization. 2021 IEEE 21st International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 81-91. DOI: 10.1109/SCAM52516.2021.00019. Online publication date: Sep-2021.
https://doi.org/10.1109/SCAM52516.2021.00019
