DOI: 10.1145/2786805.2786859
research-article

Query-based configuration of text retrieval solutions for software engineering tasks

Published: 30 August 2015

Abstract

Text Retrieval (TR) approaches have been used to leverage the textual information contained in software artifacts to address a multitude of software engineering (SE) tasks. However, TR approaches need to be configured properly in order to lead to good results. Current approaches for automatic TR configuration in SE configure a single TR approach and then use it for all possible queries. In this paper, we show that such a configuration strategy leads to suboptimal results, and propose QUEST, the first approach bringing TR configuration selection to the query level. QUEST recommends the best TR configuration for a given query, based on a supervised learning approach that determines the TR configuration that performs the best for each query according to its properties. We evaluated QUEST in the context of feature and bug localization, using a data set with more than 1,000 queries. We found that QUEST is able to recommend one of the top three TR configurations for a query with 69% accuracy on average. We compared the results obtained with the configurations recommended by QUEST for every query with those obtained using a single TR configuration for all queries in a system and in the entire data set. We found that QUEST yields better results than any of the considered TR configurations.
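
The per-query selection idea described above can be sketched as follows. This is a minimal illustration, not QUEST itself: the abstract does not name the query properties or the learning algorithm, so the two features (number of terms and average inverse document frequency), the configuration labels, the training examples, and the 1-nearest-neighbor classifier are all assumptions made for the example. The second function computes a top-k accuracy of the kind the abstract reports, i.e., how often the recommended configuration is among the k best ones for a query.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical training data: (query properties, best-performing configuration).
# Properties are [number of terms, average inverse document frequency].
# Features, values, and configuration labels are assumptions for illustration.
TRAINING: List[Tuple[List[float], str]] = [
    ([3.0, 4.1], "BM25"),
    ([4.0, 3.8], "BM25"),
    ([25.0, 1.2], "LSI"),
    ([30.0, 1.0], "LSI"),
    ([12.0, 2.5], "VSM"),
]


def recommend(query_props: Sequence[float],
              training: Sequence[Tuple[Sequence[float], str]] = TRAINING) -> str:
    """Return the configuration of the nearest training query (1-NN)."""
    nearest = min(training, key=lambda ex: math.dist(ex[0], query_props))
    return nearest[1]


def top_k_accuracy(recommended: Sequence[str],
                   ranked_configs: Sequence[Sequence[str]],
                   k: int = 3) -> float:
    """Fraction of queries whose recommended configuration is among the
    k best configurations for that query (each ranking is best first)."""
    hits = sum(rec in ranking[:k]
               for rec, ranking in zip(recommended, ranked_configs))
    return hits / len(recommended)


if __name__ == "__main__":
    # A short, specific query and a long, generic one (made-up property values).
    print(recommend([5.0, 4.0]))
    print(recommend([28.0, 1.1]))
```

In this sketch the features are used unscaled, which lets the term count dominate the distance; a practical version would likely normalize the properties and use a stronger supervised learner.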

Published In

ESEC/FSE 2015: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering
August 2015
1068 pages
ISBN: 9781450336758
DOI: 10.1145/2786805
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Configuration
  2. Feature and Bug Localization
  3. Text-Retrieval in Software Engineering

Qualifiers

  • Research-article

Conference

ESEC/FSE'15

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Article Metrics

  • Downloads (Last 12 months): 10
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 17 Oct 2024

Cited By

  • (2024) Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization. Proceedings of the Third ACM/IEEE International Workshop on NL-based Software Engineering, pages 1–8. DOI: 10.1145/3643787.3648028. Online publication date: 20-Apr-2024
  • (2024) RLocator: Reinforcement Learning for Bug Localization. IEEE Transactions on Software Engineering, 50(10):2695–2708. DOI: 10.1109/TSE.2024.3452595. Online publication date: Oct-2024
  • (2023) A Systematic Review of Automated Query Reformulations in Source Code Search. ACM Transactions on Software Engineering and Methodology, 32(6):1–79. DOI: 10.1145/3607179. Online publication date: 4-Jul-2023
  • (2023) Intelligent Software Maintenance. Optimising the Software Development Process with Artificial Intelligence, pages 241–275. DOI: 10.1007/978-981-19-9948-2_9. Online publication date: 20-Jul-2023
  • (2022) An Aspects Framework for Component-Based Requirements Prediction and Regression Testing. Sustainability, 14(21):14563. DOI: 10.3390/su142114563. Online publication date: 5-Nov-2022
  • (2022) Predictive Models in Software Engineering: Challenges and Opportunities. ACM Transactions on Software Engineering and Methodology, 31(3):1–72. DOI: 10.1145/3503509. Online publication date: 9-Apr-2022
  • (2022) Tracking Buggy Files: New Efficient Adaptive Bug Localization Algorithm. IEEE Transactions on Software Engineering, 48(7):2557–2569. DOI: 10.1109/TSE.2021.3064447. Online publication date: 1-Jul-2022
  • (2022) The Effect of Feature Characteristics on the Performance of Feature Location Techniques. IEEE Transactions on Software Engineering, 48(6):2066–2085. DOI: 10.1109/TSE.2021.3049735. Online publication date: 1-Jun-2022
  • (2022) DigBug—Pre/post-processing operator selection for accurate bug localization. Journal of Systems and Software, 189:111300. DOI: 10.1016/j.jss.2022.111300. Online publication date: Jul-2022
  • (2021) BoostNSift: A Query Boosting and Code Sifting Technique for Method Level Bug Localization. 2021 IEEE 21st International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 81–91. DOI: 10.1109/SCAM52516.2021.00019. Online publication date: Sep-2021
