Abstract
Researchers have shown that the program analyses behind software development and maintenance tools for search, traceability, and other tasks can benefit from the natural language information found in identifiers and comments. Extracting that information accurately depends on correctly splitting identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, the conventions are not well defined in certain situations and may be modified to improve readability, which makes automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle for evaluating identifier splitting algorithms. In addition to comparing current approaches, the results help guide the future development and evaluation of improved identifier splitting approaches.
Notes
Annotator experience ranged from second-year students to practicing professionals with almost fifty years of experience. The average experience was 13.1 years, the median was 7.0 years, and the standard deviation was 12.8 years.
Information concerning all of these splitters as well as how each split the identifiers in the oracle can be found in the replication package at www.cs.loyola.edu/~lawrie/id-splitting-data.
Acknowledgements
Special thanks to all the participants as this work would not be possible without your time and also to Chris Morrell for help with the statistics. Support for this work was provided by NSF grant CCF 0916081.
Additional information
Communicated by: Giulio Antoniol
Appendix: Instructions to Annotators
The following rather minimal instructions were given to the annotators when they were asked to provide the oracle splits of the identifiers:
What: Please split some program identifiers into atomic units by adding spaces. We consider atomic units to be individual words or abbreviations. Some splits are easily recognized from artifacts in the identifier. Those splits will be automatically inserted. Here are some examples:
“theblueHouse” → “the blue House”
“FDARequirement” → “FDA Requirement”
“unparse_voidptr” → “unparse void ptr”
Some are easy. Some are hard. So let us know when you guess.
Purpose: We are developing algorithms to automatically determine the most likely splits of program identifiers. An automatic identifier splitter is the first important step in a variety of automatic analysis of software natural language. Your splitting decisions will help to guide and evaluate our research on automatic identifier splitting. The split collection of identifiers will be made publicly available.
Disclaimer: Your identity will not be revealed.
Thanks for helping us out!
Dave, Dawn, Emily, Lori, and Vijay
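The instructions above mention that splits recognizable from artifacts in the identifier were inserted automatically before the annotators saw them. The exact rules are not given here, so the following is only a minimal sketch under assumed rules: split at underscores, at digit runs, and at case changes, with a run of capitals followed by a capitalized word treated as an acronym. The function name `split_on_artifacts` and the regular expression are hypothetical and are not taken from the authors' tooling.

```python
import re

# Hypothetical rules: one plausible reading of "artifact" splits,
# not the authors' actual tool.
_TOKEN = re.compile(
    r"[A-Z]+(?=[A-Z][a-z])"  # capitals before a capitalized word (FDA|Requirement)
    r"|[A-Z]?[a-z]+"         # optional capital followed by lowercase letters
    r"|[A-Z]+"               # any remaining run of capitals
    r"|[0-9]+"               # run of digits
)


def split_on_artifacts(identifier: str) -> list[str]:
    """Split only where the identifier itself marks a boundary."""
    # Underscores and other separators match no alternative,
    # so findall simply skips over them.
    return _TOKEN.findall(identifier)


for name in ["theblueHouse", "FDARequirement", "unparse_voidptr"]:
    print(name, "->", " ".join(split_on_artifacts(name)))
```

Under these assumed rules, "theblue" and "voidptr" stay whole. Those same-case boundaries carry no visible artifact, so they are the splits left to the annotator, or to a dictionary-based or frequency-based splitter.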
Cite this article
Hill, E., Binkley, D., Lawrie, D. et al. An empirical study of identifier splitting techniques. Empir Software Eng 19, 1754–1780 (2014). https://doi.org/10.1007/s10664-013-9261-0
DOI: https://doi.org/10.1007/s10664-013-9261-0