Researchers have shown that program analyses that drive software development and maintenance tools supporting search, traceability and other tasks can benefit from leveraging the natural language information found in identifiers and comments. Accurate natural language information depends on correctly splitting the identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, conventions are not well-defined in certain situations and may be modified to improve readability, thus making automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle to evaluate identifier splitting algorithms. In addition to comparing current approaches, the results help to guide future development and evaluation of improved identifier splitting approaches.
Annotator experience ranged from second-year students to practicing professionals with almost fifty years of experience. The mean experience was 13.1 years, the median 7.0 years, and the standard deviation 12.8 years.
Information concerning all of these splitters as well as how each split the identifiers in the oracle can be found in the replication package at www.cs.loyola.edu/~lawrie/id-splitting-data.
Appendix: Instructions to Annotators
The following rather minimal instructions were given to the annotators when asking them to provide the oracle version of the split of the identifiers:
What: Please split some program identifiers into atomic units by adding spaces. We consider atomic units to be individual words or abbreviations. Some splits are easily recognized from artifacts in the identifier. Those splits will be automatically inserted. Here are some examples:
“theblueHouse” → “the blue House”
“FDARequirement” → “FDA Requirement”
“unparse_voidptr” → “unparse void ptr”
Some are easy. Some are hard. So let us know when you guess.
Purpose: We are developing algorithms to automatically determine the most likely splits of program identifiers. An automatic identifier splitter is the first important step in a variety of automatic analysis of software natural language. Your splitting decisions will help to guide and evaluate our research on automatic identifier splitting. The split collection of identifiers will be made publicly available.
Disclaimer: Your identity will not be revealed.
Thanks for helping us out!
Dave, Dawn, Emily, Lori, and Vijay
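The instructions above state that splits recognizable from artifacts in the identifier were inserted automatically. The following is a minimal sketch of what such a conservative pre-split might look like. The function name conservative_split and the exact boundary rules are assumptions made for illustration, not the procedure used in the study: the sketch splits on underscores and other non-alphanumeric characters, on lowercase-to-uppercase transitions, and before the last capital of an uppercase run that is followed by a lowercase letter.

```python
import re

# Assumed boundary rules (hypothetical, not taken from the study):
#   1. a lowercase letter or digit followed by an uppercase letter ("blueHouse")
#   2. an uppercase run followed by a capitalized word ("FDARequirement")
_CAMEL_BOUNDARY = re.compile(
    r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
)


def conservative_split(identifier):
    """Split an identifier only where an explicit artifact marks a boundary.

    Same-case runs such as "theblue" or "voidptr" are left intact, since
    splitting them requires a dictionary or a human annotator.
    """
    parts = []
    # Underscores and any other non-alphanumeric characters are hard boundaries.
    for chunk in re.split(r"[_\W]+", identifier):
        if chunk:
            # Mark each camel-case boundary with a space, then split on it.
            parts.extend(_CAMEL_BOUNDARY.sub(" ", chunk).split())
    return parts


if __name__ == "__main__":
    for name in ("theblueHouse", "FDARequirement", "unparse_voidptr"):
        print(name, "->", " ".join(conservative_split(name)))
```

Applied to the three examples from the instructions, this sketch yields "theblue House", "FDA Requirement", and "unparse voidptr". The remaining same-case splits ("the blue", "void ptr") are the ones left to the annotators and are the cases on which automatic splitters differ.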
