An empirical study of identifier splitting techniques

Hill, Emily; Binkley, David; Lawrie, Dawn; Pollock, Lori; Vijay-Shanker, K.

doi:10.1007/s10664-013-9261-0

Published: 08 August 2013

Volume 19, pages 1754–1780 (2014)

Empirical Software Engineering

Abstract

Researchers have shown that program analyses that drive software development and maintenance tools supporting search, traceability and other tasks can benefit from leveraging the natural language information found in identifiers and comments. Accurate natural language information depends on correctly splitting the identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, conventions are not well-defined in certain situations and may be modified to improve readability, thus making automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle to evaluate identifier splitting algorithms. In addition to comparing current approaches, the results help to guide future development and evaluation of improved identifier splitting approaches.

Notes

Annotator experience ranged from second-year students to practicing professionals with almost fifty years of experience. Mean experience was 13.1 years, with a median of 7.0 years and a standard deviation of 12.8 years.
Information concerning all of these splitters as well as how each split the identifiers in the oracle can be found in the replication package at www.cs.loyola.edu/~lawrie/id-splitting-data.

Acknowledgements

Special thanks to all the participants as this work would not be possible without your time and also to Chris Morrell for help with the statistics. Support for this work was provided by NSF grant CCF 0916081.

Author information

Authors and Affiliations

Department of Computer Science, Montclair State University, Montclair, NJ, 07043, USA
Emily Hill
Department of Computer Science, Loyola University Maryland, Baltimore, MD, 21210, USA
David Binkley & Dawn Lawrie
Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19716, USA
Lori Pollock & K. Vijay-Shanker

Corresponding author

Correspondence to Emily Hill.

Additional information

Communicated by: Giulio Antoniol

Appendix: Instructions to Annotators

The following rather minimal instructions were given to the annotators who were asked to provide the oracle splits of the identifiers:

What: Please split some program identifiers into atomic units by adding spaces. We consider atomic units to be individual words or abbreviations. Some splits are easily recognized from artifacts in the identifier. Those splits will be automatically inserted. Here are some examples:

“theblueHouse” → “the blue House”

“FDARequirement” → “FDA Requirement”

“unparse_voidptr” → “unparse void ptr”

Some are easy. Some are hard. So let us know when you guess.

Purpose: We are developing algorithms to automatically determine the most likely splits of program identifiers. An automatic identifier splitter is the first important step in a variety of automatic analysis of software natural language. Your splitting decisions will help to guide and evaluate our research on automatic identifier splitting. The split collection of identifiers will be made publicly available.

Disclaimer: Your identity will not be revealed.

Thanks for helping us out!

Dave, Dawn, Emily, Lori, and Vijay
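
The instructions above distinguish splits that are "easily recognized from artifacts in the identifier" from those left to the annotator. A minimal sketch of that first kind of split is given below. The rule set (separators, lower-to-upper transitions, and the end of an acronym run) and the function name `conservative_split` are assumptions made for illustration; they are not the preprocessing used in the study or any of the splitters it evaluates.

```python
import re
from typing import List

# Assumed boundary rules (illustrative only, not taken from the paper):
#   1. a lowercase letter or digit followed by an uppercase letter:
#      "blueHouse" -> "blue House"
#   2. the last capital of an acronym run, when followed by a lowercase
#      letter: "FDARequirement" -> "FDA Requirement"
_CASE_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

# Any run of non-alphanumeric characters (underscores, for example) is
# treated as an explicit separator.
_SEPARATOR = re.compile(r"[^A-Za-z0-9]+")


def conservative_split(identifier: str) -> List[str]:
    """Split only where casing or separators mark a boundary."""
    parts: List[str] = []
    for chunk in _SEPARATOR.split(identifier):
        if chunk:
            # Insert a space at each case boundary, then split on spaces.
            parts.extend(_CASE_BOUNDARY.sub(" ", chunk).split())
    return parts


for name in ("theblueHouse", "FDARequirement", "unparse_voidptr"):
    print(name, "->", " ".join(conservative_split(name)))
```

Under these assumed rules, "FDARequirement" is split completely, while "theblueHouse" and "unparse_voidptr" come out as "theblue House" and "unparse voidptr". The remaining same-case splits ("the blue", "void ptr") are the ones that need a human annotator or a more sophisticated splitting technique.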

About this article

Cite this article

Hill, E., Binkley, D., Lawrie, D. et al. An empirical study of identifier splitting techniques. Empir Software Eng 19, 1754–1780 (2014). https://doi.org/10.1007/s10664-013-9261-0

Published: 08 August 2013
Issue Date: December 2014
DOI: https://doi.org/10.1007/s10664-013-9261-0
