On the localness of software

Authors:

Premkumar Devanbu

FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering

Pages 269 - 280

https://doi.org/10.1145/2635868.2635875

Published: 11 November 2014

Abstract

The n-gram language model, which has its roots in statistical natural language processing, has been shown to successfully capture the repetitive and predictable regularities (“naturalness”) of source code, and help with tasks such as code suggestion, porting, and designing assistive coding devices. However, we show in this paper that this natural-language-based model fails to exploit a special property of source code: localness. We find that human-written programs are localized: they have useful local regularities that can be captured and exploited. We introduce a novel cache language model that consists of both an n-gram and an added “cache” component to exploit localness. We show empirically that the additional cache component greatly improves the n-gram approach by capturing the localness of software, as measured by both cross-entropy and suggestion accuracy. Our model’s suggestion accuracy is actually comparable to a state-of-the-art, semantically augmented language model; but it is simpler and easier to implement. Our cache language model requires nothing beyond lexicalization, and thus is applicable to all programming languages.
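
The idea of pairing an n-gram model with a cache can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not state the n-gram order, the smoothing method, the cache scope, or how the two components are weighted. The sketch assumes a bigram model with add-one smoothing, a cache of the most recent bigrams, and a fixed interpolation weight. The names `CachedBigramModel`, `cache_size`, and `cache_weight` are hypothetical.

```python
from collections import Counter, deque
import math


class CachedBigramModel:
    """Toy bigram model interpolated with a cache of recently seen bigrams."""

    def __init__(self, cache_size=50, cache_weight=0.5):
        self.corpus_counts = {}                # prev token -> Counter of next tokens
        self.vocab = set()
        self.cache = deque(maxlen=cache_size)  # most recent (prev, next) pairs
        self.cache_weight = cache_weight       # fixed weight: an assumption

    def train(self, tokens):
        """Collect bigram counts from a training token sequence."""
        self.vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            self.corpus_counts.setdefault(prev, Counter())[nxt] += 1

    def ngram_prob(self, prev, nxt):
        """Add-one smoothed bigram probability; unseen tokens share one extra slot."""
        counts = self.corpus_counts.get(prev, Counter())
        return (counts[nxt] + 1) / (sum(counts.values()) + len(self.vocab) + 1)

    def cache_prob(self, prev, nxt):
        """Relative frequency of nxt among cached bigrams starting with prev."""
        followers = [n for p, n in self.cache if p == prev]
        if not followers:
            return None  # the cache has nothing to say about this context
        return followers.count(nxt) / len(followers)

    def prob(self, prev, nxt):
        """Interpolate cache and n-gram estimates; fall back to n-gram alone."""
        p_ngram = self.ngram_prob(prev, nxt)
        p_cache = self.cache_prob(prev, nxt)
        if p_cache is None:
            return p_ngram
        return self.cache_weight * p_cache + (1 - self.cache_weight) * p_ngram

    def observe(self, prev, nxt):
        """Record a bigram from the code currently being processed."""
        self.cache.append((prev, nxt))

    def cross_entropy(self, tokens, use_cache=True):
        """Average negative log2 probability per predicted token."""
        self.cache.clear()
        total = 0.0
        for prev, nxt in zip(tokens, tokens[1:]):
            p = self.prob(prev, nxt) if use_cache else self.ngram_prob(prev, nxt)
            total -= math.log2(p)
            if use_cache:
                self.observe(prev, nxt)  # update the cache after predicting
        return total / (len(tokens) - 1)


# Usage: train on one snippet, then score a snippet with locally repeated tokens.
model = CachedBigramModel()
model.train("for ( int i = 0 ; i < n ; i ++ ) { sum += i ; }".split())
local_code = "node . next = node . next . next ; node . next = null ;".split()
print("n-gram only:", model.cross_entropy(local_code, use_cache=False))
print("with cache: ", model.cross_entropy(local_code, use_cache=True))
```

In this toy setting the scored snippet repeats identifiers such as `node` and `next` that never occur in the training text, so the cache is the only component that can learn them. Lower cross-entropy means the model finds the token sequence less surprising.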

Index Terms

On the localness of software
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. Software management
        Software maintenance
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues

Published In

FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering

November 2014

856 pages

ISBN:9781450330565

DOI:10.1145/2635868

General Chair:
Shing-Chi Cheung
Hong Kong University of Science and Technology, China
,
Program Chairs:
Alessandro Orso
Georgia Institute of Technology, USA
,
Margaret-Anne Storey
University of Victoria, Canada

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGSOFT/FSE'14

Sponsor:

SIGSOFT

SIGSOFT/FSE'14: 22nd ACM SIGSOFT Symposium on the Foundations of Software Engineering

November 16 - 21, 2014

Hong Kong, China

Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%
