DOI: 10.1145/2635868.2635875
research-article

On the localness of software

Published: 11 November 2014

Abstract

The n-gram language model, which has its roots in statistical natural language processing, has been shown to successfully capture the repetitive and predictable regularities (“naturalness”) of source code, and to help with tasks such as code suggestion, porting, and designing assistive coding devices. However, we show in this paper that this natural-language-based model fails to exploit a special property of source code: localness. We find that human-written programs are localized: they have useful local regularities that can be captured and exploited. We introduce a novel cache language model that consists of both an n-gram component and an added “cache” component to exploit localness. We show empirically that the additional cache component greatly improves the n-gram approach by capturing the localness of software, as measured by both cross-entropy and suggestion accuracy. Our model’s suggestion accuracy is comparable to that of a state-of-the-art, semantically augmented language model, but the model is simpler and easier to implement. Our cache language model requires nothing beyond lexicalization, and thus is applicable to all programming languages.
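To make the idea of combining an n-gram component with a cache component concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify how the two components are mixed or how the cache is scoped, so the fixed interpolation weight `cache_weight`, the add-one smoothing, the choice of trigrams, and the names `NGramModel`, `observe`, and `cross_entropy` are all assumptions made for illustration. The cache here is simply a second n-gram table that is filled from the tokens of the file being scored, one token at a time.

```python
import math
from collections import Counter


class NGramModel:
    """Add-one-smoothed n-gram counts (hypothetical helper; order must be >= 2)."""

    def __init__(self, order, vocab_size):
        self.order = order
        self.vocab_size = vocab_size
        self.context_counts = Counter()
        self.ngram_counts = Counter()

    def observe(self, context, token):
        """Record one n-gram; `context` holds the preceding order-1 tokens."""
        context = tuple(context)
        self.context_counts[context] += 1
        self.ngram_counts[context + (token,)] += 1

    def train(self, tokens):
        n = self.order
        for i in range(n - 1, len(tokens)):
            self.observe(tokens[i - n + 1:i], tokens[i])

    def prob(self, context, token):
        """P(token | context) with add-one smoothing (an assumed choice)."""
        context = tuple(context)
        seen = self.ngram_counts[context + (token,)]
        return (seen + 1) / (self.context_counts[context] + self.vocab_size)


def cross_entropy(tokens, global_model, cache_weight, order=3):
    """Average bits per token under a fixed-weight mix of global and cache models.

    Each token is scored first and only then added to the cache, so the cache
    never sees the token it is asked to predict. cache_weight=0.0 disables it.
    """
    cache = NGramModel(order, global_model.vocab_size)
    total_bits, count = 0.0, 0
    for i in range(order - 1, len(tokens)):
        context, token = tokens[i - order + 1:i], tokens[i]
        p = ((1 - cache_weight) * global_model.prob(context, token)
             + cache_weight * cache.prob(context, token))
        total_bits -= math.log2(p)
        count += 1
        cache.observe(context, token)
    return total_bits / count


# Toy data: a tiny "corpus" and a file that repeats a local pattern.
corpus_tokens = "int i = 0 ; i = i + 1 ; return i ;".split()
file_tokens = ("total = total + 1 ; " * 3).split()
vocab_size = len(set(corpus_tokens) | set(file_tokens))

global_model = NGramModel(order=3, vocab_size=vocab_size)
global_model.train(corpus_tokens)

ngram_only = cross_entropy(file_tokens, global_model, cache_weight=0.0)
with_cache = cross_entropy(file_tokens, global_model, cache_weight=0.5)
print(f"n-gram only: {ngram_only:.3f} bits/token")
print(f"with cache:  {with_cache:.3f} bits/token")
```

On this toy input the cached variant yields a lower cross-entropy, because the identifier `total` never appears in the corpus but recurs locally. The numbers say nothing about the magnitude of the effect reported in the paper.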



Published In

FSE 2014: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2014
856 pages
ISBN: 9781450330565
DOI: 10.1145/2635868
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cache Language Model
  2. Code Suggestion
  3. Localness

Qualifiers

  • Research-article

Conference

SIGSOFT/FSE'14

Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%

