
Syntax-aware on-the-fly code completion

Published: 01 January 2024

Abstract

Context:

Code completion aims to improve developers’ productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information into model training, ensuring that code completion is aware of the syntax of the programming language. However, existing syntax-aware code completion approaches are not on-the-fly: we found that for two-thirds of the characters that developers type, an AST cannot be extracted, because AST extraction requires syntactically correct source code, which limits their practicality in real-world scenarios. On the other hand, existing on-the-fly code completion approaches do not yet consider syntactic information.
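As a concrete illustration of this gap (ours, not code from the paper), compare Python's standard ast and tokenize modules on a statement the developer is still typing: AST extraction fails, while token types are still produced in the natural order of the source code.

    import ast
    import io
    import tokenize

    partial_code = "x = 1 +"   # an expression the developer has not finished typing

    # AST extraction requires syntactically correct source code, so it fails here.
    try:
        ast.parse(partial_code)
    except SyntaxError as error:
        print("AST extraction failed:", error)

    # Token types, in contrast, are still available for the partial input,
    # in the same order in which the code is written.
    for tok in tokenize.generate_tokens(io.StringIO(partial_code).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))
    # Prints NAME 'x', OP '=', NUMBER '1', OP '+', followed by NEWLINE and ENDMARKER.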

Objective:

In this paper, we propose PyCoder, which leverages token types, a lightweight form of syntactic information that is readily available and aligns with the natural order of source code.

Method:

Our PyCoder is trained in a multi-task manner: by learning the supporting task of predicting token types during the training phase, the model achieves better performance on predicting tokens and lines of code without needing token types in the inference phase.
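A minimal sketch of this multi-task idea in PyTorch is shown below. It assumes a shared GPT-2 backbone with one head for the main token-prediction task and one for the auxiliary token-type task; the class name, the fixed loss weight alpha, and the head layout are our illustrative choices, not the paper's exact architecture.

    import torch
    import torch.nn as nn
    from transformers import GPT2Config, GPT2Model

    class MultiTaskCompletionModel(nn.Module):
        """Shared backbone, two heads: next code token (main) and its token type (auxiliary)."""

        def __init__(self, vocab_size, num_token_types, alpha=0.5):
            super().__init__()
            config = GPT2Config(vocab_size=vocab_size)
            self.backbone = GPT2Model(config)                            # shared representation
            self.token_head = nn.Linear(config.n_embd, vocab_size)       # main task
            self.type_head = nn.Linear(config.n_embd, num_token_types)   # auxiliary task
            self.alpha = alpha                                           # weight of the auxiliary loss
            self.loss_fn = nn.CrossEntropyLoss()

        def forward(self, input_ids, token_labels=None, type_labels=None):
            hidden = self.backbone(input_ids).last_hidden_state
            token_logits = self.token_head(hidden)
            type_logits = self.type_head(hidden)
            loss = None
            if token_labels is not None and type_labels is not None:
                # Shift so that position t predicts position t + 1.
                token_loss = self.loss_fn(
                    token_logits[:, :-1].reshape(-1, token_logits.size(-1)),
                    token_labels[:, 1:].reshape(-1))
                type_loss = self.loss_fn(
                    type_logits[:, :-1].reshape(-1, type_logits.size(-1)),
                    type_labels[:, 1:].reshape(-1))
                loss = token_loss + self.alpha * type_loss               # joint multi-task objective
            return token_logits, loss

At inference time only token_logits are used, so tokens and lines of code can be completed without token-type annotations, which matches the on-the-fly setting described above.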

Results:

Comprehensive experiments show that PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%–24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%–84.73% more accurate than baselines.

Conclusions:

These results lead us to conclude that token type information (an alternative to syntactic information), which has rarely been used in the past, can greatly improve the performance of code completion approaches without requiring syntactically correct source code as AST-based approaches do. Our PyCoder is publicly available on HuggingFace and GitHub.


Cited By

  • (2024) On the Reliability and Explainability of Language Models for Program Generation. ACM Transactions on Software Engineering and Methodology 33 (5), 1–26. https://doi.org/10.1145/3641540. Online publication date: 3-Jun-2024.

    Published In

    Information and Software Technology, Volume 165, Issue C, January 2024, 193 pages

    Publisher

    Butterworth-Heinemann, United States

    Author Tags

    1. Code completion
    2. Multi-task learning

    Qualifiers

    • Research-article
