
Syntax-aware on-the-fly code completion

Published: 01 January 2024

Abstract

Context:

Code completion aims to improve developers’ productivity by suggesting the next code tokens from a given context. Various approaches have been proposed to incorporate abstract syntax tree (AST) information into model training, ensuring that code completion is aware of the syntax of the programming language. However, existing syntax-aware code completion approaches are not on-the-fly: we found that for two-thirds of the characters that developers type, an AST cannot be extracted, because AST extraction requires syntactically correct source code, which limits their practicality in real-world scenarios. On the other hand, existing on-the-fly code completion approaches do not yet consider syntactic information.
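As a concrete illustration of this gap (ours, not code from the paper), compare Python's standard ast and tokenize modules on a statement the developer is still typing: AST extraction fails, while token types are still produced in the natural order of the source code.

    import ast
    import io
    import tokenize

    partial_code = "x = 1 +"   # an expression the developer has not finished typing

    # AST extraction requires syntactically correct source code, so it fails here.
    try:
        ast.parse(partial_code)
    except SyntaxError as error:
        print("AST extraction failed:", error)

    # Token types, in contrast, are still available for the partial input,
    # in the same order in which the code is written.
    for tok in tokenize.generate_tokens(io.StringIO(partial_code).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))
    # Prints NAME 'x', OP '=', NUMBER '1', OP '+', followed by NEWLINE and ENDMARKER.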

Objective:

In this paper, we propose PyCoder, which leverages token types, a lightweight form of syntactic information that is readily available and aligns with the natural order of source code.

Method:

Our PyCoder is trained in a multi-task manner: by learning the supporting task of predicting token types during the training phase, the model achieves better performance on predicting tokens and lines of code without needing token types in the inference phase.
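A minimal sketch of this multi-task idea in PyTorch is shown below. It assumes a shared GPT-2 backbone with one head for the main token-prediction task and one for the auxiliary token-type task; the class name, the fixed loss weight alpha, and the head layout are our illustrative choices, not the paper's exact architecture.

    import torch
    import torch.nn as nn
    from transformers import GPT2Config, GPT2Model

    class MultiTaskCompletionModel(nn.Module):
        """Shared backbone, two heads: next code token (main) and its token type (auxiliary)."""

        def __init__(self, vocab_size, num_token_types, alpha=0.5):
            super().__init__()
            config = GPT2Config(vocab_size=vocab_size)
            self.backbone = GPT2Model(config)                            # shared representation
            self.token_head = nn.Linear(config.n_embd, vocab_size)       # main task
            self.type_head = nn.Linear(config.n_embd, num_token_types)   # auxiliary task
            self.alpha = alpha                                           # weight of the auxiliary loss
            self.loss_fn = nn.CrossEntropyLoss()

        def forward(self, input_ids, token_labels=None, type_labels=None):
            hidden = self.backbone(input_ids).last_hidden_state
            token_logits = self.token_head(hidden)
            type_logits = self.type_head(hidden)
            loss = None
            if token_labels is not None and type_labels is not None:
                # Shift so that position t predicts position t + 1.
                token_loss = self.loss_fn(
                    token_logits[:, :-1].reshape(-1, token_logits.size(-1)),
                    token_labels[:, 1:].reshape(-1))
                type_loss = self.loss_fn(
                    type_logits[:, :-1].reshape(-1, type_logits.size(-1)),
                    type_labels[:, 1:].reshape(-1))
                loss = token_loss + self.alpha * type_loss               # joint multi-task objective
            return token_logits, loss

At inference time only token_logits are used, so tokens and lines of code can be completed without token-type annotations, which matches the on-the-fly setting described above.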

Results:

Comprehensive experiments show that PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions, which is 0.43%–24.25% more accurate than baselines. In addition, PyCoder achieves an exact match of 43.37% for the line-level predictions, which is 3.63%–84.73% more accurate than baselines.

Conclusions:

These results lead us to conclude that token type information (an alternative to syntactic information), which has rarely been used in the past, can greatly improve the performance of code completion approaches without requiring syntactically correct source code as AST-based approaches do. Our PyCoder is publicly available on HuggingFace and GitHub.


Cited By

  • (2024) On the Reliability and Explainability of Language Models for Program Generation. ACM Transactions on Software Engineering and Methodology 33 (5), 1–26. https://doi.org/10.1145/3641540. Online publication date: 3-Jun-2024.

    Published In

    Information and Software Technology, Volume 165, Issue C, January 2024, 193 pages

    Publisher

    Butterworth-Heinemann, United States

    Author Tags

    1. Code completion
    2. Multi-task learning

    Qualifiers

    • Research-article
