DOI: 10.1145/3416506.3423576
research-article
Open access

Statistical machine translation outperforms neural machine translation in software engineering: why and how

Published: 08 November 2020

Abstract

Neural Machine Translation (NMT) is the currently prevailing approach in Natural Language Processing (NLP) to the problem of automatically inferring target-language content from a source language. NMT's strength is its ability to learn deep knowledge within languages through deep learning. However, prior work shows that NMT has drawbacks both in NLP and in some research problems of Software Engineering (SE). In this work, we hypothesize that SE corpora have inherent characteristics that pose challenges for NMT compared to a state-of-the-art translation engine based on Statistical Machine Translation (SMT). We introduce a problem, called Prefix Mapping, that is significant in SE and whose characteristics challenge the ability of NMT to learn correct sequences. We implement and optimize the original SMT and NMT to mitigate those challenges. Our evaluation shows that SMT outperforms NMT on this research problem, which suggests potential directions for optimizing current NMT engines for specific classes of parallel corpora. By achieving accuracy from 65% to 90% for code token generation on a 1000 GitHub code corpus, we show the potential of using MT for code completion at the token level.

    Published In

    RL+SE&PL 2020: Proceedings of the 1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages
    November 2020
    40 pages
    ISBN:9781450381253
    DOI:10.1145/3416506
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Neural Machine Translation
    2. Statistical Machine Translation

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE '20

    Cited By

    • (2024) A survey on machine learning techniques applied to source code. Journal of Systems and Software 209:C. DOI: 10.1016/j.jss.2023.111934. Online publication date: 14-Mar-2024
    • (2022) Heterogeneous Graph Neural Networks for Software Effort Estimation. Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement, 103-113. DOI: 10.1145/3544902.3546248. Online publication date: 19-Sep-2022
    • (2022) Story point level classification by text level graph neural network. Proceedings of the 1st International Workshop on Natural Language-based Software Engineering, 75-78. DOI: 10.1145/3528588.3528654. Online publication date: 21-May-2022
    • (2022) Phrase2Set: Phrase-to-Set Machine Translation and Its Software Engineering Applications. 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 502-513. DOI: 10.1109/SANER53432.2022.00068. Online publication date: Mar-2022
    • (2022) Can we generate shellcodes via natural language? An empirical study. Automated Software Engineering 29:1. DOI: 10.1007/s10515-022-00331-3. Online publication date: 5-Mar-2022
