research-article

Learning to generate pseudo-code from source code using statistical machine translation

Authors: Hiroyuki Fudaba, Sakriani Sakti, Satoshi Nakamura

ASE '15: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering

Pages 574 - 584

https://doi.org/10.1109/ASE.2015.36

Published: 09 November 2015

Abstract

Pseudo-code written in natural language can aid the comprehension of source code in unfamiliar programming languages. However, the great majority of source code has no corresponding pseudo-code, because pseudo-code is redundant and laborious to create. If pseudo-code could be generated automatically and instantly from given source code, it could be produced on demand without human effort. In this paper, we propose a method to automatically generate pseudo-code from source code, specifically adopting the statistical machine translation (SMT) framework. SMT, which was originally designed to translate between two natural languages, allows us to automatically learn the relationship between source code/pseudo-code pairs, making it possible to create a pseudo-code generator with less human effort. In experiments, we generated English or Japanese pseudo-code from Python statements using SMT, and found that the generated pseudo-code is largely accurate and aids code understanding.
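
To make the idea of learning from source code/pseudo-code pairs more concrete, the following is a minimal sketch of IBM Model 1, a classic word-level SMT model, trained with expectation-maximization on a toy parallel corpus. It is not the pipeline evaluated in the paper, and everything named here (the `CORPUS` data, `train_ibm1`, `translate`) is a hypothetical illustration. The token-by-token decoder is deliberately crude: it ignores word order and has no language model.

```python
from collections import defaultdict

# Hypothetical toy corpus: tokenized Python statements paired with
# tokenized English pseudo-code (illustrative only, not data from the paper).
CORPUS = [
    (["del", "x"], ["delete", "x"]),
    (["del", "y"], ["delete", "y"]),
    (["return", "y"], ["returns", "y"]),
]


def train_ibm1(corpus, iterations=10):
    """Estimate t[(word, token)] = P(pseudo-code word | code token) with EM."""
    vocab = {w for _, tgt in corpus for w in tgt}
    t = defaultdict(lambda: 1.0 / len(vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected (word, token) co-occurrence counts
        total = defaultdict(float)  # expected counts per code token
        for src, tgt in corpus:
            for w in tgt:
                # E-step: share each pseudo-code word among the code tokens
                # in the same pair, in proportion to the current estimates.
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


def translate(t, tokens):
    """Replace each code token with its most probable pseudo-code word."""
    out = []
    for s in tokens:
        candidates = {w: p for (w, tok), p in t.items() if tok == s}
        # Tokens never seen in training are passed through unchanged.
        out.append(max(candidates, key=candidates.get) if candidates else s)
    return out


table = train_ibm1(CORPUS)
print(translate(table, ["return", "x"]))  # ['returns', 'x']
```

Although `return` and `x` never appear together in the toy corpus, the model can still translate the pair, because each token's translation was learned from the other pairs it occurred in. This is the sense in which the relationship is learned from data rather than written by hand as rules.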

Learning to generate pseudo-code from source code using statistical machine translation
1. Computing methodologies
  1. Artificial intelligence
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Published In

ASE '15: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering

November 2015

935 pages

ISBN:9781509000241

General Chair: Myra Cohen (University of Nebraska-Lincoln)

Program Chairs: Lars Grunske (University of Stuttgart, Germany), Michael Whalen (University of Minnesota)

In-Cooperation

IEEE CS

Publisher

IEEE Press

Conference

ASE '15: ACM/IEEE International Conference on Automated Software Engineering

November 9 - 15, 2015

Lincoln, Nebraska

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%
