DOI: 10.5555/2820518.2820559

Toward deep learning software repositories

Published: 16 May 2015

Abstract

Deep learning subsumes algorithms that automatically learn compositional representations. The ability of these models to generalize well has ushered in tremendous advances in many fields such as natural language processing (NLP). Recent research in the software engineering (SE) community has demonstrated the usefulness of applying NLP techniques to software corpora. Hence, we motivate deep learning for software language modeling, highlighting fundamental differences between state-of-the-practice software language models and connectionist models. Our deep learning models are applicable to source code files (since they only require lexically analyzed source code written in any programming language) and other types of artifacts. We show how a particular deep learning model can remember its state to effectively model sequential data, e.g., streaming software tokens, and the state is shown to be much more expressive than discrete tokens in a prefix. Then we instantiate deep learning models and show that deep learning induces high-quality models compared to n-grams and cache-based n-grams on a corpus of Java projects. We experiment with two of the models' hyperparameters, which govern their capacity and the amount of context they use to inform predictions, before building several committees of software language models to aid generalization. Then we apply the deep learning models to code suggestion and demonstrate their effectiveness at a real SE task compared to state-of-the-practice models. Finally, we propose avenues for future work, where deep learning can be brought to bear to support model-based testing, improve software lexicons, and conceptualize software artifacts. Thus, our work serves as the first step toward deep learning software repositories.
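
To make the modeling idea concrete, here is a minimal sketch (not the authors' implementation or experimental setup) of a recurrent language model over a stream of lexically analyzed code tokens, queried for next-token suggestions the way a code-suggestion engine would query it. PyTorch, the toy Java-like token stream, and the class name TokenLM are assumptions introduced purely for illustration.

import torch
import torch.nn as nn

# Toy stream of lexed Java-like tokens (invented placeholder corpus).
tokens = ("public static void main ( String [ ] args ) { "
          "System . out . println ( STR ) ; }").split()
vocab = {t: i for i, t in enumerate(sorted(set(tokens)))}
ids = torch.tensor([vocab[t] for t in tokens])

class TokenLM(nn.Module):
    """Recurrent language model: embedding -> LSTM -> linear layer producing next-token logits."""
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        h, state = self.rnn(self.emb(x), state)  # the hidden state carries the prefix context
        return self.out(h), state

model = TokenLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Teacher forcing on the single toy sequence: predict token t+1 from tokens 0..t.
x, y = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)
for _ in range(200):
    optimizer.zero_grad()
    logits, _ = model(x)
    loss = loss_fn(logits.reshape(-1, len(vocab)), y.reshape(-1))
    loss.backward()
    optimizer.step()

# Code suggestion: rank candidate next tokens after a prefix of lexed tokens.
prefix = torch.tensor([[vocab["System"], vocab["."]]])
logits, _ = model(prefix)
probs = torch.softmax(logits[0, -1], dim=-1)
inv = {i: t for t, i in vocab.items()}
top = torch.topk(probs, k=3)
print([(inv[i.item()], round(p.item(), 3)) for p, i in zip(top.values, top.indices)])

Because the recurrent hidden state summarizes the whole prefix, such a model can in principle condition on context well beyond the fixed-order window of an n-gram model, which is the property the abstract contrasts with discrete tokens in a prefix.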

Published In

MSR '15: Proceedings of the 12th Working Conference on Mining Software Repositories
May 2015
542 pages
ISBN: 9780769555942

Publisher

IEEE Press

Author Tags

  1. deep learning
  2. machine learning
  3. n-grams
  4. neural networks
  5. software language models
  6. software repositories

Qualifiers

  • Research-article

Conference

ICSE '15

Cited By

  • (2023) Optimized Tokenization Process for Open-Vocabulary Code Completion: An Empirical Study. Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering, pp. 398-405. DOI: 10.1145/3593434.3594236. Online publication date: 14-Jun-2023.
  • (2023) Automated WebAssembly Function Purpose Identification With Semantics-Aware Analysis. Proceedings of the ACM Web Conference 2023, pp. 2885-2894. DOI: 10.1145/3543507.3583235. Online publication date: 30-Apr-2023.
  • (2023) On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot. Proceedings of the 45th International Conference on Software Engineering, pp. 2149-2160. DOI: 10.1109/ICSE48619.2023.00181. Online publication date: 14-May-2023.
  • (2022) Path-sensitive code embedding via contrastive learning for software vulnerability detection. Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 519-531. DOI: 10.1145/3533767.3534371. Online publication date: 18-Jul-2022.
  • (2022) To what extent do deep learning-based code recommenders generate predictions by cloning code from the training set? Proceedings of the 19th International Conference on Mining Software Repositories, pp. 167-178. DOI: 10.1145/3524842.3528440. Online publication date: 23-May-2022.
  • (2022) A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research. ACM Transactions on Software Engineering and Methodology 31(2), pp. 1-58. DOI: 10.1145/3485275. Online publication date: 4-Mar-2022.
  • (2021) Understanding neural code intelligence through program simplification. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 441-452. DOI: 10.1145/3468264.3468539. Online publication date: 20-Aug-2021.
  • (2021) DeepWukong. ACM Transactions on Software Engineering and Methodology 30(3), pp. 1-33. DOI: 10.1145/3436877. Online publication date: 23-Apr-2021.
  • (2021) Software engineering meets deep learning. Proceedings of the 36th Annual ACM Symposium on Applied Computing, pp. 1542-1549. DOI: 10.1145/3412841.3442029. Online publication date: 22-Mar-2021.
  • (2021) CURE. Proceedings of the 43rd International Conference on Software Engineering, pp. 1161-1173. DOI: 10.1109/ICSE43902.2021.00107. Online publication date: 22-May-2021.
