Abstract
A software process is a set of related activities that culminates in the production of a software package: specification, design, implementation, testing, evolution into new versions, and maintenance. There are also other supporting activities such as configuration and change management, quality assurance, project management, evaluation of user experience, etc. Software repositories are infrastructures to support all these activities. They can be composed with several systems that include code change management, bug tracking, code review, build system, release binaries, wikis, forums, etc. This position paper on mining software repositories presents a review and a discussion of research in this field over the past decade. We also identify applied machine learning strategies, current working topics, and future challenges for the improvement of company decision-making systems. Machine learning is defined as the process of discovering patterns in data. It can be applied to software repositories, since every change is recorded as data. Companies can then use these patterns as the basis for their decision-making systems and for knowledge discovery.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Ali, N., Guhneuc, Y.G., Antoniol, G.: Trustrace: mining software repositories to improve the accuracy of requirement traceability links. IEEE Trans. Softw. Eng. 39(5), 725–741 (2013). https://doi.org/10.1109/TSE.2012.71
Arnaoudova, V., Eshkevari, L., Penta, M., Oliveto, R., Antoniol, G., Guhneuc, Y.G.: Repent: analyzing the nature of identifier renamings. IEEE Trans. Softw. Eng. 40(5), 502–532 (2014). https://doi.org/10.1109/TSE.2014.2312942
Bavota, G., Linares-Vsquez, M., Bernal-Crdenas, C., Di Penta, M., Oliveto, R., Poshyvanyk, D.: The impact of api change- and fault-proneness on the user ratings of android apps. IEEE Trans. Softw. Eng. 41(4), 384–407 (2015). https://doi.org/10.1109/TSE.2014.2367027
Brown, W.H., Malveau, R.C., McCormick, H.W.S., Mowbray, T.J.: AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis, 1st edn. Wiley, New York, NY (1998)
Canfora, G., Cerulo, L., Cimitile, M., Di Penta, M.: How changes affect software entropy: an empirical study. Empir. Softw. Eng. 19(1), 1–38 (2014). https://doi.org/10.1007/s10664-012-9214-z
Chen, T.H., Thomas, S., Hassan, A.: A survey on the use of topic models when mining software repositories. Empir. Softw. Eng. 21(5), 1843–1919 (2016). https://doi.org/10.1007/s10664-015-9402-8
Chowdhury, S.A., Hindle, A.: Mining stackoverflow to filter out off-topic irc discussion. In: Proceedings of the 12th Working Conference on Mining Software Repositories, MSR ’15, pp. 422–425. IEEE Press, Piscataway, NJ, USA (2015). http://dl.acm.org/citation.cfm?id=2820518.2820577
Dagenais, B., Robillard, M.: Recommending adaptive changes for framework evolution. ACM Trans. Softw. Eng. Methodol. 20(4), 9 (2011). https://doi.org/10.1145/2000799.2000805
Destefanis, G., Ortu, M., Counsell, S., Swift, S., Marchesi, M., Tonelli, R.: Software development: do good manners matter? PeerJ Comput. Sci. (2016). https://doi.org/10.7717/peerj-cs.73
Dyer, R., Nguyen, H., Rajan, H., Nguyen, T.: Boa: Ultra-large-scale software repository and source-code mining. ACM Trans. Softw. Eng. Methodol. 25(1), 7 (2015). https://doi.org/10.1145/2803171
German, D.M.: A study of the contributors of postgresql. In: Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR ’06, pp. 163–164. ACM, New York, NY, USA (2006). https://doi.org/10.1145/1137983.1138022
Gonzalez-Barahona, J., Robles, G., Herraiz, I., Ortega, F.: Studying the laws of software evolution in a long-lived floss project. J. Softw. Evolut. Process 26(7), 589–612 (2014). https://doi.org/10.1002/smr.1615
Gonzlez-Barahona, J., Robles, G.: On the reproducibility of empirical software engineering studies based on data retrieved from development repositories. Empir. Softw. Eng. 17(1–2), 75–89 (2012). https://doi.org/10.1007/s10664-011-9181-9
Goyal, A., Sardana, N.: Nrfixer: Sentiment based model for predicting the fixability of non-reproducible bugs. E-Inf. Softw. Eng. J. 11(1), 103–116 (2017). https://doi.org/10.5277/e-Inf170105
Grant, S., Betts, B.: Encouraging user behaviour with achievements: an empirical study. In: Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pp. 65–68. IEEE Press, Piscataway, NJ, USA (2013). http://dl.acm.org/citation.cfm?id=2487085.2487101
Guana, V., Rocha, F., Hindle, A., Stroulia, E.: Do the stars align? Multidimensional analysis of android’s layered architecture. In: 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), pp. 124–127 (2012)
Guzman, E., Azócar, D., Li, Y.: Sentiment analysis of commit comments in GitHub: an empirical study. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pp. 352–355. ACM, New York, NY, USA (2014). https://doi.org/10.1145/2597073.2597118
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
Hammad, M., Hammad, M., Bani-Salameh, H.: Identifying designers and their design knowledge. Int. J. Softw. Eng. Its Appl. 7(6), 277–288 (2013). https://doi.org/10.14257/ijseia.2013.7.6.23
Han, J., Jung, W.: Extracting communication structure of a development organization from a software repository. Pers. Ubiquit. Comput. 18(6), 1413–1421 (2014). https://doi.org/10.1007/s00779-013-0742-3
Hassan, A., Holt, R.: Replaying development history to assess the effectiveness of change propagation tools. Empir. Softw. Eng. 11(3), 335–367 (2006). https://doi.org/10.1007/s10664-006-9006-4
Hindle, A.: Green mining: a methodology of relating software change and configuration to power consumption. Empir. Softw. Eng. 20(2), 374–409 (2015). https://doi.org/10.1007/s10664-013-9276-6
Holmes, R., Walker, R.J.: A newbie’s guide to eclipse APIs. In: Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany, May 10–11, 2008, Proceedings, pp. 149–152 (2008). https://doi.org/10.1145/1370750.1370787
Hora, A., Anquetil, N., Etien, A., Ducasse, S., Valente, M.: Automatic detection of system-specific conventions unknown to developers. J. Syst. Softw. 109, 192–204 (2015). https://doi.org/10.1016/j.jss.2015.08.007
Jacobson, I., Booch, G., Rumbaugh, J.: The Unified Software Development Process. Addison-Wesley Longman Publishing Co., Inc., Boston, MA (1999)
Kagdi, H., Gethers, M., Poshyvanyk, D., Hammad, M.: Assigning change requests to software developers. J. Softw. Evolut. Process 24(1), 3–33 (2012). https://doi.org/10.1002/smr.530
Kamei, Y., Shihab, E., Adams, B., Hassan, A., Mockus, A., Sinha, A., Ubayashi, N.: A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng. 39(6), 757–773 (2013). https://doi.org/10.1109/TSE.2012.70
Khomh, F., Penta, M., Guhneuc, Y.G., Antoniol, G.: An exploratory study of the impact of antipatterns on class change- and fault-proneness. Empir. Softw. Eng. 17(3), 243–275 (2012). https://doi.org/10.1007/s10664-011-9171-y
Kim, S., Shivaji, S., Whitehead Jr., E.: Kenyon-web: Reconfigurable web-based feature extractor. Vancouver, BC, pp. 287–288 (2009). https://doi.org/10.1109/ICPC.2009.5090061
Kirbas, S., Caglayan, B., Hall, T., Counsell, S., Bowes, D., Sen, A., Bener, A.: The relationship between evolutionary coupling and defects in large industrial software. J. Softw. Evolut. Process (2017). https://doi.org/10.1002/smr.1842
Krinke, J., Gold, N., Jia, Y., Binkley, D.: Cloning and copying between gnome projects. In: 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pp. 98–101 (2010)
Kumaresh, S., Baskaran, R.: Mining software repositories for defect categorization. J. Commun. Softw. Syst. 11(1), 31–36 (2015)
Lehman, M.M., Belady, L.A. (eds.): Program Evolution: Processes of Software Change, 1st edn. Academic Press Professional Inc, San Diego, CA (1985)
Li, H., Shang, W., Zou, Y., Hassan, A.E.: Towards just-in-time suggestions for log changes. Empir. Softw. Eng. 22(4), 1831–1865 (2017). https://doi.org/10.1007/s10664-016-9467-z
Linares-Vásquez, M., Vendome, C., Tufano, M., Poshyvanyk, D.: How developers micro-optimize android apps. J. Syst. Softw. 130, 1–23 (2017). https://doi.org/10.1016/j.jss.2017.04.018
Linstead, E., Rigor, P., Bajracharya, S., Lopes, C., Baldi, P.: Mining eclipse developer contributions via author-topic models. In: Fourth International Workshop on Mining Software Repositories (MSR’07:ICSE Workshops 2007), pp. 30–30 (2007)
López-Fernández, L., Robles, G., Gonzalez-Barahona, J., Herraiz, I.: Applying social network analysis techniques to community-driven libre software projects. Int. J. Inf. Technol. Web Eng. (IJITWE) 1(3), 27–48 (2006). https://doi.org/10.4018/jitwe.2006070103
Louridas, P., Ebert, C.: Machine learning. IEEE Softw. 33(5), 110–115 (2016). https://doi.org/10.1109/MS.2016.114
Munaiah, N., Camilo, F., Wigham, W., Meneely, A., Nagappan, M.: Do bugs foreshadow vulnerabilities? An in-depth study of the chromium project. Empir. Softw. Eng. 22(3), 1305–1347 (2017). https://doi.org/10.1007/s10664-016-9447-3
Munaiah, N., Kroh, S., Cabrey, C., Nagappan, M.: Curating github for engineered software projects. Empir. Softw. Eng. 22(6), 3219–3253 (2017). https://doi.org/10.1007/s10664-017-9512-6
Penta, M., Cerulo, L., Aversano, L.: The life and death of statically detected vulnerabilities: an empirical study. Inf. Softw. Technol. 51(10), 1469–1484 (2009). https://doi.org/10.1016/j.infsof.2009.04.013
Porter, M.: An algorithm for suffix stripping. Program 3, 130–137 (1980)
Prechelt, L., Pepper, A.: Why software repositories are not used for defect-insertion circumstance analysis more often: a case study. Inf. Softw. Technol. 56(10), 1377–1389 (2014). https://doi.org/10.1016/j.infsof.2014.05.001
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, Burlington (1993)
Rebouças, M., Santos, R.O., Pinto, G., Castor, F.: How does contributors’ involvement influence the build status of an open-source software project? In: Proceedings of the 14th International Conference on Mining Software Repositories, MSR ’17, pp. 475–478. IEEE Press, Piscataway, NJ, USA (2017). https://doi.org/10.1109/MSR.2017.32
Robles, G., Gonzalez-Barahona, J.: Contributor turnover in libre software projects. IFIP Int. Fed. Inf. Process. 203, 273–286 (2006). https://doi.org/10.1007/0-387-34226-5_28
Santos, E.A., Hindle, A.: Judging a commit by its cover: Correlating commit message entropy with build status on travis-ci. In: Proceedings of the 13th International Conference on Mining Software Repositories, MSR ’16, pp. 504–507. ACM, New York, NY, USA (2016). https://doi.org/10.1145/2901739.2903493
Schröter, A.: Msr challenge 2011: Eclipse, netbeans, firefox, and chrome. In: Proceedings of the 8th Working Conference on Mining Software Repositories, MSR ’11, pp. 227–229. ACM, New York, NY, USA (2011). https://doi.org/10.1145/1985441.1985478
Shihab, E., Jiang, Z.M., Hassan, A.E.: On the use of internet relay chat (IRC) meetings by developers of the gnome gtk+ project. In: 2009 6th IEEE International Working Conference on Mining Software Repositories, pp. 107–110 (2009)
Sun, X., Li, B., Duan, Y., Shi, W., Liu, X.: Mining software repositories for automatic interface recommendation. Sci. Program. (2016). https://doi.org/10.1155/2016/5475964
Sun, X., Li, B., Leung, H., Li, B., Li, Y.: Msr4sm: using topic models to effectively mining software repositories for software maintenance tasks. Inf. Softw. Technol. 66, 1–12 (2015). https://doi.org/10.1016/j.infsof.2015.05.003
Sun, Y., Wang, Q., Yang, Y.: Frlink: improving the recovery of missing issue-commit links by revisiting file relevance. Inf. Softw. Technol. 84, 33–47 (2017). https://doi.org/10.1016/j.infsof.2016.11.010
Tappolet, J., Kiefer, C., Bernstein, A.: Semantic web enabled software analysis. J. Web Semant. 8(2–3), 225–240 (2010). https://doi.org/10.1016/j.websem.2010.04.009
Teixeira, J., Robles, G., Gonzlez-Barahona, J.: Lessons learned from applying social network analysis on an industrial free/libre/open source software ecosystem. J. Internet Serv. Appl. 6(1), 14 (2015). https://doi.org/10.1186/s13174-015-0028-2
Thummalapenta, S., Cerulo, L., Aversano, L., Di Penta, M.: An empirical study on the maintenance of source code clones. Empir. Softw. Eng. 15(1), 1–34 (2010). https://doi.org/10.1007/s10664-009-9108-x
Vanya, A., Klusener, S., Premraj, R., Van Vliet, H.: Supporting software architects to improve their software system’s decomposition—lessons learned. J. Softw. Evolut. Process 25(3), 219–232 (2013). https://doi.org/10.1002/smr.574
Vendome, C., Bavota, G., Penta, M., Linares-Vsquez, M., German, D., Poshyvanyk, D.: License usage and changes: a large-scale study on github. Empir. Softw. Eng. 22(3), 1537–1577 (2017). https://doi.org/10.1007/s10664-016-9438-4
Voinea, L., Telea, A.: Visual querying and analysis of large software repositories. Empir. Softw. Eng. 14(3), 316–340 (2009). https://doi.org/10.1007/s10664-008-9068-6
Xuan, J., Jiang, H., Hu, Y., Ren, Z., Zou, W., Luo, Z., Wu, X.: Towards effective bug triage with software data reduction techniques. IEEE Trans. Knowl. Data Eng. 27(1), 264–280 (2015). https://doi.org/10.1109/TKDE.2014.2324590
Yamashita, K., Kamei, Y., McIntosh, S., Hassan, A., Ubayashi, N.: Magnet or sticky? Measuring project characteristics from the perspective of developer attraction and retention. J. Inf. Process. 24(2), 339–348 (2016). https://doi.org/10.2197/ipsjjip.24.339
Yuan, Z., Yu, L.L., Liu, C.: Bug prediction method for fine-grained source code changes. Ruan Jian Xue Bao/J. Softw. 25(11), 2499–2517 (2014). https://doi.org/10.13328/j.cnki.jos.004559
Zamani, S., Lee, S., Shokripour, R., Anvik, J.: A feature location approach supported by time-aware weighting of terms associated with developer expertise profiles. Knowl. Inf. Syst. 49(2), 629–659 (2016). https://doi.org/10.1007/s10115-015-0909-5
Zhou, M., Mockus, A.: Who will stay in the floss community? Modeling participant’s initial behavior. IEEE Trans. Software Eng. 41(1), 82–99 (2015). https://doi.org/10.1109/TSE.2014.2349496
Acknowledgements
We would like to thank the Ministerio de Economía y Competitividad of the Spanish Government for financing the project TIN2015-67534-P (MINECO/FEDER, UE) and the Junta de Castilla y León for financing the project BU085P17 (JCyL/FEDER, UE) both cofinanced from European Union FEDER funds.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Güemes-Peña, D., López-Nozal, C., Marticorena-Sánchez, R. et al. Emerging topics in mining software repositories. Prog Artif Intell 7, 237–247 (2018). https://doi.org/10.1007/s13748-018-0147-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13748-018-0147-7