
Learning from Very Little Data: On the Value of Landscape Analysis for Predicting Software Project Health

Published: 14 March 2024

Abstract

When data is scarce, software analytics can make many mistakes. For example, consider learning predictors for open source project health (e.g., the number of closed pull requests in 12 months' time). The training data for this task may be very small (e.g., 5 years of data collected monthly yields just 60 rows of training data). The models generated from such tiny datasets can make many prediction errors.
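To make the setup concrete, here is a minimal sketch of such a prediction task. It assumes a hypothetical monthly project-history table and hypothetical column names; the learner shown is illustrative only and is not the one studied in the article.

```python
# Hypothetical example only: a tiny monthly project-health table (about 60 rows
# for 5 years of data) used to predict a health indicator 12 months ahead.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

monthly = pd.read_csv("project_health_monthly.csv")      # ~60 rows, one per month
features = ["commits", "closed_issues", "closed_prs", "contributors"]
X = monthly[features]
y = monthly["closed_prs_12_months_ahead"]                # target measured 12 months later

model = RandomForestRegressor(random_state=1).fit(X, y)  # illustrative learner
```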
Those errors can be tamed by a landscape analysis that selects better learner control parameters. Our niSNEAK tool (a) clusters the data to find the general landscape of the hyperparameters, then (b) explores a few representatives from each part of that landscape. niSNEAK is both faster and more effective than prior state-of-the-art hyperparameter optimization algorithms (e.g., FLASH, HYPEROPT, OPTUNA).
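The "cluster, then explore a few representatives" strategy can be sketched roughly as follows. This is only an approximation for illustration, assuming scikit-learn's KMeans and a user-supplied evaluate function; niSNEAK's actual clustering and search procedure differ.

```python
# Rough sketch of landscape analysis for hyperparameter optimization:
# (a) cluster candidate configurations, (b) evaluate one representative per cluster.
import numpy as np
from sklearn.cluster import KMeans

def landscape_search(candidates, evaluate, n_clusters=10):
    """candidates: (n, d) array of numerically encoded hyperparameter settings;
    evaluate: user-supplied function returning prediction error (lower is better)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit(candidates)
    best, best_err = None, float("inf")
    for c in range(n_clusters):
        members = candidates[km.labels_ == c]
        # pick the member closest to the cluster centre as that region's representative
        rep = members[np.argmin(np.linalg.norm(members - km.cluster_centers_[c], axis=1))]
        err = evaluate(rep)          # only a handful of (expensive) model evaluations
        if err < best_err:
            best, best_err = rep, err
    return best, best_err
```

The payoff of such a scheme is that the number of expensive model trainings is bounded by the number of landscape regions rather than by the size of the hyperparameter space.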
The configurations found by niSNEAK have far less error than those found by other methods. For example, for project health indicators such as C = number of commits, I = number of closed issues, and R = number of closed pull requests, niSNEAK's 12-month prediction errors are {I=0%, R=33%, C=47%}, whereas other methods have far larger errors of {I=61%, R=119%, C=149%}. We conjecture that niSNEAK works so well since it finds the most informative regions of the hyperparameters, then jumps to those regions. Other methods (that do not reflect over the landscape) can waste time exploring less informative options.
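For readers unfamiliar with percentage figures of this kind, the snippet below shows one common way such errors can be computed (magnitude of relative error); this is an assumption for illustration only, not a claim about the exact metric used in the article.

```python
# Illustration only (assumed metric): magnitude of relative error, in percent.
def pct_error(actual, predicted):
    return 100.0 * abs(predicted - actual) / actual

print(pct_error(actual=90, predicted=120))   # 33.3..., i.e., an "R=33%"-style error
```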
Based on the preceding, we recommend landscape analytics (e.g., niSNEAK) especially when learning from very small datasets. This article only explores the application of niSNEAK to project health. That said, we see nothing in principle that prevents the application of this technique to a wider range of problems.
To assist other researchers in repeating, improving, or even refuting our results, all our scripts and data are available on GitHub at https://github.com/zxcv123456qwe/niSneak.



      Published In

      ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 3
      March 2024
      943 pages
      EISSN: 1557-7392
      DOI: 10.1145/3613618
      Editor: Mauro Pezzé

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 14 March 2024
      Online AM: 09 November 2023
      Accepted: 09 October 2023
      Revised: 29 September 2023
      Received: 05 December 2022
      Published in TOSEM Volume 33, Issue 3


      Author Tags

      1. Hyperparameter tuning
      2. software health
      3. independent variable clustering

      Qualifiers

      • Research-article

      Funding Sources

      • NSF CCF award
