DOI: 10.1145/3379597.3387481
research-article

Polyglot and Distributed Software Repository Mining with Crossflow

Published: 18 September 2020

Abstract

Mining software repositories at a large scale typically requires substantial computational and storage resources. This creates a growing need for repository mining programs to be executed in a distributed manner, so that remote collaborators can contribute local computational and storage resources. In this paper we present Crossflow, a novel framework for building polyglot distributed repository mining programs. We demonstrate how Crossflow delegates mining jobs to remote workers and caches their results, how such workers can implement advanced behavior such as load balancing and rejecting jobs they either cannot perform or would execute sub-optimally, and how workers of the same analysis program can be written in different programming languages such as Java and Python, each executing only the relevant parts of the program described in that language.
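
The delegation, caching, and job-rejection behavior described in the abstract can be illustrated with a minimal in-process sketch. This is not Crossflow's API: the `Worker` class, the `run` function, and the convention that a job is a file name whose extension decides which worker can handle it are all hypothetical, chosen only to show the idea of workers declining jobs they cannot perform and of results being reused from a cache.

```python
from typing import Callable, Dict, List, Optional


class Worker:
    """Hypothetical worker that accepts only the jobs it can handle."""

    def __init__(self, name: str, can_handle: Callable[[str], bool]):
        self.name = name
        self.can_handle = can_handle
        self.executed: List[str] = []  # jobs this worker actually ran

    def offer(self, job: str) -> Optional[str]:
        # Returning None signals that the worker rejects the job.
        if not self.can_handle(job):
            return None
        self.executed.append(job)
        return f"{self.name} analysed {job}"


def run(jobs: List[str], workers: List[Worker], cache: Dict[str, str]) -> Dict[str, str]:
    """Give each job to the first worker that accepts it, reusing cached results."""
    results: Dict[str, str] = {}
    for job in jobs:
        if job in cache:
            # Cached result: no worker is asked to redo the job.
            results[job] = cache[job]
            continue
        for worker in workers:
            result = worker.offer(job)
            if result is not None:
                cache[job] = result
                results[job] = result
                break
        else:
            raise RuntimeError(f"no worker accepted job {job!r}")
    return results


# Assumption for illustration: the file extension stands in for the
# language a worker is implemented in.
java_worker = Worker("java-worker", lambda job: job.endswith(".java"))
python_worker = Worker("python-worker", lambda job: job.endswith(".py"))
cache: Dict[str, str] = {}
results = run(["A.java", "b.py", "A.java"], [java_worker, python_worker], cache)
```

In this sketch the second occurrence of `A.java` is served from the cache, so `java_worker` executes it only once, and `b.py` is rejected by `java_worker` before `python_worker` accepts it. A real distributed setup would replace the in-process loop with a message broker and remote workers.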

Published In

MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories
June 2020
675 pages
ISBN:9781450375177
DOI:10.1145/3379597

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Mining software repositories
  2. domain-specific modeling language
  3. ease of use
  4. lower barrier to entry
  5. scalable

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • EPSRC
  • Horizon 2020 Research and Innovation Programme
