DOI: 10.1145/2597073.2597126

Lean GHTorrent: GitHub data on demand

Published: 31 May 2014

Abstract

In recent years, GitHub has become the largest code host in the world, with more than 5M developers collaborating across 10M repositories. Numerous popular open source projects (such as Ruby on Rails, Homebrew, Bootstrap, Django, and jQuery) have chosen GitHub as their host and have migrated their code bases to it. GitHub offers tremendous research potential. For instance, it is a flagship for current open source development, a place for developers to showcase their expertise to peers or potential recruiters, and the platform where social coding features or pull requests emerged. However, GitHub data is, to date, largely underexplored. To facilitate studies of GitHub, we have created GHTorrent, a scalable, queryable, offline mirror of the data offered through the GitHub REST API. In this paper we present a novel feature of GHTorrent designed to offer customisable data dumps on demand. The new GHTorrent data-on-demand service lets users request, via a web form, up-to-date GHTorrent data dumps for any collection of GitHub repositories. We hope that by offering customisable GHTorrent data dumps we will not only lower the "barrier for entry" even further for researchers interested in mining GitHub data (thus encouraging researchers to intensify their mining efforts), but also enhance the replicability of GitHub studies (since a snapshot of the data on which the results were obtained can now easily accompany each study).

Published In

MSR 2014: Proceedings of the 11th Working Conference on Mining Software Repositories
May 2014
427 pages
ISBN:9781450328630
DOI:10.1145/2597073
  • General Chair: Premkumar Devanbu
  • Program Chairs: Sung Kim, Martin Pinzger

Sponsors

In-Cooperation

  • TCSE: IEEE Computer Society's Tech. Council on Software Engin.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GitHub
  2. data on demand
  3. dataset

Qualifiers

  • Article

Conference

ICSE '14