DOI: 10.1145/3459637.3482017
research-article
Open access

SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles

Published: 30 October 2021

Abstract

Knowledge about the software used in scientific investigations is important for several reasons, for instance, to understand the provenance and methods involved in data handling. However, software is usually not formally cited but mentioned informally within the scholarly description of the investigation, which raises the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci (Software Mentions in Science), a gold standard knowledge graph of software mentions in scientific articles. It contains high-quality annotations (IRR: κ = .82) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL, or citations. Moreover, we distinguish between different types of software, such as application, plugin, or programming environment, as well as different types of mentions, such as usage or creation. To the best of our knowledge, SoMeSci is the most comprehensive corpus of software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. Finally, we sketch potential use cases and provide baseline results.
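To make the annotation layers named in the abstract concrete, the following is a minimal Python sketch of how a single annotated mention could be represented. It is an illustration only and not the SoMeSci schema: the class names, field names, the `relations` helper, and the example values ("ExampleTool", "Example Lab") are all hypothetical, and only the categories listed in the abstract are modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class SoftwareType(Enum):
    # Only the software types named in the abstract; the corpus may define more.
    APPLICATION = "application"
    PLUGIN = "plugin"
    PROGRAMMING_ENVIRONMENT = "programming environment"


class MentionType(Enum):
    # Only the mention types named in the abstract; the corpus may define more.
    USAGE = "usage"
    CREATION = "creation"


@dataclass
class SoftwareMention:
    """Hypothetical record for one software mention and its related information."""
    name: str
    software_type: SoftwareType
    mention_type: MentionType
    version: Optional[str] = None
    developer: Optional[str] = None
    url: Optional[str] = None
    citations: List[str] = field(default_factory=list)

    def relations(self) -> List[Tuple[str, str]]:
        """Return (relation label, value) pairs for every populated field."""
        pairs = [
            ("version", self.version),
            ("developer", self.developer),
            ("url", self.url),
        ]
        found = [(label, value) for label, value in pairs if value is not None]
        found.extend(("citation", c) for c in self.citations)
        return found


# Made-up example values, not taken from the corpus.
example = SoftwareMention(
    name="ExampleTool",
    software_type=SoftwareType.APPLICATION,
    mention_type=MentionType.USAGE,
    version="1.2",
    developer="Example Lab",
)
print(example.relations())  # [('version', '1.2'), ('developer', 'Example Lab')]
```

Keeping the mention itself separate from its optional relations mirrors the distinction the abstract draws between the plain mention and the additional relation labels; a real knowledge graph would express these as linked resources rather than flat fields.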

Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN:9781450384469
DOI:10.1145/3459637
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. entity disambiguation
  2. entity linking
  3. knowledge graph
  4. named entity recognition
  5. relation extraction
  6. software mention

Qualifiers

  • Research-article

Funding Sources

  • Deutsche Forschungsgemeinschaft

Conference

CIKM '21
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)271
  • Downloads (Last 6 weeks)31
Reflects downloads up to 15 Oct 2024

Cited By

  • (2024) A multilevel analysis of data quality for formal software citation. Quantitative Science Studies 5:3 (637-667). DOI: 10.1162/qss_a_00309. Online publication date: 1-Aug-2024
  • (2024) Estimating Metadata of Research Artifacts to Enhance their Findability. 2024 IEEE 20th International Conference on e-Science (e-Science) (1-2). DOI: 10.1109/e-Science62913.2024.10678684. Online publication date: 16-Sep-2024
  • (2024) Named Entity Recognition Datasets: A Classification Framework. International Journal of Computational Intelligence Systems 17:1. DOI: 10.1007/s44196-024-00456-1. Online publication date: 28-Mar-2024
  • (2024) RepoFromPaper: An Approach to Extract Software Code Implementations from Scientific Publications. Natural Scientific Language Processing and Research Knowledge Graphs (100-113). DOI: 10.1007/978-3-031-65794-8_7. Online publication date: 26-May-2024
  • (2024) Scientific Software Citation Intent Classification Using Large Language Models. Natural Scientific Language Processing and Research Knowledge Graphs (80-99). DOI: 10.1007/978-3-031-65794-8_6. Online publication date: 15-Aug-2024
  • (2024) Enhancing Software-Related Information Extraction via Single-Choice Question Answering with Large Language Models. Natural Scientific Language Processing and Research Knowledge Graphs (289-306). DOI: 10.1007/978-3-031-65794-8_21. Online publication date: 26-May-2024
  • (2024) Falcon 7b for Software Mention Detection in Scholarly Documents. Natural Scientific Language Processing and Research Knowledge Graphs (278-288). DOI: 10.1007/978-3-031-65794-8_20. Online publication date: 26-May-2024
  • (2024) ABCD Team at SOMD 2024: Software Mention Detection in Scholarly Publications with Large Language Models. Natural Scientific Language Processing and Research Knowledge Graphs (267-277). DOI: 10.1007/978-3-031-65794-8_19. Online publication date: 15-Aug-2024
  • (2024) Software Mention Recognition with a Three-Stage Framework Based on BERTology Models at SOMD 2024. Natural Scientific Language Processing and Research Knowledge Graphs (257-266). DOI: 10.1007/978-3-031-65794-8_18. Online publication date: 15-Aug-2024
  • (2024) SOMD@NSLP2024: Overview and Insights from the Software Mention Detection Shared Task. Natural Scientific Language Processing and Research Knowledge Graphs (247-256). DOI: 10.1007/978-3-031-65794-8_17. Online publication date: 26-May-2024
