Open access

SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles

Authors:

David Schindler,

Felix Bensmann,

Frank Krüger

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

Pages 4574 - 4583

https://doi.org/10.1145/3459637.3482017

Published: 30 October 2021

Abstract

Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci (Software Mentions in Science), a gold standard knowledge graph of software mentions in scientific articles. It contains high-quality annotations (IRR: κ = .82) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL, or citations. Moreover, we distinguish between different types of software, such as application, plugin, or programming environment, as well as different types of mentions, such as usage or creation. To the best of our knowledge, SoMeSci is the most comprehensive corpus about software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. Finally, we sketch potential use cases and provide baseline results.
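To make the annotation layers named in the abstract more concrete, the following is a minimal sketch of how a single annotated mention could be held in memory. It is not the SoMeSci schema or its RDF vocabulary: the class names, field names, and the article identifier are hypothetical, and the label sets contain only the examples the abstract lists, so the actual annotation scheme may define more categories.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: label sets restricted to the examples given in the abstract.
SOFTWARE_TYPES = {"application", "plugin", "programming environment"}
MENTION_TYPES = {"usage", "creation"}
RELATION_LABELS = {"version", "developer", "url", "citation"}


@dataclass
class Relation:
    """Additional information attached to a software mention."""
    label: str  # one of RELATION_LABELS
    value: str  # text of the related span

    def __post_init__(self) -> None:
        if self.label not in RELATION_LABELS:
            raise ValueError(f"unknown relation label: {self.label!r}")


@dataclass
class SoftwareMention:
    """One annotated software mention (hypothetical in-memory form)."""
    article_id: str     # hypothetical identifier of the source article
    name: str           # surface form of the software name
    software_type: str  # one of SOFTWARE_TYPES
    mention_type: str   # one of MENTION_TYPES
    relations: List[Relation] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.software_type not in SOFTWARE_TYPES:
            raise ValueError(f"unknown software type: {self.software_type!r}")
        if self.mention_type not in MENTION_TYPES:
            raise ValueError(f"unknown mention type: {self.mention_type!r}")

    def get(self, label: str) -> Optional[str]:
        """Return the first value stored under `label`, or None if absent."""
        for relation in self.relations:
            if relation.label == label:
                return relation.value
        return None


# Illustrative values only; not taken from the corpus.
mention = SoftwareMention(
    article_id="PMC-EXAMPLE",
    name="SPSS",
    software_type="application",
    mention_type="usage",
    relations=[Relation("version", "22"), Relation("developer", "IBM")],
)
print(mention.get("version"))  # 22
```

Keeping the software type and the mention type as separate fields mirrors the distinction the abstract draws between what kind of software is mentioned and how it is mentioned.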

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. World Wide Web
    1. Web data description languages
      1. Semantic web description languages
        Resource Description Framework (RDF)

Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

October 2021, 4966 pages

ISBN: 9781450384469

DOI: 10.1145/3459637

General Chairs: Gianluca Demartini (The University of Queensland, Australia), Guido Zuccon (The University of Queensland, Australia)

Program Chairs: J. Shane Culpepper (RMIT University, Australia), Zi Huang (The University of Queensland, Australia), Hanghang Tong (University of Illinois at Urbana-Champaign, USA)

Copyright © 2021 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Deutsche Forschungsgemeinschaft

Conference

CIKM '21

CIKM '21: The 30th ACM International Conference on Information and Knowledge Management

November 1 - 5, 2021

Queensland, Virtual Event, Australia

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
