Abstract
Flexible metadata pipelines are crucial for supporting the FAIR data principles. Despite this need, researchers seldom report how they identify metadata standards and protocols that support optimal flexibility. This paper reports on an initiative to develop a flexible metadata pipeline for a collection of over 300,000 digital fish specimen images, harvested from multiple data repositories and fish collections. The images and their associated metadata are being used for AI-related scientific research involving automated species identification, segmentation, and trait extraction. The paper provides contextual background, then presents a four-phase approach: 1. Assessment of the Problem, 2. Investigation of Solutions, 3. Implementation, and 4. Refinement. The work is part of the NSF Harnessing the Data Revolution, Biology Guided Neural Networks (NSF/HDR-BGNN) project and the HDR Imageomics Institute. An RDF graph prototype pipeline is presented, followed by a discussion of research implications and a conclusion summarizing the results.
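To make the idea of an RDF graph over harvested image metadata concrete, the following is a minimal sketch, not the project's actual pipeline. It assumes a harvested record arrives as a flat dictionary, maps a few keys to real Dublin Core and Darwin Core term IRIs, and serializes the result as N-Triples using only the Python standard library. The specimen IRI, the record keys, the field mapping, and all function names are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

# Namespace IRIs of two published vocabularies.
DCTERMS = "http://purl.org/dc/terms/"
DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical mapping from harvested record keys to vocabulary terms.
# The keys on the left are assumptions; the terms on the right exist.
FIELD_TO_PREDICATE: Dict[str, str] = {
    "source": DCTERMS + "source",
    "license": DCTERMS + "license",
    "scientific_name": DWC + "scientificName",
    "catalog_number": DWC + "catalogNumber",
}

Triple = Tuple[str, str, str]


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Encode a plain string literal, escaping characters N-Triples reserves."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def record_to_triples(subject_iri: str, record: Dict[str, str]) -> List[Triple]:
    """Turn one flat metadata record into triples; unmapped keys are skipped."""
    triples: List[Triple] = []
    for key in sorted(record):
        predicate = FIELD_TO_PREDICATE.get(key)
        if predicate is not None:
            triples.append((iri(subject_iri), iri(predicate), literal(record[key])))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples one per line, each terminated by ' .'."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


if __name__ == "__main__":
    # Hypothetical record; the values are illustrative, not project data.
    example = {
        "scientific_name": "Notropis atherinoides",
        "catalog_number": "EXAMPLE-0001",
        "local_note": "not mapped, so not emitted",
    }
    print(to_ntriples(record_to_triples("https://example.org/specimen/1", example)), end="")
```

Keeping the field-to-term mapping in a single table is one way to approach the flexibility the abstract calls for: under this sketch's assumptions, supporting another repository's field names or another metadata standard means editing the table rather than the serialization code.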
Supported by NSF HDR-OAC: Biology-guided Neural Networks for Discovering Phenotypic Traits: 1940233 and 1940322, NSF HDR-OAC: Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning: 2118240, and the Institute of Museum and Library Services (IMLS) RE-246450-OLS-20.
Acknowledgments
We thank the Integrated Digitized Biocollections (iDigBio), Global Biodiversity Information Facility (GBIF) and MorphBank data repositories, and the curators of the fish collections in the Great Lakes Invasives Network (Field Museum of Natural History, Illinois Natural History Survey, J. F. Bell Museum of Natural History, Ohio State University Museum of Biological Diversity, University of Michigan Museum of Zoology, and University of Wisconsin-Madison Zoological Museum) for sharing images of their fish specimens with us. We also thank Anuj Karpatne and team at Virginia Tech University, who developed and trained the fish feature segmentation ANN component of the workflow; Joel Pepper, for the automated image quality feature extraction workflow; and Bahadir Altintas, for developing the automated landmark extraction workflow.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jebbia, D., Wang, X., Bakis, Y., Bart Jr., H.L., Greenberg, J. (2023). Toward a Flexible Metadata Pipeline for Fish Specimen Images. In: Garoufallou, E., Vlachidis, A. (eds) Metadata and Semantic Research. MTSR 2022. Communications in Computer and Information Science, vol 1789. Springer, Cham. https://doi.org/10.1007/978-3-031-39141-5_15
DOI: https://doi.org/10.1007/978-3-031-39141-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39140-8
Online ISBN: 978-3-031-39141-5
eBook Packages: Computer Science, Computer Science (R0)