GLOBAL BIODIVERSITY INFORMATION FACILITY WWW.GBIF.ORG Publishing EIA Biodiversity Data: Technology and Infrastructure Vishwas Chavan, Nick King and Francois Rogers Global Biodiversity Information Facility [email_address] Scoping Workshop on Developing an EIA Biodiversity Data Publishing Framework in South Africa 2-4 March 2010, Cape Town, South Africa
Contents: EIA Biodiversity Data: Types and formats; Data Capture & Digitisation tools; Data Discovery; Data Publishing; Data Quality & fitness-for-use; Data Hosting Centers; Community Building Platforms
What are the challenges? More data types; Richer user interface; Better management; Richer content; Better synchronisation; Improved discovery
EIA BIODIVERSITY DATA:  TYPES AND FORMATS
EIA Biodiversity data are very diverse. Categories: Evidence; Metadata; Taxon names; Taxon concepts; Observation; Literature; Species banks. Examples shown: Indices, Nomenclators, Namebanks; Biology, Conservation, Ecology, Distribution, Phylogenies, ...; Geolocation, Country, Collector, Date, …; Voucher specimen, Blood sample, DNA Barcode, Image, Audio, Video, ...; BHL, Plazi.org, ...
DATA CAPTURE AND DIGITISATION TOOLS
Data Capture and Digitisation Tools: Florin; Pandora; Taxis; Cassia; FieldNote; Mandala; ATTA; BirdRecorder
uBio Tools: Name recognition tool (FindIT); Author abbreviation resolver; Checking classification (TSN name mapper); Deconstruct scientific name (ParseIT); Find scientific name (CrawlIT); etc… http://www.ubio.org
GBIF Templates: Capture data in DwC compatible format (Occurrence Data Template; Names Data Template); Facilitate authoring ‘resource metadata’; Occurrence template; Documentation for occurrence template
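As a rough illustration of what "DwC compatible format" means for an occurrence record, the sketch below writes one record to CSV using Darwin Core term names as column headers. It is not the GBIF Occurrence Data Template itself: the column subset, the `to_dwc_csv` helper and the sample record are assumptions made for this example.

```python
import csv
import io

# A small, illustrative subset of Darwin Core terms used as column headers.
DWC_COLUMNS = [
    "occurrenceID",
    "basisOfRecord",
    "scientificName",
    "eventDate",
    "countryCode",
    "decimalLatitude",
    "decimalLongitude",
    "recordedBy",
]

# Hypothetical EIA field observation; every value is invented for illustration.
records = [
    {
        "occurrenceID": "eia-demo-0001",
        "basisOfRecord": "HumanObservation",
        "scientificName": "Protea cynaroides",
        "eventDate": "2010-03-02",
        "countryCode": "ZA",
        "decimalLatitude": "-33.96",
        "decimalLongitude": "18.41",
        "recordedBy": "Example Collector",
    }
]


def to_dwc_csv(rows):
    """Serialise rows to CSV text with Darwin Core term names as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_dwc_csv(records)
print(csv_text)
```

Keeping the headers identical to the Darwin Core term names means a later mapping step has nothing to translate.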
GBIF Informatics Architecture: Improved access to Names, Metadata and Primary Biodiversity Data; Distributed GBIF informatics architecture; Faster and easier publishing of data
DATA DISCOVERY: GBRDS REGISTRY; METADATA CATALOGUE (GBRDS: Global Biodiversity Resources Discovery System)
DATA DISCOVERY: GBRDS REGISTRY
GBRDS, a Discovery System. Actors: Consumers; Data Publishers; Service Publishers; Others… Activities: Searching; Retrieving; Discovering; Registering
That links to resources… Who? Institutions, Collections…; What? Data, Services, GUID/LSID…; Where? Location, Access points…; When? Temporal Scope…; How? Formats, protocols, qualities. A distributed service… which resolves to information resources…
Global Biodiversity Resources Discovery System: Institutions/Collections; LSIDs/DOI/GUIDs; Standards; Protocols; Resources; Services/Applications; etc… GBRDS Registry Release: April 2010
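The register-then-discover idea behind a registry can be sketched in a few lines. This is only a toy in-memory model, not the GBRDS design or API: the `RegistryEntry` fields mirror the who/what/where/when/how questions from the slides, and the class names, method names and sample entry are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegistryEntry:
    """One registered resource, described by the who/what/where/when/how questions."""
    who: str    # institution or collection
    what: str   # data, service, or identifier (GUID/LSID)
    where: str  # location or access point
    when: str   # temporal scope
    how: str    # format or protocol


class Registry:
    """Toy in-memory registry: publishers register, consumers search."""

    def __init__(self):
        self._entries: List[RegistryEntry] = []

    def register(self, entry):
        self._entries.append(entry)

    def search(self, keyword):
        """Return entries where any field contains the keyword (case-insensitive)."""
        needle = keyword.lower()
        return [
            entry
            for entry in self._entries
            if any(needle in value.lower() for value in vars(entry).values())
        ]


registry = Registry()
registry.register(
    RegistryEntry(
        who="Example Herbarium",            # invented institution
        what="Plant occurrence dataset",
        where="http://example.org/ipt",     # placeholder access point
        when="1990-2010",
        how="DwC-Archive",
    )
)
hits = registry.search("dwc")
print([entry.who for entry in hits])
```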
DATA DISCOVERY:  METADATA CATALOGUES
Two perspectives on metadata. User Perspective: Discover if data exists; Identify source, provenance; Make judgement about data quality and usability before getting it; Minimise costs involved in the search, retrieval, integration and use of the data. Data Producer Perspective: Document data with minimum effort; Assess the value of the data for others; Bridge the gap between data owners and users; Educate users about the characteristics of the data. Craglia: http://www.ec-gis.org/Workshops/6ec-gis/papers/craglia-metadata.doc
Two levels of metadata. Discovery Metadata: Discover if a resource exists; get information on ownership, location, and how to get further information. Full Metadata: Provides a full description of the resource, including data quality, data lineage, and full access and exploitation.
Metadata Standards: Natural Collections Descriptions (NCD); Ecological Metadata Language (EML); ISO 19115/19139; FGDC Biological Data Profile; Dublin Core; MRTG Multimedia Metadata Schema; IPT 1.1 Metadata Profile
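One way to picture the two levels is as a small record nested inside a larger one. The sketch below is only an illustration of that relationship; the class and field names are assumptions for this example and do not follow EML, NCD or any other metadata standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiscoveryMetadata:
    """Enough to learn that a resource exists and whom to ask about it."""
    title: str
    owner: str
    location: str  # where the resource is held or served from
    contact: str   # how to get further information


@dataclass
class FullMetadata(DiscoveryMetadata):
    """Discovery fields plus what a user needs to judge and exploit the data."""
    data_quality: str = ""
    lineage: List[str] = field(default_factory=list)
    access_terms: str = ""


def discovery_view(full):
    """Reduce a full record to the fields a catalogue would expose for discovery."""
    return DiscoveryMetadata(full.title, full.owner, full.location, full.contact)


# Hypothetical dataset description; all values are invented.
full_record = FullMetadata(
    title="Wetland bird survey",
    owner="Example Consultancy",
    location="Cape Town, South Africa",
    contact="data@example.org",
    data_quality="Coordinates recorded by handheld GPS",
    lineage=["field sheets", "spreadsheet capture"],
    access_terms="On request",
)
summary = discovery_view(full_record)
print(summary)
```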
DATA PUBLISHING
Key Components: the IPT. The Integrated Publishing Toolkit is a state-of-the-art tool to simplify the mobilisation of biodiversity information resources such as Names, Metadata and primary biodiversity data. Data Publisher Registration (GBRDS) + Publishing of Names, Metadata, Primary biodiversity data etc…
Simple process! The Integrated Publishing Toolkit (IPT) is designed to simplify the mapping, indexing and harvesting of Names, Metadata and Primary Biodiversity Data!
GBIF Integrated Publishing Toolkit (IPT): Open source Java web application; Bypasses limitations of traditional wrapper tools in publishing large amounts of data by publishing whole datasets in DwC-Archive dumps (especially useful for small data publishers or those with little or no internet access); Has a richer environment than current wrapper tools, providing some data cleaning, visualisation capabilities, and the ability to publish dataset metadata; Documentation and download: http://code.google.com/p/gbif-providertoolkit/ Demo site: http://ipt.gbif.org
IPT Publishes Through… More to come… (* Darwin Core (Text-Archive) based on standard submitted to TDWG for review Feb 2009)
IPT Demo: Screencast of IPT demo; GBIF Help Desk (helpdesk@gbif.org); IPT 1.1 Release: April 2010
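To make "whole datasets in DwC-Archive dumps" more concrete, the sketch below assembles a minimal archive by hand: a tab-delimited data file plus a `meta.xml` descriptor, zipped together. It does not use or reproduce the IPT; the three-column layout, the file name `occurrence.txt`, the helper names and the sample row are assumptions, and a real archive would normally also carry dataset metadata.

```python
import io
import zipfile

DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"
COLUMNS = ["occurrenceID", "scientificName", "eventDate"]  # illustrative subset


def build_meta_xml(columns):
    """Describe the data file: delimiter, header row, row type, and column terms."""
    field_lines = [
        f'    <field index="{i}" term="{DWC_TERMS}{name}"/>'
        for i, name in enumerate(columns)
    ]
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<archive xmlns="http://rs.tdwg.org/dwc/text/">',
        '  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"',
        f'        ignoreHeaderLines="1" rowType="{DWC_TERMS}Occurrence">',
        "    <files><location>occurrence.txt</location></files>",
        '    <id index="0"/>',
        *field_lines,
        "  </core>",
        "</archive>",
    ]
    return "\n".join(lines) + "\n"


def build_archive(rows):
    """Return the bytes of a zip holding meta.xml and a tab-delimited data file."""
    data = "\t".join(COLUMNS) + "\n"
    data += "".join("\t".join(row[c] for c in COLUMNS) + "\n" for row in rows)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("meta.xml", build_meta_xml(COLUMNS))
        archive.writestr("occurrence.txt", data)
    return buffer.getvalue()


# Hypothetical record; values are invented for illustration.
archive_bytes = build_archive(
    [{"occurrenceID": "eia-demo-0001",
      "scientificName": "Protea cynaroides",
      "eventDate": "2010-03-02"}]
)
print(len(archive_bytes), "bytes")
```

Because the whole dataset travels as one file, it can be produced offline and handed over in a single transfer, which matches the slide's point about publishers with limited connectivity.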
NAMES DATA
Scope of the Global Names Architecture Referencing names in Checklists to a common Nomenclatural Index
Checklist Bank, a Name Services brokerage: Global broker of taxonomic data; Index of Taxonomic Catalogues and Annotated Checklists; Extends the GBIF network to support publishing Species-level data
Publishing Checklists to GBIF: Using Integrated Publishing Toolkit; Via pre-composed Spreadsheet templates; Exporting according to DwC Archive format and registering a local data file (self-serve); GBIF desktop publishing tool; Other taxonomic editors (EDIT/ITIS) that support DwC Archive format
Desktop Annotated Checklist Builder. Create, manage, publish: Synonymised checklists; Vernacular Names; Distribution data; Bibliography; Type/Specimen data. Mac OS/Windows. Publishes “GBIF-ready” format: DwC Archive (simple, extensible text-based format). Q3 2010
Controlled Vocabularies Server (vocabularies.gbif.org): ISO: Countries; ISO: Language; DwC: Basis of Record; DwC: Nomenclatural Status; DwC: Sex (Gender); DwC: Taxonomic Status; IUCN: Threat Status; … Vocabularies publishing platform: Internationalise all GBIF vocabularies
Controlled Vocabularies Server (vocabularies.gbif.org). Create, manage, publish Extensions to Darwin Core: Extend Occurrence Data; Extend Species Data. Tie to vocabularies that are also drafted and published to this system. Then translate to your native language.
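A controlled vocabulary with translated labels can be sketched as a lookup from any accepted label to one canonical term. The example below keeps the vocabulary in a local dictionary and does not call the vocabularies server; the three basis-of-record terms are a subset chosen for illustration, the French labels are unverified illustrative translations, and the helper names are hypothetical.

```python
# Canonical term -> labels per language (local, illustrative data only).
BASIS_OF_RECORD = {
    "PreservedSpecimen": {"en": "Preserved specimen", "fr": "Spécimen préservé"},
    "HumanObservation": {"en": "Human observation", "fr": "Observation humaine"},
    "MachineObservation": {"en": "Machine observation", "fr": "Observation par machine"},
}


def _key(text):
    """Compare labels ignoring case, spaces and punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def build_lookup(vocabulary):
    """Map every canonical term and every translated label to its canonical term."""
    lookup = {}
    for term, labels in vocabulary.items():
        lookup[_key(term)] = term
        for label in labels.values():
            lookup[_key(label)] = term
    return lookup


LOOKUP = build_lookup(BASIS_OF_RECORD)


def normalise(value):
    """Return the canonical term for a free-text value, or None if unrecognised."""
    return LOOKUP.get(_key(value))


def label(term, language):
    """Return the label for a term in the given language, falling back to the term."""
    return BASIS_OF_RECORD[term].get(language, term)


print(normalise("human observation"))
print(normalise("Observation humaine"))
print(normalise("sighting"))
```

Storing the canonical term and translating only at display time keeps records comparable across languages.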
DATA QUALITY &  FITNESS-FOR-USE
Fitness-for-use: Primary biodiversity data can be used for multiple purposes by various user communities worldwide. Assessing and enhancing fitness-for-use of data is therefore critical for the scientific and social relevance of biodiversity science. Fitness-for-use varies from one use case to another. Data quality assessment and quality control are important components of a ‘fitness-for-use’ regime.
Loss of Data Quality: At the time of collection; During digitisation; During documentation; During storage and archiving; During analysis and manipulation; During dissemination and presentation; Through the use to which they are put
Issues influencing data quality: Accuracy and precision; Completeness; Currency and Timeliness; Update frequency; Consistency; Flexibility; Transparency; Performance measures and targets; Data cleaning; Outliers; Setting targets for improvement; Truth in labelling; Error and bias; Uncertainty; Auditability; Edit Controls; Minimise duplication and reworking of data; Maintenance of original (or verbatim) data; Categorisation can lead to loss of data and quality; Documentation; Feedback; Education and Training; Accountability
Data quality, Responsible Players: Collectors; Custodian or Curator; Aggregator; Publisher; Users
Data Cleaning: definition & framework. Definition: a process used to determine inaccurate, incomplete, or unreasonable data and then improve the quality through correction of detected errors and omissions. General framework for data cleaning: (1) Define and determine error types; (2) Search and identify error instances; (3) Correct the errors; (4) Document error instances and error types; and (5) Modify data entry procedures to reduce future errors.
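The first four steps of the framework can be sketched on a handful of occurrence records. This is a minimal illustration under stated assumptions, not a GBIF tool: the three error types, the choice to auto-correct only stray whitespace, the `verbatim_name` column and the sample records are all invented for the example.

```python
from collections import Counter


def find_errors(record):
    """Steps 1-2: apply the defined error types to one record and list what is found."""
    errors = []
    if not (record.get("scientificName") or "").strip():
        errors.append("missing_scientific_name")
    try:
        lat = float(record.get("decimalLatitude"))
        lon = float(record.get("decimalLongitude"))
    except (TypeError, ValueError):
        errors.append("unparseable_coordinates")
    else:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append("coordinates_out_of_range")
    return errors


def clean(records):
    """Steps 3-4: correct what is safe to correct, flag the rest, and log everything."""
    cleaned, log = [], []
    for index, record in enumerate(records):
        fixed = dict(record)
        name = record.get("scientificName") or ""
        if name != name.strip():
            fixed["verbatim_name"] = name  # hypothetical column keeping the original value
            fixed["scientificName"] = name.strip()
            log.append((index, "untrimmed_scientific_name", "corrected"))
        for error in find_errors(fixed):
            log.append((index, error, "flagged"))
        cleaned.append(fixed)
    return cleaned, log


# Invented sample records, each with a different problem.
sample = [
    {"scientificName": " Protea cynaroides ", "decimalLatitude": "-33.96", "decimalLongitude": "18.41"},
    {"scientificName": "", "decimalLatitude": "-33.90", "decimalLongitude": "18.40"},
    {"scientificName": "Aloe ferox", "decimalLatitude": "133.0", "decimalLongitude": "18.40"},
]
cleaned, log = clean(sample)
summary = Counter(error_type for _, error_type, _ in log)
for entry in log:
    print(entry)
```

The per-type counts in `summary` are the kind of evidence step 5 needs: a frequent error type points at the data entry procedure to change.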
Tools and Best Practices http://mapstedi.colorado.edu/ http://manisnet.org/GeorefGuide.html
Tools and Best Practices GBIF Templates
Best Practice Guidelines All freely available
Best resource… Chapters on: Data Quality; Data Cleaning; Geo-referencing; Generalising sensitive data. http://www2.gbif.org/TM1.pdf
DATA HOSTING CENTERS
Data Hosting Centers: Cater to data publishers without skills & resources; Facilitate long term archival and publishing. GBIF Plans: Criteria for establishing DHC; Criteria for endorsement of DHC; Tools and Best Practices for DHC
COMMUNITY BUILDING PLATFORMS
http://community.gbif.org
? Email:  [email_address] Skype:  vishwaschavan


EIA Biodiversity Data Mobilisation


Editor's Notes

  1. to GBIF network, and for reuse by others as well
  2. Nick mentioned the key challenges in his presentation entitled “Why the need for a global infrastructure to discover, share, publish and use biodiversity data”.
  3. Nick mentioned the key challenges in his presentation entitled “Why the need for a global infrastructure to discover, share, publish and use biodiversity data”.