10 years of global biodiversity databases: are we there yet?

Tony Rees
Independent data consultant, Northern Rivers region, New South Wales, Australia
Previously: CSIRO Marine & Atmospheric Research, Hobart, Tasmania
Global ocean bio-records in OBIS, 2015

The vision: “Biodiversity information on every desktop” [ / device]…
• A global taxon inventory
  ◦ up-to-date species lists, synonymies, etc. (for all groups)
• Citations, links to primary literature
  ◦ direct access to the primary taxonomic literature (for all described taxa), including full text (preferably…)
• “All” georeferenced records accessible, for all species
  ◦ no need for individuals to do the data aggregation
  ◦ map local / regional / global records, show details for any data item
• Indexes of taxon traits
  ◦ e.g. to support sort / filter / group by…
• Predictive mapping / computed range maps for all taxa
  ◦ fill sampling gaps via niche modelling, to produce comprehensive global species maps
• Plus more (phylogenies, illustrations, genetics, descriptions, keys…)

A standardised approach for this talk
• Rationale for each component / activity (why do we care?)
• Size of the problem (or sub-problem)
• Who is addressing it (and what they are doing)
• How far have they (“we”) got, and how much is still to be done…
Some other points to consider:
• open vs. closed access to relevant content (who can access?)
• machine vs. human retrievability ( -> services, not just pages to view)
• degree of consolidation available (saves querying multiple resources)
• web query only, or are the base data available for export/user upload

A global taxon inventory
From presentation by Quentin Wheeler, International Institute for Species Exploration (IISE):

A global taxon inventory – why do we care?
• Useful to know with what organisms we share the planet
• History of life as its own study area, also key to understanding present life forms and their relationships
• Ensure taxonomic names do not get accidentally re-used
• Construct list once, use many times (no need to re-create from scratch)
• Reconcile old names / synonyms to current taxon concepts (important for data integration)
• Provide “taxonomic backbone” underpinning other biodiversity activities / projects.

How many species? Arthur Chapman asked the experts:

From Chapman’s summary document (2009 edition):
NB “Others” includes non-green algae, Protista, prokaryotes and viruses (refer to document for details)
“Invertebrates” includes 1m insects, 360k others (incl. 102k arachnids, 85k molluscs, 47k Crustacea)
“Estimated” total spp. for world is 11.3m, i.e. only 17% of estimated global biodiversity yet named (!)

…add another 200k-300k(?) for known fossil species, maybe multiply 2x-3x to include synonyms…
…gives upwards of 5m species names to catalogue/organise (+0.5m genera…)
+ new species descriptions (~20k/year) and higher taxa, also new combinations (??/year)
5m+ names!
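
The "5m+ names" figure can be checked with a quick back-of-envelope calculation. The sketch below is only an illustration: the named-species count is derived from the 17% and 11.3m figures quoted above (it is not read from Chapman's document), and the function name is hypothetical.

```python
# Back-of-envelope estimate of the number of species names to catalogue.
ESTIMATED_TOTAL_SPECIES = 11.3e6   # estimated world total, as quoted on the slide
NAMED_FRACTION = 0.17              # fraction already named, as quoted on the slide


def names_to_catalogue(fossil_species, synonym_multiplier):
    """Return the estimated count of species-level names (valid names plus synonyms)."""
    # Assumption: named extant species = 17% of the 11.3m estimate (about 1.9m).
    named_extant = ESTIMATED_TOTAL_SPECIES * NAMED_FRACTION
    return (named_extant + fossil_species) * synonym_multiplier


# Low and high ends of the ranges given on the slide.
low = names_to_catalogue(fossil_species=200_000, synonym_multiplier=2)
high = names_to_catalogue(fossil_species=300_000, synonym_multiplier=3)
print(f"{low / 1e6:.1f}m to {high / 1e6:.1f}m species names")
```

Under these assumptions the range straddles the 5m mark, which is consistent with "upwards of 5m" as a working figure.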

From group estimates to global lists

ION – Index to Organism Names
www.organismnames.com
2005: 1.8m names, all ranks (incl. synonyms), from 2.8m documents
2015: 5.2m names, all ranks (incl. synonyms), from 3.5m documents
– Animal names only (+ protists), cf. other resources for plants etc.
Likes:
• Comprehensive coverage (“most” zoological names held)
• Includes fossils as well as extant taxa
• Cites original publications for most post-1860 names
• Low latency (c. 6 months to name appearance in index)
• Some tax. hierarchy for all names
• ION ID minted for every name (usefulness varies)
Dislikes:
• Many more names than taxa (spelling + authority variations, synonyms, bad data) – needs deduplication before use
• Hard to work out which is “correct” name or what names are synonyms, etc.
• Some quirks in citations as given (including author spellings)
• Detailed publication and taxon info is behind paywall
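
The deduplication step mentioned under "Dislikes" could start with something as simple as the sketch below. It only merges variants that differ in case, punctuation or spacing; the sample strings are illustrative, not taken from ION, and the function names are hypothetical. Real name matching (misspellings, abbreviated authors) needs fuzzy methods, and no amount of string cleaning will tell you which names are synonyms.

```python
import re
from collections import defaultdict


def normalise(name):
    """Collapse case, punctuation and spacing differences in a name string."""
    key = name.lower()
    key = re.sub(r"[^\w\s]", " ", key)   # replace punctuation with spaces
    return " ".join(key.split())         # collapse runs of whitespace


def group_variants(names):
    """Group raw name strings that share the same normalised key."""
    groups = defaultdict(list)
    for raw in names:
        groups[normalise(raw)].append(raw)
    return dict(groups)


# Illustrative records: three string variants of one name, plus a second name.
records = [
    "Physeter macrocephalus Linnaeus, 1758",
    "Physeter macrocephalus Linnaeus 1758",
    "Physeter  macrocephalus LINNAEUS, 1758",
    "Physeter catodon Linnaeus, 1758",
]
groups = group_variants(records)
for key, variants in groups.items():
    print(f"{key}: {len(variants)} variant(s)")
```

The two groups that remain are still only names: deciding that they refer to the same taxon requires synonymy data, which is the point made about names versus taxa.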

Newly published names (all ranks) in ION, Nov 2015 (total 2.01m)
Chart legend: names from Index Animalium (1758-1850); names from Zoological Record (1864-current)

Catalogue of Life
www.catalogueoflife.org
2005: 530k valid species names (+ ?? synonyms), from 23 databases
2015: 1.6m valid species names (+ 1.3m synonyms), from 151 databases
– All taxonomic groups, extant only (a few fossils starting 2015)
Likes:
• Name quality high (all expert-supplied), synonymies explicit, no (few) duplicates
• Internally consistent taxonomic hierarchy, kingdom -> family
• Coverage increasing over time (claims currently 84% of all extant species)
Dislikes:
• Some groups not yet covered (also no fossils)
• Synonymies not always complete (some old names not listed)
• No author or synonym information at ranks above species
• No links to original literature (although these may be traceable via source databases)
• More latency than ION (takes a while for new names to appear)
• No stable IDs for names (cannot use for linking to current edition)

Partial ION listing – search for “Physeter”
Note:
1. ION often includes the same name in multiple variants (mix of “good” and “bad” content), which over-represents the number of “real” names.
2. This is a list of names, not taxa (a single taxon can have multiple names, e.g. valid name plus synonyms, which ION does not distinguish).

Equivalent Cat. of Life search result for “Physeter”:

Selected other names/taxon databases of note
PaleoBioDB (fossils) – formerly PaleoDB
• 2005: 60k names, all ranks
• 2015: 320k names, all ranks (incl. synonyms)
– Good coverage of many fossil taxa (most groups)
World Register of Marine Species (WoRMS)
• 2007: first release, 75k valid species + ?? syns
• 2015: 230k valid species + 96k synonyms
– Excellent coverage of marine taxa (almost all groups), incl. some fossils
Interim Register of Marine and Nonmarine Genera (IRMNG) – Tony Rees / OBIS project
• 2006: first release, 159k genus names incl. synonyms (the latter partly known, part not)
• 2015: 488k genus names incl. synonyms (also 1.9m species names incl. synonyms)
– Comprehensive genus-level coverage of all groups, extant + fossil, not all assigned to family as yet

For other groups (examples, NB completeness varies)
(etc., etc.)

Linking names to the literature

Linking names to the literature – why do we care?
• Initial publication / description / designated type is “anchor” for every taxonomic name and concept
• Use to verify “indexing” details (taxon name + author, year) are correctly represented
• Included text details (title < abstract < full text) can be “mined” to extract information useful for indexing (or just reading)
• Entry point to wider literature via refs. list, subsequent citations, etc.

E.g. from this (taxon name + author, year)…
to this (article full text)

Genus #1 in IRMNG: example “minimal” citation styles
Genus | Authority | Microcitation
Aa | |
Aa | Baker, 1940 |
Aa | Baker, 1940 | Bull. Bishop Mus., 165, 107
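
A microcitation like the one above only becomes useful for linking once it is split into fields. The sketch below assumes the simple "journal, volume, page" pattern shown in this example; the regular expression and function name are hypothetical, and real microcitations come in many more shapes (series numbers, parts, page ranges) that this would reject.

```python
import re

# Assumed pattern: "<journal abbreviation>, <volume>, <page>".
MICROCITATION = re.compile(
    r"^(?P<journal>.+?),\s*(?P<volume>\d+),\s*(?P<page>\d+)$"
)


def parse_microcitation(text):
    """Split a 'journal, volume, page' string into fields; return None if it does not fit."""
    match = MICROCITATION.match(text.strip())
    if match is None:
        return None
    return {
        "journal": match["journal"].strip(),
        "volume": int(match["volume"]),
        "page": int(match["page"]),
    }


parsed = parse_microcitation("Bull. Bishop Mus., 165, 107")
print(parsed)

# An authority string is not a microcitation, so it is rejected.
print(parse_microcitation("Aa Baker, 1940"))
```

Journal, volume and page are the same keys by which older scanned literature tends to be indexed, so a parsed microcitation is a reasonable starting point for looking up the original description page.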

Genus #1 in IRMNG: example “better” (=standard) citation style
Genus | Authority | Microcitation | Full citation
Aa | | |
Aa | Baker, 1940 | |
Aa | Baker, 1940 | Bull. Bishop Mus., 165, 107 |
Aa | Baker, 1940 | | Baker, H.B., 1940. Zonitid snails from Pacific Islands. Part 2.-Hawaiian genera of Microcystinae. Bulletin Bishop Museum Honolulu, 165: 105-201.
ION has a subset of these (article title, citation only)

Genus #1 in IRMNG: example “best” citation style with online links
Genus | Authority | Microcitation | Full citation | Online link (abstract) | Online link (full text)
Aa | | | | |
Aa | Baker, 1940 | | | |
Aa | Baker, 1940 | Bull. Bishop Mus., 165, 107 | | |
Aa | Baker, 1940 | | Baker, H.B., 1940. Zonitid snails from Pacific Islands. Part 2.-Hawaiian genera of Microcystinae. Bulletin Bishop Museum Honolulu, 165: 105-201. | |
Aa | Baker, 1940 | | Baker, H.B., 1940. (etc.) | http://... (or DOI) |
Aa | Baker, 1940 | | Baker, H.B., 1940. (etc.) | | http://... (or DOI)
ION has a subset of these (article title, citation only)
BioNames (R. Page project) has some of these

Online access to scientific literature – 1
Q.: How many articles in the “scientific literature”?
A.: Guesstimate might be 180m total “scholarly articles”, 120m in all sciences, 20m in biology over past 250 years
  ◦ Google Scholar: ~160m citations (all disciplines)
  ◦ Web of Science: 90m items indexed (1900 onwards)
  ◦ PubMed: 24m records (mostly 1966 onwards)
  ◦ Biological Abstracts: 12m records, 1926 onwards (includes some non-journal material)
• Ideally would like single master list, unique ID/hyperlink for each work (article/chapter/book etc.)
• DOI (Digital Object Identifier) system / CrossRef introduced in 2000, good for newly published work
  ◦ currently used for 114m “objects” (incl. some retrospective allocation; NB not all are scientific literature)
  ◦ “Publishers use CrossRef's tools to convert citations from dumb strings to useful links” (quote from R. Page discussion post)

• Zoological Record has indexed 3.5m works in zoology 1864-current (increasing at 70k/year, 1.5k/week), but individual records are behind paywall

• Biodiversity Heritage Library (BHL) is scanning older literature (esp. pre-1923) and placing online
  ◦ limited subset indexed by article title, otherwise (all) indexed by journal and page no. (then has BHL page ID – can link to that)
  ◦ search can be initiated by journal title, volume + page (if already known)
  ◦ can also search by taxon scientific name – but some instances will be missed (BHL OCR [optical character recognition] is less than 100% reliable)
  ◦ this author’s experience looking for initial publication instances of older names – success in around 1/3 of cases (not too bad), however requires manual search (time consuming)
  ◦ ideally, original description page links should be compiled somewhere for others to re-use (not currently done on any scale)

BHL sample page: American Journal of Science s4 v15 (1903) p. 312
(original description of Megablattina Sellards, 1903, a cockroach)

• More recent literature – mix of publisher websites and operations like JSTOR, often behind paywalls (though abstracts typically not so) – but not all yet available digitally (BHL also has some post-1922 content)
• Subscription/abstracting services (Zoological Record, Web of Science, etc.) have better coverage, but are often not open access for viewing or external linking purposes (although PubMed is)
• Some tools constructed around planned all-encompassing “Bibliography of Life” project (from Europe, http://biblife.org/), but progress difficult to gauge as yet (claims 215k references held); another European project: GRIB (Global References Index to Biodiversity), however development appears to have stopped…

In summary: online [open] access available to subsets of article titles > abstracts > full text in decreasing proportions
• No single comprehensive source of online refs. available at this time, users must “mix and match” sources as available
• Few direct links in current tax. databases to literature that is online (some noteworthy exceptions)
• Over 95% of taxonomic literature pre-dates year 2000 starting point for DOIs
• Most comprehensive indexes are currently commercial products (behind paywalls), not much traction in “community / open access” equivalents as yet.

Machine-readable sets of taxon traits – why do we care?
• Powerful tools for automated subsetting / filtering out sets of interest
• Useful for data quality assurance (e.g. flag suspect data, fix logical inconsistencies)
• Can form the basis of auto-response “expert systems” / keys e.g. as already available for specialised groups
• Need for standardised vocabularies / semantics for indexing terms, units used, etc.

• Operations like OBIS (Ocean Biogeographic Information System) want to display only (e.g.) marine + extant taxa, suppress others
• No “trait bank” systems existed at that time, IRMNG was created to fill this need: flag taxa as extant/fossil, marine/nonmarine
  ◦ IRMNG data & flags subsequently incorporated into other systems e.g. WoRMS, ALA, OTOL, EOL, more… – IRMNG flags are ~70% complete at genus level, 95%+ for species
• EOL (Encyclopedia of Life) is establishing “TraitBank” (2014 on) to capture similar traits + more
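
The marine/extant filtering described above amounts to a simple test on two flags per taxon. The sketch below is illustrative only: the record layout, field names and flag values are assumptions for the example, not the IRMNG schema or its actual data. It does show one design point that matters when flags are incomplete: an unassigned flag has to be kept distinct from a "no".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taxon:
    name: str
    is_marine: Optional[bool]   # None means the flag has not been assigned yet
    is_extant: Optional[bool]


def marine_extant(taxa):
    """Keep only taxa positively flagged as both marine and extant."""
    return [t for t in taxa if t.is_marine is True and t.is_extant is True]


def unflagged(taxa):
    """List taxa that still need a flag, so they can be queued for review."""
    return [t for t in taxa if t.is_marine is None or t.is_extant is None]


# Flag values below are illustrative assumptions, not database content.
taxa = [
    Taxon("Physeter", is_marine=True, is_extant=True),
    Taxon("Megablattina", is_marine=False, is_extant=False),
    Taxon("Aa", is_marine=None, is_extant=None),
]
shown = [t.name for t in marine_extant(taxa)]
pending = [t.name for t in unflagged(taxa)]
print(shown, pending)
```

With flags only about 70% complete at genus level, the "pending" list is as important as the filtered one: silently dropping unflagged taxa would hide real marine records.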

EOL TraitBank most populous content (Oct 2015)
Note, EOL is an aggregator, not an original content generator (relies on content supplied by third parties)
EOL traits recently added to Google search, Nov 2015

Room for further development in this area…
• e.g. TDWG (Taxonomic Databases Working Group) had active interest in development of “SPM” (Species Profile Model) around 2007-8, seems a bit quiet since
• character matrices stored in computer-based keys e.g. Lucid, DELTA, etc. could presumably be leveraged in some cases
• some domains already well covered in standard manner (e.g. FishBase for 33k fishes, SeaLifeBase for 71k non-fish marine taxa)
• SeaLifeBase example shown in next slide…

Displayed range data is from AquaMaps holdings, see later in this presentation

Assembling georeferenced species data

Assembling georeferenced species data – why do we care?
• “Where” is as important as “what” in biodiversity studies
• Central repository much easier point of access than thousands/millions of distributed sources
• See gaps in existing data holdings / state of current data sampling, digitisation and mobilisation
• Overlay spatial distributions with other layers e.g. country boundaries, habitats, environmental variables – generate regional lists, understand controlling factors
• Spot bad data (appearing in unlikely places on the map)
• Use for spatial analysis (geography as computable data).

Distributed data networks
• First data networks in USA, late 1990s – VertNET, HerpNET, ORNIS – connecting museum data (vertebrate specimen records) in participating agencies (also in Australia: Australian Virtual Herbarium)
• OBIS (2002 on) and GBIF (2004 on) provide gateways to both specimen and observation data from multiple agencies worldwide
• OBIS (marine species records only):
  ◦ 2005: 5.6m records from 38 data sources (40,700 species)
  ◦ 2015: 44.9m records from 1,916 data sources (147,000 species)
• GBIF (all habitats):
  ◦ 2005: 45m records from 334 data sources (?? species)
  ◦ 2015: 577m records from 15,196 data sources (?? species)
• OBIS data flows into GBIF (though with some issues), also into local networks e.g. ALA (Atlas of Living Australia)
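
To put the ten-year change in perspective, the holdings quoted above can be turned into growth factors: records grew roughly 8x in OBIS and 13x in GBIF. The sketch below just does that arithmetic on the slide's numbers; the function names are hypothetical and the annual rate assumes smooth compound growth, which is a simplification.

```python
# Holdings as quoted on the slide: (2005 value, 2015 value).
holdings = {
    "OBIS": {"records_m": (5.6, 44.9), "sources": (38, 1916)},
    "GBIF": {"records_m": (45.0, 577.0), "sources": (334, 15196)},
}


def growth_factor(pair):
    """Ratio of the 2015 value to the 2005 value."""
    start, end = pair
    return end / start


def annual_growth_rate(pair, years=10):
    """Equivalent compound annual growth rate over the period (a simplification)."""
    return growth_factor(pair) ** (1 / years) - 1


for network, data in holdings.items():
    records = growth_factor(data["records_m"])
    sources = growth_factor(data["sources"])
    rate = annual_growth_rate(data["records_m"])
    print(f"{network}: records x{records:.1f}, sources x{sources:.1f}, "
          f"about {rate:.0%} more records per year")
```

In both networks the number of data sources grew much faster than the number of records, which suggests that many of the newer contributing datasets are small.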

Building OBIS – 2002-5
(trying to make a working system, and provide a good user experience)
“OBIS v2” front page / spatial search interface, 2005

Current OBIS sample map & data
OBIS records for Physeter macrocephalus (sperm whale) in Australian region, Oct 2015 (51,756 global records)

GBIF sample map & data
GBIF records for Physeter macrocephalus in Australian region, Oct 2015 (34,436 global records)

ALA (Atlas of Living Australia) presentation of records for Physeter macrocephalus in Australian region, Oct 2015

How complete are holdings of GBIF, OBIS, etc.?
• From Hill et al., 2012 paper: at least 1 bn – 2 bn specimens in biological collections worldwide (not all currently digitised)
• Observations probably outnumber specimens by 100x - 1000x
• Gives maybe 500 bn potential records +/- ; GBIF has 0.5 bn to date (0.1%)…
• Not all records are of equal importance for initial studies of distributions (much redundancy), maybe OBIS/GBIF have <5% of most useful records at this time…
• Existing holdings presently heavily skewed towards better sampled/accessible areas, also regions where digitisation is more advanced
• True “target numbers” difficult to assess (every individual of every species, or what?)
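
The 0.1% figure follows from multiplying out the ranges above. The sketch below reproduces that arithmetic using only the numbers on the slide; the variable names are hypothetical, and 500 bn is treated as a working figure inside a range that spans more than an order of magnitude.

```python
# Ranges quoted on the slide.
specimens = (1e9, 2e9)               # specimens in collections worldwide
observations_per_specimen = (100, 1000)

# Bounds on the number of potential records (specimens x observation multiplier).
potential_low = specimens[0] * observations_per_specimen[0]
potential_high = specimens[1] * observations_per_specimen[1]

working_total = 500e9                # the slide's "maybe 500 bn"
gbif_records = 0.5e9                 # the slide's rounded GBIF holdings

share = gbif_records / working_total
print(f"potential records: {potential_low:.0e} to {potential_high:.0e}")
print(f"GBIF share of working total: {share:.1%}")
```

Because the bounds run from 100 bn to 2,000 bn, the share could plausibly be several times higher or lower than 0.1%, so the figure is best read as an order of magnitude.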

Predicted distributions (environmental niche modelling)
From AquaMaps poster for e-biosphere conference, London, 2009

Predicted distributions (environmental niche modelling) – why do we care?
• Available georeferenced data are always incomplete, need a mechanism to intelligently fill in data gaps, produce more complete biodiversity maps & atlases
• Move from hand drawn maps / non-digital “expert knowledge” to computable data
• Model potential spread of invasives into new areas (show suitable habitat)
• Model potential changes in species range in response to changing climate or other factors
• Facilitate better understanding of broad- (and fine-) scale factors controlling species distributions.

Niche modelling concept
• Range of methodologies available including MAXENT, GARP, simple niche models e.g. Relative Environmental Suitability (RES)
• Ready et al., 2010 (incl. Tony Rees) contend that simple methods work as well as more complex ones:
Source: A. Guisan group web page, Université de Lausanne, Switzerland
http://www.unil.ch/idyst/en/home/menuinst/research-poles/geoinformatics-and-spatial-m/predictive-biogeography/advancing-the-science-of-eco.html

J. Ready et al., Ecological Modelling, 2010

Global niche modelling/mapping projects
Lifemapper: Kansas University, c. 2003 onwards
• Models terrestrial niches (?only)
• No. of maps unclear (claims >100,000 species with data, perhaps only a subset with maps)
• Uses GARP modelling (computationally intensive, several hours per species map?), no expert review
AquaMaps: Kiel Marine Lab (+ co-developers), 2006 onwards
• Models marine niches only (plus some freshwater)
• 22,000 species mapped by Nov 2015 (incl. ~600 FW), mainly fishes
• Uses RES modelling (6*/8** environmental variables, <2 mins per species map) plus geographic partitioning and expert review
* Marine variables: bottom depth, water temperature (SST/bottom), salinity, primary production, sea ice concentration, distance to land
** FW variables: elevation, surface temperature, net primary productivity, soil pH, soil moisture, soil organic carbon, precipitation, compound topographic index
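
The slides do not spell out how a simple envelope model such as RES scores a grid cell. As a minimal sketch, assume each environmental variable gets a trapezoidal response (0 outside the species' absolute limits, 1 inside its preferred range, linear in between) and that the per-variable scores are multiplied. That form is an assumption made here for illustration, not the AquaMaps code, and the envelope values and variable names are invented.

```python
def trapezoid(value, lo, pref_lo, pref_hi, hi):
    """Suitability in [0, 1]: 0 outside [lo, hi], 1 inside [pref_lo, pref_hi], linear between."""
    if value < lo or value > hi:
        return 0.0
    if value < pref_lo:
        return (value - lo) / (pref_lo - lo)
    if value > pref_hi:
        return (hi - value) / (hi - pref_hi)
    return 1.0


def cell_suitability(env, envelope):
    """Multiply per-variable scores, so one unsuitable variable rules the cell out."""
    score = 1.0
    for variable, limits in envelope.items():
        score *= trapezoid(env[variable], *limits)
    return score


# Hypothetical envelope: (min, preferred min, preferred max, max) per variable.
envelope = {"depth_m": (0, 10, 200, 500), "sst_c": (5, 8, 16, 20)}

core_cell = {"depth_m": 100, "sst_c": 12}      # inside both preferred ranges
marginal_cell = {"depth_m": 350, "sst_c": 18}  # between preferred and absolute limits
too_warm_cell = {"depth_m": 100, "sst_c": 25}  # outside the temperature limits

for cell in (core_cell, marginal_cell, too_warm_cell):
    print(cell, cell_suitability(cell, envelope))
```

A model of this kind needs only a handful of comparisons per cell and variable, which fits the very short per-species run times quoted for RES compared with GARP.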

Lifemapper team at KU (Kansas University, USA)

Lifemapper example map
Lifemapper example map for milk or Spanish snail (Helix lactea, now = Otala lactea)
(yellow dots are data points, red is potential habitat)

Building AquaMaps – 2005
(trying to make a working system, and the models fit the data…)

AquaMaps example map for New Zealand sea lion (Phocarctos hookeri)
(without expert review): Data points used

(without expert review): Computed AquaMap

(without expert review): All suitable habitat

(without expert review): All suitable habitat (detail)
Detail (square size = 50 km nominal for global coverage)

AquaMaps example maps for New Zealand sea lion (Phocarctos hookeri)
(without expert review): Current vs. computed year 2100 range
2010 2100

From hand drawn map to computable data…
(this example: Physeter macrocephalus)
Hand drawn map from http://us.whales.org/species-guide/sperm-whale
AquaMaps map (including probability of occurrence by 0.5 degree square)

AquaMaps summary plot – all marine species overlaid (22,800 as at Nov 2015)
Click on the map to generate “what lives here” queries…
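
Once predictions are stored per 0.5 degree square, a "what lives here" query reduces to finding the clicked cell and filtering species by predicted probability. The sketch below shows the idea with a plain dictionary; the cell indexing scheme, the threshold and the probability values are all assumptions made for illustration and are not the AquaMaps implementation or its data.

```python
import math

CELL_SIZE = 0.5  # degrees, matching the 0.5 degree squares mentioned in the talk


def cell_index(lat, lon, size=CELL_SIZE):
    """Return the (row, column) of the grid cell containing a point."""
    return (math.floor(lat / size), math.floor(lon / size))


def what_lives_here(lat, lon, predictions, threshold=0.5):
    """List species whose predicted probability in the clicked cell meets the threshold."""
    cell = cell_index(lat, lon)
    return sorted(
        species
        for species, cells in predictions.items()
        if cells.get(cell, 0.0) >= threshold
    )


# Hypothetical predictions: species -> {cell: probability of occurrence}.
predictions = {
    "Rhincodon typus": {cell_index(-33.4, 151.5): 0.6},
    "Physeter macrocephalus": {
        cell_index(-33.4, 151.5): 0.3,
        cell_index(-40.1, 155.2): 0.9,
    },
}

# A click elsewhere in the same half-degree square returns the same answer.
result = what_lives_here(-33.25, 151.7, predictions)
print(result)
```

Storing predictions by cell rather than by point is what makes the summary plot and the per-location species list cheap to compute from the same data.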

AquaMaps spatial query result, November 2015:

Any primary data to back this up? E.g. search in Atlas of
Living Australia…

Yes, there ARE whale sharks on NSW central coast…

Niche modelling is a very current discussion topic: e.g. GBIF/AMNH symposium
4 November 2015 (no proceedings as yet)

So – “are we there yet?”

A report card to date…
Component | 2005 | 2015 | Status (/5)
Global taxon inventory – all species names (with synonyms) | 25%? | 60%+ | nnn(n)
All names linked to the literature (original descriptions), at least minimally | 5%? | 10-20%? | n(n)
Taxon traits databased, in machine-addressable form | 0 | 10%+? | n(n)
Distribution data (specimens, observations) in online systems | <1% | 5%+? | n
Predicted distributions/global range maps for all taxa | 0? | 5%+? (fishes 60%+) | n(n)

• Take home message: progress is definitely being made, however plenty still to do:
  ◦ Complete master names lists, release as open data (also deal with inflow of new names and taxonomic dynamism)
  ◦ Improve online access to tax. literature (plus embedded links from relevant databases)
  ◦ More data into OBIS & GBIF (including datasets not yet digitised)
  ◦ More progress on predictive mapping (algorithms, base data, habitat factors, species covered).

Thank you!
Tony Rees, Tony.Rees@marinespecies.org
◦ CSIRO Marine Research applications developer 1998-2014 including CAAB
(Codes for Australian Aquatic Biota), c-squares and Taxamatch
◦ OBIS steering committees (various) / system developer 2002-2005
◦ AquaMaps project co-developer 2004-current
◦ IRMNG developer 2006-current
◦ OBIS Australia Node manager 2006-2014
◦ Global Names Project collaborator 2006-current
◦ WoRMS contributor 2007-current
◦ GBIF & Open Tree of Life collaborator 2010-current
◦ iPlant collaborator 2010-2013
◦ Atlas of Living Australia consultant 2010-2012
◦ Catalogue of Life global team member 2010-2012
◦ GBIF Ebbe Nielsen Prize (for excellence in Biodiversity Informatics) winner 2014.
This talk available at: www.slideshare.net/tony1212/presentations
