WikiFactMine for Plant Chemistry

WikiFactMine for Phytochemistry
Tom Arrow1, Charles Matthews1, Jenny Molloy1,2, Ross Mounce1,2, Peter Murray-Rust1,3, Richard Smith-Unna1,2, Lars Willighagen1
1The ContentMine, Cambridge, CB4 2HY, 2Dept of Plant Sciences, University of Cambridge, 3Dept of Chemistry, University of Cambridge
Mining the scientific literature for facts
All software (Apache2) and Data (CC0) are Open. http://github.com/ContentMine . ContentMine.org is a not-for-profit UK company.
We thank The Shuttleworth Foundation for a Fellowship to PMR and The Wikimedia Foundation for funding for TA and CM. Contact peter@contentmine.org
ContentMine and Wikidata
Wikidata is “Wikipedia for machines” and supports
ContentMine’s FullContent search of the Bioscience literature.
We go beyond keywords to automatically generated
structured dictionaries with thousands of terms and aliases.
FullContent means not just words, but structured documents,
tables and diagrams. We (and you) can search the whole
literature (via EuropePMC or Crossref) every day automatically
or retrospectively for your sub-areas of interest.
Example:
Find facts about terpenes emitted by conifers in Indonesia.
We autogenerate 3 large dictionaries for all terpenes, conifers
and Indonesian place/island names in Wikidata.
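
The example query can be read as a co-occurrence search: a passage is a candidate fact only if it contains at least one term from each of the three dictionaries. The following is a minimal sketch of that idea, not the ContentMine implementation. The function names, the tiny hand-written dictionaries and the test sentence are all invented for illustration; real dictionaries would be generated from Wikidata.

```python
import re


def find_hits(text, dictionary):
    """Return the dictionary terms found in text as whole words (case-insensitive)."""
    hits = set()
    for term in dictionary:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.add(term)
    return hits


def cooccurrence(text, dictionaries):
    """Return the hits per dictionary, or None unless every dictionary matches."""
    result = {name: find_hits(text, terms) for name, terms in dictionaries.items()}
    return result if all(result.values()) else None


# Illustrative stand-ins only; the real dictionaries hold hundreds to thousands of terms.
DICTIONARIES = {
    "terpene": {"carvone", "limonene", "alpha-pinene"},
    "conifer": {"Pinus merkusii", "Agathis dammara"},
    "place": {"Sumatra", "Java", "Sulawesi"},
}

# An invented sentence, not taken from any paper.
sentence = "Needles of Pinus merkusii collected in Sumatra emitted limonene and alpha-pinene."
facts = cooccurrence(sentence, DICTIONARIES)
```

Requiring a hit in every dictionary is what makes a broad topic ("terpenes", "conifers", "Indonesia") precise: a sentence mentioning only a terpene returns `None`.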

Introduction
Understanding phytochemical diversity and metabolism can
answer many important scientific questions and provide
economically important information, forming the foundation
for metabolic engineering of plant compounds. Phytochemical
database resources exist, but much information on the
association of phytochemicals with species, enzymes and places
is published without the standardised format and metadata
required to enable machine analysis. In some cases it is
painstakingly extracted manually, but this approach is not scalable.
Semi-automated extraction of phytochemical data across the
full-text open access literature is anticipated to significantly
extend previous abstract-only coverage. Here we present an
open source pipeline and preliminary results for terpene data
mining.
Reusable WikiFactMine Dictionaries. We expand the Wikidata term terpene automatically to ~450 items (such as carvone), giving >1000 precise search terms and data. Similarly, in
a few seconds we can generate dictionaries of conifers (1899) and Indonesian islands (6344), making broad queries precise.
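
The poster does not state how the expansion is performed. One plausible route, sketched here as an assumption, is a SPARQL query that collects every Wikidata item that is an instance (P31) or subclass (P279) of a root item, together with its label and aliases. The helper names `build_dictionary_query` and `rows_to_dictionary` are hypothetical, the root ID `Q0` is a placeholder rather than the real ID of "terpene", and the sample rows are hand-written. The query relies on the `wd:`, `wdt:`, `rdfs:` and `skos:` prefixes that the Wikidata Query Service predefines.

```python
def build_dictionary_query(root_qid, language="en"):
    """Build a SPARQL query for items that are instances or subclasses
    (at any depth) of root_qid, with their labels and optional aliases."""
    if not (root_qid.startswith("Q") and root_qid[1:].isdigit()):
        raise ValueError("expected a Wikidata item ID such as 'Q42'")
    return f"""
SELECT DISTINCT ?item ?itemLabel ?alias WHERE {{
  ?item wdt:P31?/wdt:P279* wd:{root_qid} .
  ?item rdfs:label ?itemLabel . FILTER(LANG(?itemLabel) = "{language}")
  OPTIONAL {{ ?item skos:altLabel ?alias . FILTER(LANG(?alias) = "{language}") }}
}}
""".strip()


def rows_to_dictionary(bindings):
    """Group rows in SPARQL JSON results format into {item URI: sorted names}."""
    names = {}
    for row in bindings:
        entry = names.setdefault(row["item"]["value"], set())
        entry.add(row["itemLabel"]["value"])
        if "alias" in row:  # alias is OPTIONAL in the query
            entry.add(row["alias"]["value"])
    return {uri: sorted(terms) for uri, terms in names.items()}


# Placeholder root ID; substitute the actual Wikidata ID of the term to expand.
query = build_dictionary_query("Q0")

# Hand-written sample rows shaped like the "results.bindings" list of a SPARQL JSON response.
SAMPLE_BINDINGS = [
    {"item": {"value": "http://example.org/entity/X1"},
     "itemLabel": {"value": "carvone"},
     "alias": {"value": "p-mentha-6,8-dien-2-one"}},
    {"item": {"value": "http://example.org/entity/X1"},
     "itemLabel": {"value": "carvone"}},
]
dictionary = rows_to_dictionary(SAMPLE_BINDINGS)
```

Each item contributes its label plus every alias, which is how a few hundred items can yield more than a thousand search terms.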
Search Strategies.
(A) Daily search. All new Open publications (300-1000) on EuropePMC are downloaded to WikimediaLabs, indexed by dictionaries, and the extracted facts (dictionary
hits) stored in Zenodo (CERN’s Open repository). Each paper may have hundreds or thousands of facts.
(B) On-Demand. A researcher, especially one doing systematic reviews, creates a fairly general query in her field with a range of dates, journals, etc. and downloads
papers (getpapers and quickscrape). The papers are filtered locally with a much more precise query (norma/ami).
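
The on-demand strategy is a two-stage filter: a broad query by date and journal, then a precise local query. The sketch below illustrates only that logic in plain Python. It does not call or reproduce getpapers, quickscrape, norma or ami, and the `Paper` record, function names and sample papers are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    title: str
    journal: str
    published: date
    text: str


def broad_filter(papers, start, end, journals=None):
    """Stage 1: keep papers in the date range and, if given, in the listed journals."""
    return [
        p for p in papers
        if start <= p.published <= end and (journals is None or p.journal in journals)
    ]


def precise_filter(papers, required_terms):
    """Stage 2: keep papers whose text mentions every required term (case-insensitive)."""
    wanted = [t.lower() for t in required_terms]
    return [p for p in papers if all(t in p.text.lower() for t in wanted)]


# Invented records standing in for downloaded papers.
PAPERS = [
    Paper("Paper one", "Hypothetical Journal A", date(2017, 3, 1),
          "Limonene emission from conifer needles."),
    Paper("Paper two", "Hypothetical Journal A", date(2017, 5, 9),
          "Leaf morphology in ferns."),
    Paper("Paper three", "Hypothetical Journal B", date(2015, 1, 20),
          "Limonene in conifer resin."),
]

candidates = broad_filter(PAPERS, date(2017, 1, 1), date(2017, 12, 31))
selected = precise_filter(candidates, ["limonene", "conifer"])
```

Keeping the second stage local means the precise query can be changed and re-run without downloading the papers again.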
[Figure: workflow diagram. Labels: Publisher Sites; getpapers, quickscrape; Researcher FileStore; Norma/ami; Tidying (PDF); Tagging; Text, Figures, Data; Dictionary Search (Diseases, Drugs, Phytochem, Species, Genes); Automatically Extracted Indexed Facts; Science Search; Data Search. Panel A: all daily, 30,000 pages/day. Panel B: Researcher FileStore.]
A. All EPMC papers are
downloaded every day
and the facts are
extracted into Zenodo
and made publicly
available.
B. Researcher searches
repositories and also scrapes
publisher sites for whatever
chunk of the literature she
wants. She runs local
dictionaries and saves the
results to disk where they can
be further analyzed. She can
add any papers she has legal
access to and re-run
whenever required. E.g. Bag
Of Words is a powerful tool
for classifying papers.
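
As an illustration of the Bag Of Words idea mentioned above, the sketch below counts word frequencies and assigns a paper to whichever labelled reference bag it most resembles by cosine similarity. This is a generic textbook approach, not necessarily the one used in the ContentMine tools; the function names, labels and reference texts are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text, stopwords=frozenset()):
    """Count lower-cased alphabetic tokens, ignoring any stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(text, references):
    """Return the label whose reference bag is most similar to the text."""
    bag = bag_of_words(text)
    return max(references, key=lambda label: cosine_similarity(bag, references[label]))


# Invented reference texts; in practice these would come from papers already labelled.
REFERENCES = {
    "phytochemistry": bag_of_words("terpene emission limonene conifer needles terpene"),
    "phylogenetics": bag_of_words("phylogenetic tree species clade tree"),
}

label = classify("Limonene is a terpene emitted by conifer needles", REFERENCES)
```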
(Bio)chemical transformations and Phylogenetics
A. Diagrams of chemical and biochemical
reactions can be automatically extracted from
PDFs into the Researcher’s filestore.
B. Phylogenetic trees can be automatically extracted from bitmap
diagrams or PDFs, and species names verified. Mounce, Murray-Rust,
Wills: http://doi.org/10.3897/rio.3.e13589
Tables and graphs
C. Tables and graphs can be automatically extracted into
researcher’s filestore and turned into CSV tables or spectra.
Designed for re-use with your favourite tools (R, Python, etc.)
INTELLIGENT QUERIES
INTELLIGENT CONTENT
