WikiFactMine for Phytochemistry
Tom Arrow1, Charles Matthews1, Jenny Molloy1,2, Ross Mounce1,2, Peter Murray-Rust1,3, Richard Smith-Unna1,2, Lars Willighagen1
1 The ContentMine, Cambridge, CB4 2HY; 2 Dept of Plant Sciences, University of Cambridge; 3 Dept of Chemistry, University of Cambridge
Mining the scientific literature for facts
All software (Apache2) and data (CC0) are Open. http://github.com/ContentMine . ContentMine.org is a not-for-profit UK company.
We thank The Shuttleworth Foundation for a Fellowship to PMR and The Wikimedia Foundation for funding for TA and CM. Contact peter@contentmine.org
ContentMine and Wikidata
Wikidata is “Wikipedia for machines” and supports
ContentMine’s FullContent search of the bioscience literature.
We go beyond keywords to automatically generated,
structured dictionaries with thousands of terms and aliases.
FullContent means not just words, but structured documents,
tables and diagrams. We (and you) can search the whole
literature (via EuropePMC or Crossref) automatically every day,
or retrospectively for your sub-areas of interest.
Example:
Find facts about terpenes emitted by conifers in Indonesia.
We autogenerate 3 large dictionaries from Wikidata: one for all
terpenes, one for conifers and one for Indonesian place/island names.
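
The example query can be read as a co-occurrence test: a passage is of interest only if it contains at least one hit from each of the three dictionaries. The following is a minimal sketch of that idea, not the ContentMine implementation. The three small term sets, the function names and the sample sentence are all made up for illustration; real dictionaries would be generated from Wikidata and contain thousands of terms.

```python
import re

# Hypothetical miniature dictionaries (illustrative only).
TERPENES = {"carvone", "limonene", "pinene"}
CONIFERS = {"pinus merkusii", "agathis dammara"}
PLACES = {"sumatra", "java", "sulawesi"}


def hits(text, terms):
    """Return the dictionary terms that occur in text as whole words."""
    lowered = text.lower()
    return {
        term for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }


def matches_all(text, dictionaries):
    """True if every dictionary contributes at least one hit."""
    return all(hits(text, terms) for terms in dictionaries)


# Invented sentence, used only to exercise the functions.
sentence = "Needles of Pinus merkusii in Sumatra emit limonene."
print(matches_all(sentence, [TERPENES, CONIFERS, PLACES]))
```

Requiring a hit from every dictionary is what turns three broad term lists into one narrow query.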
		
Introduction
Understanding phytochemical diversity and metabolism can
answer many important scientific questions and provide
economically important information, forming the foundation
for metabolic engineering of plant compounds. Phytochemical
database resources exist, but much information on the
association of compounds with species, enzymes and places is
published without the standardised format and metadata required
to enable machine analysis. In some cases it is painstakingly
extracted manually, but this approach is not scalable.
Semi-automated extraction of phytochemical data across the
full-text open access literature is anticipated to significantly
extend previous abstract-only coverage. Here we present an
open source pipeline and preliminary results for terpene data
mining.
Reusable WikiFactMine Dictionaries. We expand the Wikidata term terpene automatically to ~450 items (such as carvone), giving >1000 precise search terms and data. Similarly, in
a few seconds we can generate dictionaries of conifers (1899) and Indonesian islands (6344), making broad queries precise.
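
One way such an expansion could be expressed is as a SPARQL query against Wikidata that walks "instance of" (P31) and "subclass of" (P279) links from a root item and collects labels and aliases. The sketch below only builds the query text and flattens result rows into a lookup table; it does not contact any service. The function names, the row layout and the identifiers in the usage lines are assumptions for illustration, and the actual WikiFactMine queries may differ.

```python
def build_dictionary_query(root_qid, language="en"):
    """Build a SPARQL query for items under root_qid, with labels and aliases.

    root_qid is a Wikidata item identifier such as "Q12345" (placeholder).
    """
    if not (root_qid.startswith("Q") and root_qid[1:].isdigit()):
        raise ValueError("expected a Wikidata item id like 'Q12345'")
    return f"""
SELECT ?item ?label ?alias WHERE {{
  ?item wdt:P31?/wdt:P279* wd:{root_qid} .
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "{language}")
  OPTIONAL {{
    ?item skos:altLabel ?alias .
    FILTER(LANG(?alias) = "{language}")
  }}
}}
"""


def rows_to_terms(rows):
    """Flatten result rows into a lowercase term -> item id lookup."""
    terms = {}
    for row in rows:
        for key in ("label", "alias"):
            value = row.get(key)
            if value:
                terms[value.lower()] = row["item"]
    return terms


# Placeholder identifier and rows, not real Wikidata values.
query = build_dictionary_query("Q12345")
rows = [{"item": "item-1", "label": "Carvone", "alias": "Example alias"}]
print(rows_to_terms(rows))
```

Because every label and alias becomes its own search term, a few hundred items can yield more than a thousand terms.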
Search Strategies.
(A) Daily search. All new Open publications (300-1000) on EuropePMC are downloaded to WikimediaLabs, indexed by dictionaries, and the extracted facts (dictionary
hits) stored in Zenodo (CERN’s Open repository). Each paper may have hundreds or thousands of facts.
(B) On-Demand. A researcher, especially one doing systematic reviews, creates a fairly general query in her field with a range of dates, journals, etc. and downloads
papers (getpapers and quickscrape). The papers are filtered locally with a much more precise query (norma/ami).
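
In both strategies a "fact" is a dictionary hit located in a paper. A minimal sketch of that indexing step follows, assuming a dictionary is a mapping from lowercase terms to item identifiers. The record fields, the function name and the sample text are hypothetical and are not the output format of ami.

```python
import re


def extract_facts(paper_id, text, dictionary, context=30):
    """Return one record per dictionary hit, ordered by position in text."""
    facts = []
    for term, item in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            facts.append({
                "paper": paper_id,
                "term": term,
                "item": item,
                "offset": match.start(),
                "context": text[start:end],  # surrounding snippet
            })
    return sorted(facts, key=lambda fact: fact["offset"])


# Invented text and identifiers, for illustration only.
sample = "Carvone and limonene were detected; carvone dominated."
lookup = {"carvone": "item-1", "limonene": "item-2"}
for fact in extract_facts("paper-001", sample, lookup):
    print(fact["offset"], fact["term"], fact["item"])
```

Keeping the offset and a context snippet with each hit lets a reader check a fact against the source sentence.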
[Figure: workflow diagram. Labels: Publisher Sites; getpapers, quickscrape; Tidying (PDF); Norma/ami; Tagging; Dictionary Search (Diseases, Drugs, Phytochem, Species, Genes); Text, Figures, Data; Automatically Extracted Indexed Facts; Science Search; Data Search; Researcher FileStore. Path A: all daily, 30,000 pages/day. Path B: Researcher FileStore.]
A. All EPMC papers are downloaded every day and the facts are
extracted into Zenodo and made publicly available.
B. Researcher searches repositories and also scrapes publisher
sites for whatever chunk of the literature she wants. She runs
local dictionaries and saves the results to disk, where they can
be further analyzed. She can add any papers she has legal access
to and re-run whenever required. E.g. Bag Of Words is a powerful
tool for classifying papers.
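
As a rough illustration of the bag-of-words idea, a paper can be reduced to word counts and assigned to whichever labelled reference text it resembles most by cosine similarity. This is a generic sketch under simple assumptions (lowercased alphabetic tokens, no stop-word removal), not the method used in the ContentMine tools. The category names and reference texts are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens in text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, labelled_texts):
    """Return the label whose reference text is most similar to text."""
    bag = bag_of_words(text)
    return max(
        labelled_texts,
        key=lambda label: cosine(bag, bag_of_words(labelled_texts[label])),
    )


# Invented reference texts, for illustration only.
references = {
    "phytochemistry": "terpene emission monoterpene conifer needles",
    "phylogenetics": "phylogenetic tree clade species divergence",
}
print(classify("monoterpene emission from conifer needles", references))
```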
(Bio)chemical transformations
A. Diagrams of chemical and biochemical reactions can be automatically extracted from
PDFs into the researcher’s filestore.
Phylogenetics
B. Phylogenetic trees can be automatically extracted from bitmap
diagrams or PDFs, and species names verified. Mounce, Murray-Rust,
Wills: http://doi.org/10.3897/rio.3.e13589
Tables and graphs
C. Tables and graphs can be automatically extracted into the
researcher’s filestore and turned into CSV tables or spectra.
Designed for re-use with your favourite tools (R, Python, etc.)
INTELLIGENT QUERIES
INTELLIGENT CONTENT
