Text Mining Research Papers PDF
Petr Knoth
Phil Gooch
Mendeley
22 September 2015
What is text mining?
‘the process of deriving high-quality information from text’ (Wikipedia)
[Figure: text-mining pipeline. Named Entity Recognition (NER), relation extraction and semantic analysis populate a knowledge base, which supports search, reasoning, ... and visualisation.]
Example 1: Literature Based Discovery
● A range of techniques (Smalheiser, 2012)
● A typical approach: ABC method (e.g. Hristovski et al. 2008):
○ A affects/binds/regulates/interacts with B
○ B affects/binds/regulates/interacts with C
○ A and C are not explicitly linked in any article
=> There might be an undiscovered relationship between A and C
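As a minimal sketch of the ABC idea, the snippet below treats extracted relations as (subject, object) pairs and proposes every A-C pair that is connected through at least one B but never linked directly. The function name `abc_hypotheses`, the placeholder terms and the ranking by number of shared B terms are all assumptions made for illustration; they are not taken from the cited work.

```python
from collections import defaultdict


def abc_hypotheses(relations):
    """Return (a, c, [b, ...]) tuples for A-C pairs linked only through some B."""
    targets = defaultdict(set)
    for subj, obj in relations:
        targets[subj].add(obj)

    hypotheses = defaultdict(set)
    for a, bs in targets.items():
        for b in bs:
            for c in targets.get(b, ()):
                # Skip self-loops and pairs that are already linked in either direction.
                if c == a or c in targets[a] or a in targets.get(c, ()):
                    continue
                hypotheses[(a, c)].add(b)

    # Naive ranking (an assumption): more distinct intermediate B terms ranks higher.
    return sorted(
        ((a, c, sorted(bs)) for (a, c), bs in hypotheses.items()),
        key=lambda h: (-len(h[2]), h[0], h[1]),
    )


# Placeholder relations, invented purely to exercise the function.
relations = [
    ("drug_X", "protein_P"),
    ("drug_X", "pathway_Q"),
    ("protein_P", "disease_D"),
    ("pathway_Q", "disease_D"),
    ("protein_P", "disease_E"),
    ("drug_X", "disease_E"),  # already linked directly, so not a hypothesis
]

for a, c, via in abc_hypotheses(relations):
    print(f"{a} -> {c} via {', '.join(via)}")
```

A real system would typically weight each link by evidence (for example, how many articles support it) rather than simply counting intermediate terms.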
[Figure: ABC discovery workflow. Entity recognition (symptoms, conditions, therapeutic agents such as drugs) and term and relation extraction produce a list of terms and lists of B and C with semantic annotations, held in a knowledge base. A ranker returns a ranked list of hypotheses connecting A and C to the medical researcher.]
Example 1: Visualising the A-C hypotheses
Source: Herrmannova & Knoth (2012) Visual Search for Supporting Content Exploration in Large Document Collections, D-Lib Magazine 18(8).
How can I get started?
• Get the data
• Design/build your workflow
• Select a framework, tools, services to be used in implementing the
workflow
• Understand how to evaluate the performance of each component
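For the evaluation step, one common starting point is to compare a component's output with a hand-annotated gold standard and report precision, recall and F1. The sketch below assumes annotations are (start, end, label) tuples and counts only exact matches; the function name and the sample offsets are made up for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two collections of annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations as (start offset, end offset, label).
gold = {(0, 28, "Disease"), (30, 33, "Disease"), (80, 91, "AnatomicalSite")}
predicted = {(0, 28, "Disease"), (80, 91, "AnatomicalSite"), (93, 98, "Disease")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is strict: a prediction that overlaps a gold span but has different boundaries counts as both a false positive and a false negative, so relaxed (overlap-based) matching is sometimes reported alongside it.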
Full-text Open Access article sources
• Subject repositories:
• arXiv bulk data
• PubMed OA Subset
• Institutional repositories:
• ~3k across the world, see Directory of Open Access Repositories
• Aggregators:
• CORE (API, Data dumps)
Other full-text sources
Typically for non-commercial research only.
Example: Part of speech tagging
[Acute]JJ [lymphoblastic]JJ [leukemia]NN ([ALL]NN) [leads]VB [to]IN [an]DT
[accumulation]NN [of]IN [immature]JJ [lymphoid]JJ [cells]NN [into]IN [the]DT [bone]NN
[marrow]NN, [blood]NN [and]CC [other]JJ [organs]NN.
NP
[with]IN [good]JJ [quality of life]PP.
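The bracket notation used in this example, `[token]TAG`, can be read mechanically into (token, tag) pairs, which is a convenient form for passing tagger output to later steps. This is only a sketch of reading the notation; the function name `parse_tagged` is an assumption, and the code does not perform any tagging itself.

```python
import re

# Matches "[some text]TAG", where TAG is a run of letters (e.g. JJ, NN, Disease).
TAGGED = re.compile(r"\[([^\]]+)\]([A-Za-z]+)")


def parse_tagged(text):
    """Return a list of (token, tag) pairs found in bracket-annotated text."""
    return TAGGED.findall(text)


sentence = (
    "[Acute]JJ [lymphoblastic]JJ [leukemia]NN ([ALL]NN) [leads]VB [to]IN [an]DT "
    "[accumulation]NN"
)
pairs = parse_tagged(sentence)

# Example use: keep only the tokens tagged as nouns.
nouns = [token for token, tag in pairs if tag == "NN"]
print(nouns)
```

The same pattern also reads annotations with longer labels and multi-word spans, such as `[bone marrow]AnatomicalSite`.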
Example: Named entity recognition
[Acute lymphoblastic leukemia]Disease ([ALL]Disease) [RESULT]VP [accumulation]NN
[of]IN [immature lymphoid cells]AnatomicalSite [into]IN [the]DT [bone marrow]AnatomicalSite,
[blood]AnatomicalSite [and]CC [other organs]AnatomicalSite.
Duration
[with]IN [good quality of life]Outcome.
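One simple way to produce labels like those above is a dictionary (gazetteer) lookup that prefers the longest matching phrase. The sketch below is a toy illustration under that assumption, not the method behind the example: the gazetteer entries, the `tag_entities` name and the whitespace-level tokenisation are all invented for the purpose.

```python
import re

# Toy gazetteer keyed by lower-cased token tuples (illustrative entries only).
GAZETTEER = {
    ("acute", "lymphoblastic", "leukemia"): "Disease",
    ("bone", "marrow"): "AnatomicalSite",
    ("blood",): "AnatomicalSite",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_entities(text):
    """Return (phrase, label) pairs using greedy longest-match dictionary lookup."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + n]))
            if label:
                entities.append((" ".join(tokens[i:i + n]), label))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return entities


text = (
    "Acute lymphoblastic leukemia (ALL) leads to an accumulation of immature "
    "lymphoid cells into the bone marrow, blood and other organs."
)
print(tag_entities(text))
```

Dictionary lookup misses anything not in the list and cannot resolve ambiguous terms, which is why practical systems usually combine it with statistical or rule-based models.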
Example: Named entity recognition
● Here, ALL is automatically expanded to its full form: acute lymphoblastic leukemia
● ALL is automatically labelled as DiseaseOrSyndrome, because the initial mention was also
labelled as DiseaseOrSyndrome
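The two behaviours described above can be approximated with a simple heuristic: when a parenthesised upper-case short form follows words whose initials spell it, record the expansion, then copy the long form's label to the short form. This is a rough sketch under that assumption; the function names are hypothetical and the initial-letter rule is far cruder than what a production tool would use.

```python
import re


def find_abbreviations(text):
    """Map short forms to long forms for patterns like 'long form (SF)'."""
    mapping = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        short = match.group(1)
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(short):]
        initials = "".join(word[0] for word in candidate).upper()
        # Accept only if the preceding words' initials spell the short form.
        if len(candidate) == len(short) and initials == short:
            mapping[short] = " ".join(candidate).lower()
    return mapping


def propagate_labels(entity_labels, abbreviations):
    """Give each short form the label already assigned to its long form."""
    labels = dict(entity_labels)
    for short, long_form in abbreviations.items():
        if long_form in labels:
            labels[short] = labels[long_form]
    return labels


text = "Acute lymphoblastic leukemia (ALL) leads to an accumulation of immature lymphoid cells."
abbreviations = find_abbreviations(text)
labels = propagate_labels(
    {"acute lymphoblastic leukemia": "DiseaseOrSyndrome"}, abbreviations
)
print(abbreviations, labels)
```

The heuristic fails for short forms that skip words or use inner letters, so it should be read as an illustration of the idea rather than a usable expander.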
Example: Named entity recognition
● Mimir
○ analysing, comparing and contrasting research findings
● EEXCESS
○ Recommend related resources
○ Narrative paths between resources
Existing challenges
● Harmonising metadata formats across publishers, repositories, etc.
● Agreeing on standards, ontologies and formats for sharing the outputs of
tools that text-mine research publications, and of their individual components
● Integration of text-mining tools with content providers’ systems
● Building and maintaining text-mining web-services for research (building
blocks)
● Promoting and adopting end-user tools utilising text-mining in
researchers’ daily workflows
● Building gold standards for various text-mining tasks and sharing them
across researchers (issue of credit)
Events and initiatives
● Events
○ International Workshop on Mining Scientific Publications (WOSP)
○ Joint Conference on Digital Libraries (JCDL)
○ ACL, COLING, IJCNLP, CoNLL, LREC, etc.
● Projects
○ OpenMinTeD (EC funded)
○ EEXCESS
○ OpenAIRE
Thanks for listening ...
datascience@mendeley.com