Received January 21, 2021, accepted February 18, 2021, date of publication March 2, 2021, date of current version March 22, 2021.

Digital Object Identifier 10.1109/ACCESS.2021.3063181

An Ontological Framework for Information Extraction From Diverse Scientific Sources

1 Center of Intelligent and Autonomous System, Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat 86400, Malaysia
2 Barani Institute of Sciences (Sahiwal), PMAS Arid Agriculture University, Rawalpindi 46000, Pakistan
3 Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
4 School of Information Technology, Deakin University, Geelong, VIC 3217, Australia

Corresponding authors: Hairulnizam Mahdin (hairuln@uthm.edu.my) and Jemal Abawajy (jemal.abawajy@deakin.edu.au)

This work was supported by the Ministry of Education Malaysia (MOE) through the Fundamental Research Grant Scheme for Research
Acculturation of Early Career Researchers (FRGS-Racer) under Grant RACER/1/2019/ICT04/UTHM/1 Vote: K154.

ABSTRACT Automatic information extraction from online published scientific documents is useful in
various applications such as tagging, web indexing and search engine optimization. As a result, automatic
information extraction has become among the hottest areas of research in text mining. Although various
information extraction techniques have been proposed in the literature, their efficiency demands domain
specific documents with static and well-defined format. Furthermore, their accuracy is challenged with a
slight modification in the format. To overcome these issues, a novel ontological framework for information
extraction (OFIE) using fuzzy rule-base (FRB) and word sense disambiguation (WSD) is proposed. The
proposed approach is validated on a significantly wider range of document domains sourced from well-known
publishing services such as IEEE, ACM, Elsevier, and Springer. We have also compared the proposed
information extraction approach against state-of-the-art techniques. The results of the experiment show that
the proposed approach is less sensitive to changes in the document format and has a significantly better
average accuracy of 89.14% and an F-score of 89%.

INDEX TERMS Information extraction, semi-structured scientific documents, fuzzy rule base, word sense
disambiguation, ontological framework.

The associate editor coordinating the review of this manuscript and approving it for publication was Davide Aloini.

I. INTRODUCTION
Scientific repositories maintained by research societies such as IEEE, ACM, Elsevier and Springer have
become an increasingly important tool for diverse stakeholders that include researchers, businesses,
research institutions, government agencies as well as funding agencies [1]. These scientific repositories
host millions of published documents that provide rich and useful information to the stakeholders [2].
For example, as of October 2019, the IEEE database contains 5 million documents [3]. Similarly, Elsevier
publishes more than 430,000 articles annually in 2,500 journals and its archives contain over 13 million
documents [4]. Also, Wiley Online Library has more than 4 million articles. They all contain some piece
of information that is needed by the research community and other interested parties. The published
articles are hosted in the form of structured and unstructured portable document format (PDF) files of
varying sizes. The structured PDF documents have all the necessary metadata, including the table of
contents and section information. However, the unstructured documents contain only basic metadata fields
such as the date-time stamp, file size and name, page numbers and fonts used. It is apparent that manually
retrieving information from such a huge volume of documents is nearly impossible [5]–[9]. Various
techniques from different fields have been proposed to automate information extraction from scientific
repositories [10]–[12]. These techniques include ontology-based, natural language processing (NLP),
machine learning (ML) and conditional random fields (CRF) based information extraction, as well as some
hybrid techniques [1], [2], [13], [14].

G. Zaman et al.: OFIE From Diverse Scientific Sources

However, automated content and metadata extraction from the scientific repositories has remained
challenging. Specifically, the huge volume and the varying formats of the documents pose major technical
challenges to efficiently extracting the desired information from the repositories. Even the search
engines are facing problems in indexing documents of such massive volume and varying format [14]. This
problem is getting worse as the volume of generated documents is increasing exponentially [15]–[17].
Moreover, the bulk of scientific documents hosted in the publishers' digital libraries [18], [19] are
mostly unstructured documents, which presents a considerable challenge to reliably and efficiently
extracting the required information from such repositories [20]–[22]. Although both information
extraction and metadata extraction are in general sensitive to variations in the document formats and
metadata fields, existing work does not consider this issue. Also, the extraction of structural
information from unstructured/semi-structured published scientific articles has received little
attention [7]–[9]. Therefore, there is a strong need to overcome these challenges and develop an
efficient mechanism for extracting information from published scientific documents.

In this paper, we propose an efficient ontology-based approach for structural information extraction
from scientific documents. The proposed approach uses a fuzzy rule-base (FRB) and word sense
disambiguation (WSD). The fuzzy regular expressions (with a built-in Levenshtein-distance measure) enable
the proposed approach to deal with structural variations and missing information (deleted, inserted,
modified). The WSD helps in fixing the extracted information using a semantic similarity measure along
with auto-correction of words, and generates the final stream. The proposed approach remains effective
across the type and amount of information being extracted and can handle various types of target
document, such as text, DOCX and XML. We have conducted comprehensive experiments using real data sets
from various scientific repositories and validated the proposed approach. We have also compared the
proposed approach with various baseline techniques. The contributions of this paper can be summarized as
follows:
• An adaptive and robust ontological framework for information extraction (OFIE) using a fuzzy rule-base
system (FRBS), plain and fuzzy regular expressions, and word sense disambiguation (WSD) [23] using the
word2vec approach [24] is proposed.
• An extensive experimental analysis of the proposed approach with real data from various repositories,
mainly IEEE, ACM, Springer and Elsevier and some others, with significant variations.
• We compare the proposed approach using quantitative (in terms of numbers and experimental outputs) and
qualitative (in terms of the features and dimensions of extraction fields) methods and show that the
proposed approach outperforms the baseline techniques on several performance metrics.

The rest of the paper is organized as follows. Section II discusses the literature review and Section III
contains the proposed approach. Section IV contains a discussion of the results. Section V concludes the
paper.

II. PROBLEM OVERVIEW AND LITERATURE REVIEW
In this section, we present an overview of the problem and a comprehensive literature review with
emphasis on the work related to the problem addressed in this paper.

A. PROBLEM OVERVIEW
Scientific repositories such as IEEE, Springer and ACM play an increasingly important role in modern
research. These repositories maintain a large number of documents on research outcomes. Researchers
produce massive amounts of information from their research outcomes, as the number of published scholarly
articles has increased between 8% and 9% annually [25]. This also implies that researchers consume
massive amounts of information from many different scientific sources. As modern research relies heavily
on already published research results, effective access to the huge number of scientific papers published
by different publishers is crucial [2]. However, the process of extracting information on a specific
subject from these documents is inefficient due to the massive volume of the documents and their
structural differences, which results in poor indexing and inefficient information retrieval over the
web [14], [26]. It is not easy to extract information from these documents effectively, and thus
automatically extracting content and metadata from scientific repositories remains a challenge.

Although various techniques have been proposed to address these challenges [15], [16], [27], there are
several gaps in the existing work. Existing solutions are mainly developed to address only a narrow domain
and with very specific rules that are applicable only to limited formats. Moreover, when the existing
techniques are applied to slightly modified document formats, their performance is considerably
compromised [17]. Moreover, an ontology-based dynamic information extraction framework for PDF and MS Word
documents that recognizes a wide variety of document resources published in the scientific community and
extracts the complete structural information from them has not been investigated so far. Although some of
the concepts have been partially investigated, a proper hybridization of more than one technique as a
framework for information extraction can be promising in terms of accuracy and scope [28], [29]. A recent
comprehensive literature review of information extraction techniques for unstructured and semi-structured
scientific resources is presented in [9]. The authors concluded that there is a dire need for a scheme
that can comprehend diverse formats of scientific documents from various society repositories.

These unsolved challenges have served as a prime motivation behind the current study. To fill this
research gap,


a new approach that can comprehensively and efficiently extract all pertinent information from the entire
spectrum of publications with precision is paramount.

B. RELATED WORK
A variety of techniques have been proposed to automate the extraction of information from documents in
scientific repositories. In this section, we present some of the most relevant and state-of-the-art
approaches.

A rule-based method for information extraction from scientific documents in the form of XML or simple
text is discussed in [13]. The authors built an ontology and utilized a rule-based approach after
crafting the rules by observing the given dataset in the documents. The empirical tests were performed on
XML files with the help of the PDFx online tool, and an accuracy of 77.5% was observed. The major
limitation of the technique was that it was designed for only one specific conference paper format.
Various methods for extracting information from XML documents [30]–[34] and from plain-text documents
[35]–[38] have also been developed. However, none of these approaches utilizes the patterns in both XML
and text formats to identify the desired information in published research articles.

In [39], the authors used the PDFBox tool to get the pure text and formatting values of the target
document. The authors further developed their own tool called PAXAT, which works on rich text features
(RTF) of documents taken from published articles from ACM, IEEE, Springer and arXiv. Examples are
formatting values like the line height, font type, font size and alignment of different metadata items
that include the title, author and affiliation (with help from a given template). It also works on
redundant text such as dates that appear in both header and footer. The technique, however, was confined
to extracting the paper title, authors, and their affiliations only.

An approach called CERMINE (Content ExtRactor and MINEr) is proposed in [28] to retrieve different parts
of an article. CERMINE implements different algorithms for extracting different parts of an article. For
example, the K-means algorithm is used for clustering of lines, the Docstrum algorithm is used for page
segmentation, and the Support Vector Machine (SVM) algorithm is used for classification. The CERMINE
approach was a PDF to Word conversion rather than just information extraction.

In [40], the authors investigated PDFBox, TET (Text Extraction Toolkit), PDF2TET and the TableSeer
algorithm. The authors presented a tabular ontology for extraction based on semantic relationships,
consisting of columns and rows that include cell, header, body and associated text regions. They used
predefined-layout, border-line, statistical and heuristic approaches; the rule-based approach plays an
important role in table data extraction. The technique was designed for tabular data extraction only.

In [41], the authors investigated PDFBox and the ParsCit algorithm. Basically, the ParsCit algorithm
performs reference string parsing, which is sometimes also called citation parsing or citation extraction
from a given set of references. Parsing means resolving a sentence into its components and describing
their roles according to the context. The technique is mainly focused on the extraction of citations.

In [42], the authors investigated conditional random fields (CRF) and hidden Markov model (HMM)
techniques, whose estimation is improved and trained using Particle Swarm Optimization (PSO). PSO searches
for the optimal value between CRF and HMM. The purpose of this technique was to generate the citation for
a given paper, including its digital object identifier (DOI), website, journal name, pages, date of
publication and volume etc., which are mainly available on the first page.

In [43], the authors use a CRF chunker because author and organization names are sometimes ambiguous, so
they build chunks of patterns for extraction. Similarly, the number of pages, days and volume number can
be ambiguous, and location names may be confused with organization names. The scheme extracts reference
metadata from the last pages of a research paper, including author, title, date, pages, location,
organization, journal, book title, publisher and website.

In [44], the authors used a data mining tool called RapidMiner and SVM for classification. The technique
extracts metadata from research articles, including the title, abstract, keywords, author name,
affiliation, email and address. The information was extracted from the first page of a research paper
only. Also, the scheme is limited to a dataset of papers from two departments only.

In [45], a CRF-based method for extracting citations from biomedical papers, including title, author,
source, year and volume, is discussed. However, the technique focused on limited fields. In [46], a
structural SVM classification technique is used to extract reference metadata, including citation number,
authors, title, journal, volume, year and page numbers, from MEDLINE. Experimental results showed that the
approach performed better than the normal SVM and CRF based techniques in terms of extraction accuracy.
In [47], the authors investigated the hidden Markov model (HMM) with the Viterbi algorithm, but the scheme
is limited to reference metadata extraction. In [48], the authors proposed a long short-term memory (LSTM)
based deep learning approach for extracting references from articles.

Table 1 summarizes the above state-of-the-art techniques for information extraction, including their
limitations, the approach used and dataset information. The observed technical limitations of the existing
schemes can be summarized as follows:
a. Effectiveness with respect to the type of information (simple, such as the title, or complex, such as
author affiliations).
b. Amount of information being extracted (less information means fewer issues, and vice versa; most
schemes focus only on limited information, such as that available on the first or last page of the paper).


TABLE 1. Summary of related work review.

c. Targeted corpus/structure (schemes perform well on documents with a coherent structure and fail on
diverse structures and/or on missing or modified information).
d. Variations in the input document format (text, Word, XML etc.). Most schemes target XML as the input
document type.
e. Post-correction of extracted information (most schemes are confined to the extracted information and do
not post-process or correct it).

Based on the discussion above and the comprehensive literature review, this study focuses on a novel
ontology-based framework for information extraction (OFIE) from scientific documents available in PDF and
DOC/DOCX formats, equipped with FRBS and WSD, and investigates its accuracy against existing techniques
based on empirical results.

Table 2 contrasts the proposed scheme and state-of-the-art schemes (selected for comparison in the results
section) in terms of:
• Information type (structural components) being extracted
• Type of data set
• Type of input documents (online published articles)
• Format of input documents (XML/DOC converted PDF documents)


TABLE 2. Schemes selected for comparison with information fields.


INFORMATION EXTRACTION (OFIE)
In this section, we present the proposed ontological framework for information extraction (OFIE) using a
fuzzy rule-base system (FRBS) and word sense disambiguation (WSD). In the proposed approach, a backend
ontology, based on which the entire IE framework is defined, works as the structural criterion for
information extraction. Also, fuzzy regular expressions are used as an extension of simple regular
expressions to address the variations in the extracted tokens. This makes the proposed scheme robust when
extracting information from documents whose formats differ from those considered in the examples during
the training phase. Furthermore, instead of working on just one type of document conversion, multiple
document conversions, namely XML, text and Word, are processed under a separate rule set for each
conversion. The rules carved for XML are not applicable to the text and doc conversions and vice versa.
That is why we use Word2Vec to achieve word sense disambiguation and improve accuracy in the conversion
process.

Figure 1 depicts the schematic of the proposed system architecture for information extraction from
scientific documents. For clarity, the architecture is divided into four phases. Each phase is described
in detail in the following subsections.

The scientific repositories currently maintain research articles in PDF format. The first step is to
convert the PDF to XML, plain text and DOC/DOCX formats. The main purpose of converting the source
documents (articles) into more than one format is to facilitate fine-grained information extraction from
the documents. Our analysis of many documents in PDF, XML, plain text and DOC/DOCX formats regarding ease
of information extraction led us to conclude that some parts of the information may be better extracted
from XML, plain text, and/or DOC/DOCX. For example, structural information such as title and author
details can be adequately extracted from the XML document. Similarly, figure and table details are better
extracted from text/docx documents than from XML.

Further, in this phase, removal of unnecessary details, data cleansing and other types of pre-processing
are performed. In this phase, different parts of the research paper such as title, authors and funding
agency/acknowledgement are identified using different rules (mainly in the form of regular expressions)
and the rest of the tokens are discarded. For example, in this case, we are not interested in the
publication year, journal ISSN, the paper's main text etc., so such information can be filtered out.
First, the given PDF document is converted


FIGURE 1. The proposed OFIE architecture.

to XML, plain text and DOC/DOCX formats. The purpose of converting the source documents (articles) into
more than one format is that each conversion has its own pros and cons, and the experiments showed that
some parts of the information may be better extracted from XML, plain text or DOCX. The formats complement
one another, so all of them are used. Mainly, XML provides better extraction support due to its tag-based
nature, DOC/DOCX provides rich text format (RTF) features such as fonts, headings and numbering for a
better understanding, and plain text better preserves running text such as the abstract and
acknowledgement.

B. STRUCTURAL INFORMATION EXTRACTION
In this phase, the structural information is extracted from the documents converted in Phase A. This task
is mainly carried out by carving the rules (patterns) in the form of regular expressions (REGEX) to detect
the desired token in the document text and extract the relevant information. However, a REGEX looks for an
exact match between the pattern and the text token. In case of a mismatch, the document crawler is unable
to extract the desired information. To overcome this issue, the REGEX rules are converted into fuzzy
regular expressions (FREGEX). A FREGEX looks for an approximate match between the pattern and the text
token. The approximation error is calculated using the Levenshtein distance, which can equate two strings
despite deletion, insertion, and substitution of characters in the text token [49]. The distance formula
calculates the error, which is roughly equal to the count of mistakes divided by the average of the
pattern length and the text-token length. Table 3 shows an example of fuzzy match


between pattern ‘‘abstract’’ and some erroneous text tokens containing deleted, inserted, and modified
characters or substrings [50].

TABLE 3. Error measure in FREJ.

1) FUZZY RULE BASED SYSTEM
Generally, the documents in the dataset repository have varying formats, as described in the dataset
subsection. By closely observing the structure of these documents (belonging to various societies), REGEX
are carefully carved and converted to equivalent FREGEX. In this regard, it is very important to figure
out the appropriate FREGEX (index) upon detecting the format. Similarly, several types of information are
being extracted, from title to bibliography. This is referred to as a structural index (SI) that specifies
whether the given text token is a title, abstract, keyword etc. The third parameter in this regard is
tolerance (T), which specifies the extent to which the distance between the pattern and the text token can
be tolerated. It is worth mentioning that high tolerance does not always mean that error is avoided. In
some cases, more tolerance can result in poor detection and/or accuracy.

Based on these experiments, a bank of FREGEX for XML and text/docx patterns against various formats is
created. To automate the process and to obtain the appropriate tolerance for a given society and SI, the
FRBS is designed in MATLAB. There are two input variables, namely society (Soc) and structural index (SI),
and one output variable, tolerance (T). This relationship is shown in Figure 2. This is mainly because
information extraction from different societies (IEEE, ACM etc.) exhibits different numbers of errors due
to variations in structural components (title, name etc.), as observed by experimentation.

FIGURE 2. Schematic of fuzzy system.

After several experiments on extracting information from several societies with several formats, the
following observations were made (heuristics).
1. For one society, the information for a certain component is extracted with no error or a single error.
2. For a second society, the information for the same component is extracted with a greater number of
errors.
3. After several experiments, these errors were obtained for each society and each of its structural
components and then averaged. The average error can be written mathematically as

$$E_{avg} = \frac{1}{S \cdot I} \sum_{s=1}^{S} \sum_{i=1}^{I} e_{s,i} \qquad (1)$$

where $S$ is the total number of societies, $I$ is the total number of structural indices in a paper and
$e_{s,i}$ is the error for a particular society $s$ and structural index $i$.
1) Heuristically, while investigating plain regular expressions, it was observed that the average error
($E_{avg}$) mainly falls in the range 0 to 4.
2) That error is used as the tolerance variable determined by the FRBS for the fuzzy regular expression
used for the given society and structural index.

To fine-tune the performance by tolerating the average error, an FRBS is designed to estimate the
tolerance used by the fuzzy regular expression. A sample rule can be expressed as:
IF (SOC = ‘index’ AND SI = ‘index’) THEN (T = V. High)

There are four main components of the FRBS, namely the fuzzifier, defuzzifier, inference engine and rule
base. Here, we have used a triangular fuzzifier, a center average defuzzifier (CAD) and a Mamdani
inference engine (MIE) [51].

C. WORD SENSE DISAMBIGUATION
As shown in Figure 1, this module is responsible for synthesizing and fine-tuning the information
extracted from the XML and text/docx converted versions in Phase B. Among the primary challenges in
structural information extraction is the information loss that occurs while converting the input document
from one format to another. Many PDF to text, XML and Word conversion libraries introduce errors during
conversion. These errors tend to affect the performance of the extraction task. To mitigate such errors,
WSD is a necessity and a value addition in the proposed technique, improving the accuracy of the overall
information extraction process. In the proposed scheme, WSD is performed using the word to vector
(word2vec) approach, a neural network based similarity model in which words or concepts are represented as
n-dimensional vectors in a huge vector space [24]. The model performs autocorrection of misspelled words,
word sense correction and sentence sense/sequence correction by augmenting the two streams extracted from
the same document converted to XML and text/docx formats. The process is shown in Figure 3. Upon receiving
both streams, it creates the order/semantic vector with the help of a lexical database and corpus.
Consequently, it computes word order/semantic similarity and, after applying the sentence similarity, it
generates the final synthesized output stream.
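
Returning to the fuzzy rule-based system of Section B, the following is a minimal sketch of how its three named components (triangular fuzzifier, Mamdani min inference, center average defuzzifier) could fit together. It is written in Python rather than the MATLAB used for the actual design, and the rule table, the numeric society and structural indices, and the output centres are all hypothetical values chosen for illustration; the paper does not list them.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical rule base: (society index, structural index, centre of the
# output tolerance set). Centres stay inside the 0-to-4 range reported for
# the average error; the individual values are invented for this sketch.
RULES = [
    (1, 1, 0.0),
    (1, 2, 1.0),
    (2, 1, 2.0),
    (2, 2, 4.0),
]


def tolerance(soc: float, si: float, rules=RULES) -> float:
    """Mamdani-style inference with a center average defuzzifier."""
    weighted_sum = 0.0
    total_strength = 0.0
    for soc_k, si_k, centre in rules:
        # Fuzzify each crisp input against a unit-width triangle on the index.
        mu_soc = triangular(soc, soc_k - 1, soc_k, soc_k + 1)
        mu_si = triangular(si, si_k - 1, si_k, si_k + 1)
        # AND is taken as the minimum (Mamdani inference).
        strength = min(mu_soc, mu_si)
        weighted_sum += strength * centre
        total_strength += strength
    # Center average: centres weighted by firing strength; 0.0 if no rule fires.
    return weighted_sum / total_strength if total_strength else 0.0
```

With integer indices exactly one rule fires and its centre is returned unchanged; inputs that fall between two indices blend the neighbouring rules, which is the behaviour that distinguishes a fuzzy rule base from a plain lookup table.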


FIGURE 4. Proposed Ontology of research Paper.

FIGURE 3. Word Sense disambiguation.


FIGURE 5. Prototype main screen.

TABLE 4. Implementation detail.

As depicted in Figure 1, this module stores, manages and retrieves all the structural information
extracted from the scientific documents. Ontologies are meant to comprehend and synthesize the information
in a particular field [25], for example authors, title, sub-title, sections, and other required
information. The ontology is designed/engineered in Protégé software. Figure 4 shows the proposed ontology
of the scientific documents. After successful extraction, all the information is sent to the RDF (Resource
Description Framework) generator block [25]. The block generates the subject, the predicate (property),
and the object. The relationship of the three is called a triple. All the triples are stored in a triple
store, where SPARQL queries can be applied to search the results. As an alternative approach, the
extracted information can be inserted into a relational database and searched/retrieved by SQL queries
instead. Moreover, it can be exported as comma separated values (CSV) and accessed in MS Excel.
Consequently, the ontology is mapped to the digital library for utilization of the extracted information
[18], [19].
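
The triple generation step above can be illustrated with a short sketch that turns a dictionary of extracted fields into subject, predicate and object statements serialized as N-Triples lines. The namespace `http://example.org/ofie/`, the function names and the field names are hypothetical and are not taken from the paper's ontology; only the standard library is used, so this is not the authors' RDF generator block.

```python
from urllib.parse import quote

# Hypothetical namespace for illustration only.
BASE = "http://example.org/ofie/"


def literal(value: str) -> str:
    """Quote a string as an N-Triples literal (minimal escaping)."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return f'"{escaped}"'


def to_triples(paper_id: str, fields: dict) -> list:
    """Emit one 'subject predicate object .' line per extracted value."""
    subject = f"<{BASE}paper/{quote(paper_id)}>"
    triples = []
    for name, values in fields.items():
        predicate = f"<{BASE}{quote(name)}>"
        if isinstance(values, str):
            values = [values]  # single-valued field such as the title
        for value in values:
            triples.append(f"{subject} {predicate} {literal(value)} .")
    return triples


if __name__ == "__main__":
    extracted = {"title": "A sample title", "author": ["A. One", "B. Two"]}
    for line in to_triples("p1", extracted):
        print(line)
```

Lines in this form can be loaded into a triple store and queried with SPARQL; the same dictionary could equally be written to a relational table or a CSV file, as the text notes.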


This section presents the implementation of the proposed scheme and an analysis of the results.
Performance metrics are discussed along with the analysis of the results. We also compared the proposed
work against the approaches proposed in [10], [13], [28]. The reason for selecting these techniques is
stated in the ‘‘Related Work’’ section and Table 2.

A. IMPLEMENTATION
We have implemented the proposed approach, and Table 4 shows the software packages and libraries used in
the implementation. Figure 5 shows the main screen of the prototype. The files are uploaded and all
necessary information such as author name, email, affiliation, and other contents is extracted. Generation
of text and docx from the PDF is also an automated process, but for clarity and differentiation, separate
buttons are provided in the prototype. The extracted information is organized in tabs: Title and Author,
Keywords, Headings, List of Figures (captions), List of Tables (captions), References and Acknowledgement.
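
As a minimal sketch of the fuzzy matching that the prototype relies on, the code below computes the Levenshtein distance and normalizes it as described in the structural extraction phase: the count of mistakes divided by the average of the pattern length and the token length. The function names and the use of a fractional `tolerance` threshold are assumptions made for this illustration; the fuzzy regular expression library actually used may score matches differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def match_error(pattern: str, token: str) -> float:
    """Mistakes divided by the average of the two lengths (case-insensitive)."""
    avg_len = (len(pattern) + len(token)) / 2
    if avg_len == 0:
        return 0.0
    return levenshtein(pattern.lower(), token.lower()) / avg_len


def fuzzy_match(pattern: str, token: str, tolerance: float) -> bool:
    """Accept the token when its normalized error is within the tolerance."""
    return match_error(pattern, token) <= tolerance
```

For instance, `fuzzy_match("abstract", "Abstrct", 0.2)` accepts the token because a single deleted character gives an error of 1 / 7.5, while an unrelated heading such as "keywords" is rejected at the same tolerance. Raising the tolerance admits more damaged tokens but also more false matches, which is why the tolerance is chosen per society and structural index.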


In the current view, the extracted Title and Author names are displayed. Here, fuzzy REGEX (regular
expression) augmented with word sense disambiguation helps in precisely identifying the author names. As
different societies write author names in different ways (first and last name; first, second and last
name; and so on), an approach was needed to disambiguate them. Further, it shows the set of extracted
keywords for the same published document. The fuzzy REGEX rules capture all variations of keyword styles
associated with various publishing societies. Different societies use different characters to separate
the keywords, such as comma, semi-colon or line separation. In the shown example, comma separation is
detected. Similarly, it shows the extracted abstract from the sample paper.

Next, all the extracted headings of the research paper, including main sections and sub-sections, are
depicted. This information can be used for generating a table of contents for the document for better
indexing, search, and retrieval. Here, the fuzzy regular expression was helpful in identifying the main
sections and sub-sections of any depth like section number

Further, it shows the extracted list of figure captions in the input scientific document. The documents
may contain an arbitrary number of figures and the proposed scheme can extract all the figure captions
precisely. Again, various societies have different ways of presenting figure captions. The proposed scheme
not only extracts the figure captions for the underlying four societies (ACM, IEEE, Elsevier, and
Springer), but other societies' papers can also be treated with significant accuracy. Here, fuzzy REGEX
can precisely unify tokens like ‘‘Figure’’, ‘‘Fig’’ etc. followed by a ‘.’, ‘:’ and/or space.

A similar approach is used for table captions (next tab). Similarly, it shows the extracted captions of
the tables in the paper. The table caption formats vary from society to society, for example the word
‘table’ or ‘tab’, single-line or multi-line caption text, and various table numbering styles. Likewise, it
shows the extracted acknowledgement section of the paper. It is normally an optional part in many
societies. It comes under different headings like ‘special thanks’, ‘funding agency’ and
‘acknowledgement’, and sometimes both, as in ‘funding agency and acknowledgement’. This section carries
the authors' acknowledgements to a person, an organization, and/or the funding agency etc. In case the
paper does not contain this information, a null string will be returned. The length of the acknowledgement
section can be arbitrary in this experiment. Most of the societies provide this section and then it is up
to the authors whether they utilize it or not.

B. EVALUATION METRICS
The performance is measured in terms of three metrics: precision (P), recall (R) and F-measure (F). The
same performance metrics have been used in previous works [10], [13], [28]. These performance metrics are
defined as follows [13], [28]:

$$R = \frac{TP}{TP + FN} \qquad (2)$$

$$P = \frac{TP}{TP + FP} \qquad (3)$$

$$F\text{-}measure = \frac{2PR}{P + R} \qquad (4)$$

where TP refers to the true positives, or the number of rows to which the scheme correctly assigned the
category it recognizes (Title, Authors etc.). When that category is incorrectly assigned, a false positive
(FP) is generated, while FN (false negatives) represents the number of rows for which the scheme was not
able to recognize the category it was constructed for.

For the evaluation of the proposed scheme, we used 500 published papers composed of journals, conferences,
etc. sourced from several scientific repositories. Table 5 shows the dataset along with its sources and
distribution. For instance, the variations of 12 ACM data sources are given in Figure 7. The dataset was
split 70% for training and 30% for testing. Also, the CORA dataset is used.

TABLE 5. Dataset used in the evaluation.

D. COMPARISON
In this section, the proposed approach is compared with similar approaches in the literature. Two types of
comparison are made: quantitative (in terms of numbers and experimental outputs) and qualitative (in terms
of the features and dimensions of extraction fields).

To make the comparison fair, we evaluated the proposed scheme and the other schemes using the datasets
used in [10],
not. Finally, it shows the extracted references from the. It [13], [28]. Table 6 shows the precision (in percentage) in
is well-known fact there are several ways for references. training and testing phases for various sections of the paper
We have addressed all the well-known as well as the cus- separately. At the end aggregate averages of all the sections
tomized reference styles in the FREJ, like APA, IEEE etc. is calculated which are 89.14% and 91.21%, respectively for
The scheme is robust against several standard reference styles testing and training phases, respectively.
used in the target societies as well as many others in the
literature. Figure 6 shows the extracted information from a 1) QUANTITATIVE COMPARISON (EMPIRICAL)
sample document in a summarized form, duly collected from The comparison is made in terms of precision, F-score, and
the tabs shown in Figure 5. recall, and it is given in Table 7. The proposed scheme is better

G. Zaman et al.: OFIE From Diverse Scientific Sources

FIGURE 6. Structural information extracted from an example article.

in terms of precision, recall and F-score. The scheme in [10] has a performance closer to that of the proposed scheme, but it is worth mentioning that it focuses only on tabular information extraction rather than the whole paper's information. Nonetheless, the schemes in [13] and CERMINE [28] are like the proposed one in terms of the type of information to be

42120 VOLUME 9, 2021


TABLE 6. Section-based precision in testing and training phases.

TABLE 8. Comparison with [28] for sub-sections.

FIGURE 8. Comparison in training phase.

FIGURE 7. ACM dataset sources.

TABLE 7. Comparison based on overall extraction results.

FIGURE 9. Comparison in testing phase.

extracted. It is apparent that the proposed scheme performs significantly better in terms of precision, recall and F-score compared to all the schemes.

Since the scheme in [28] provides granular-level precision values for components such as title and name, a detailed comparison is given in Table 8. The proposed scheme outperforms [28] for several sections, namely title, authors' names, authors' affiliation, authors' email, and keywords. However, for the references/bibliography section the scheme in [28] has 4.1% better accuracy. Moreover, sections such as headings, acknowledgement, figures, and tables are not covered in [28]. In addition, the authors of [28] intend to extract the entire article text, but for the purpose of comparison we consider only the common parts. Similarly, the entries marked as 'X', namely headings, acknowledgement, list of figures and list of tables, are not covered in [28]. Furthermore, [28] was compared on the same dataset.

Figure 8 and Figure 9 show the comparison of the schemes in the training and testing phases, respectively. They are based on the values given in Table 7. In Figure 9, the recall value of [10] is slightly better than that of the proposed scheme in the testing phase. However, in terms of precision and F-score, the proposed scheme outperforms all the other schemes in both the testing and training phases.
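The precision, recall and F-score values compared in this section follow Eqs. (2)-(4). As a minimal sketch of that calculation (not the authors' implementation), the snippet below computes the three metrics for one extraction category from row-level counts. The function name `extraction_metrics` and the example counts are hypothetical and serve only as an illustration.

```python
def extraction_metrics(tp: int, fp: int, fn: int) -> dict:
    """Recall, precision and F-measure for one category (e.g. Title).

    tp: rows correctly assigned the category
    fp: rows incorrectly assigned the category
    fn: rows of the category that the scheme failed to recognize
    """
    # Guard each ratio against a zero denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f_measure": f_measure}


# Hypothetical counts, not taken from the paper's tables.
scores = extraction_metrics(tp=90, fp=10, fn=10)
print(scores)
```

The zero-denominator guards keep the function usable for categories that never occur in a given test document, such as an optional acknowledgement section.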


TABLE 9. Comparison.

2) QUALITATIVE COMPARISON
This part contains the qualitative comparison of the proposed scheme with the previous schemes. The comparison is given in Table 9. From the comparison, it is apparent that the proposed scheme is superior to [10] in terms of obtaining the structural information from the document. Similarly, the proposed scheme is superior to the approaches proposed in [10] and [13] in terms of the target range and diversity of the document formats and the volume of information extracted. The scheme in [28] mainly focuses on metadata, the full text of the document, and a complete breakdown of the bibliography part (title, name, volume, publishing venue, pages, year, etc.). In contrast, the proposed scheme is mainly concerned with structural information (title, author, details, table of contents, references and list of figures and tables), which is more than just metadata. However, full-text extraction is not within the scope of the proposed scheme. That is why Table 8 contains a separate section-wise comparison with [28].

the approaches are working in a sequential way, so it is apparent that the former component's efficacy will strengthen the next component, and so on. For instance, if the fuzzy regular expression does not provide the due input to the WSD, the WSD will not be able to fine-tune the accuracy. Nonetheless, in a nutshell, it can be stated intuitively that the system's accuracy is mainly enriched by the fuzzy regular expressions, followed by the semantic and syntactic fine-tuning of the WSD.

This paper proposed an ontological framework for information extraction, repository and retrieval using a fuzzy rule base and word sense disambiguation. The fuzzy rule base transforms plain regular expressions into fuzzy regular expressions with a tunable tolerance in terms of insertion, deletion, and substitution errors in the pattern. Four research societies are targeted for the information extraction, though the proposed scheme also works well for other societies. Once the information is extracted, it is transformed into an RDF object and stored in a compatible RDF triple store for efficient retrieval. The proposed approach can be of great help in building digital libraries supported by an automatic ETL (extraction, transformation, and loading) process. The proposed scheme is promising in terms of computational complexity as well as accuracy. In future, the scheme may be extended to involve machine learning for automated information extraction that can encompass a wider range of societies and corpora.

It is worth investigating machine learning and evolutionary computing techniques, especially their hybrid counterparts, for information extraction under the proposed framework, especially in the domain of scientific research (Wiley, BMC, etc.). Further experiments that include scientific publications from other societies are part of our future work. The information extraction process, in practice as well as in the proposed approach, is assumed to be a backend/offline process in which several documents are downloaded and information is extracted. However, it is also important to consider the difficulty component of the solution for a digital library and information system in real time. In future, this factor may also be assessed for the existing as well as upcoming schemes.

ACKNOWLEDGMENT
The help of Maliha Omar, Mohib Ullah Khan and Umar Farooq is greatly appreciated.

REFERENCES

GOHAR ZAMAN is currently a Postgraduate Research Student with the Faculty of Computer Science and Information Technology (FSKTM), Universiti Tun Hussein Onn Malaysia. His research interests include information extraction, data mining, ontologies, NLP, and automatic text categorization.

currently an Associate Professor with the Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia. His current research interests include IoT and blockchain. He is a member of the Malaysia Board of Technologist (MBOT). He has been actively involved in many international conferences, serving in various capacities, including chair, general co-chair, vice-chair, best paper award chair, publication chair, session chair, and program committee. He has also guest edited many special issue journals.

KHALID HUSSAIN joined Academia, in 2008, as a full-time Faculty Member. He is currently working as a Professor and the Dean Faculty of Computing, Barani Institute of Sciences Sahiwal. He is also working as a Campus Director in Burewla Campus. He has vast university/industry experience. During his tenure in the industry, he served in defense-related projects and, in recognition of his services, he has been awarded commendation certificates by multiple government agencies. He has been involved in numerous research projects. He helped to set up a pioneering facility for information/network security certification in Pakistan. He also introduced EC Council certification under the first academia-industry partnership. He did his Ph.D. in Malaysia, under a fully funded UTM/HEC scholarship. He has published 63 articles, of which 27 are in ISI-indexed impact-factor journals, 13 are in HEC-approved journals and 23 are in IEEE and ACM conferences. He also has a book chapter and three books; the title Information Security Handbook is going to be published in a couple of months. He has successfully completed six applied research projects in the domain of information security funded by NESCOM. To date, 37 M.S. students have completed their research theses under his supervision. He is also supervising 13 M.S. and five Ph.D. students, of whom two Ph.D. students have completed their Ph.D. He received the Gold Medal for his contribution towards Information Security SATHA, in 2015.

ATTA-UR-RAHMAN received the B.S. degree in computer science from the University of the Punjab, Lahore, Pakistan, in 2004, the M.S. degree in EE from International Islamic University, Islamabad, Pakistan, in 2008, and the Ph.D. degree in EE from ISRA University, Islamabad, in 2012. He is currently working as an Assistant Professor with the College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University (IAU), Dammam, Saudi Arabia. Since 2003, he has been involved in teaching and research. He has authored/coauthored more than 100 publications in conferences, books, and journals of good reputation. His research interests include digital communication, DSP, information and coding theory, AI, and applied soft computing.

JEMAL ABAWAJY (Senior Member, IEEE) is currently a Full Professor with the Faculty of Science, Engineering, and Built Environment, Deakin University, Australia. His leadership is extensive, spanning industrial, academic, and professional areas. He is also the Director of the Distributing System Security (DSS). He is actively involved in funded research, supervising a large number of Ph.D. students, postdoctoral researchers, research assistants, and visiting scholars in the areas of cloud computing, big data, network and system security, decision support systems, and e-health. He is the author/coauthor of five books, ten conference volumes, and more than 250 refereed articles in conferences, book chapters, and journals. He is a Senior Member of the IEEE Technical Committee on Scalable Computing (TCSC), the IEEE Technical Committee on Dependable Computing and Fault Tolerance, and the IEEE Communication Society. He has been actively involved in the organization of more than 200 national and international conferences in various capacities, including the Chair, the General Co-Chair, the Vice-Chair, the Best Paper Award Chair, the Publication Chair, the Session Chair, and a Program Committee Member. He has served on the Editorial Board of numerous international journals.

SALAMA A. MOSTAFA received the B.Sc. degree in computer science from the University of Mosul, Iraq, in 2003, and the M.Sc. and Ph.D. degrees in information and communication technology from the Universiti Tenaga Nasional (UNITEN), Malaysia, in 2011 and 2016, respectively. He is currently a Lecturer with the Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM). His research interests include soft computing, data mining, software agents, and intelligent autonomous systems.

