An Ontological Framework for Information Extraction From Diverse Scientific Sources
4 School of Information Technology, Deakin University, Geelong, VIC 3217, Australia
ABSTRACT Automatic information extraction from online published scientific documents is useful in various applications such as tagging, web indexing, and search engine optimization. As a result, automatic information extraction has become one of the most active areas of research in text mining. Although various information extraction techniques have been proposed in the literature, their efficiency demands domain-specific documents with a static and well-defined format, and their accuracy degrades under even slight modifications to that format. To overcome these issues, a novel ontological framework for information extraction (OFIE) using a fuzzy rule base (FRB) and word sense disambiguation (WSD) is proposed. The proposed approach is validated on a significantly wider set of document domains sourced from well-known publishing services such as IEEE, ACM, Elsevier, and Springer. We have also compared the proposed information extraction approach against state-of-the-art techniques. The experimental results show that the proposed approach is less sensitive to changes in the document format and achieves a significantly better average accuracy of 89.14% and an F-score of 89%.

INDEX TERMS Information extraction, semi-structured scientific documents, fuzzy rule base, word sense disambiguation, ontological framework.
However, automated content and metadata extraction from scientific repositories has remained challenging. In particular, the huge volume and the varying formats of the documents pose major technical challenges to extracting the desired information efficiently. Even search engines face problems in indexing such a massive volume of documents in varying formats [14]. This problem is worsening as the volume of generated documents grows exponentially [15]–[17]. Moreover, the bulk of the scientific documents hosted in publishers' digital libraries [18], [19] are unstructured, which makes it a considerable challenge to reliably and efficiently extract the required information from such repositories [20]–[22]. Although, in general, both information extraction and metadata extraction are sensitive to variations in document formats and metadata fields, existing work does not consider this issue. Also, the extraction of structural information from unstructured and semi-structured published scientific articles has received little attention [7]–[9]. Therefore, there is a strong need to overcome these challenges and develop an efficient mechanism for extracting information from published scientific documents.

In this paper, we propose an efficient ontology-based approach for structural information extraction from scientific documents. The proposed approach uses a fuzzy rule base (FRB) and word sense disambiguation (WSD). The fuzzy regular expressions (with a built-in Levenshtein-distance measure) enable the approach to deal with structural variations and missing information (deleted, inserted, or modified). The WSD component fixes the extracted information using a semantic similarity measure along with auto-correction of words, and generates the final stream. The proposed approach is robust to the type and amount of information being extracted and takes various types of target document into consideration, such as text, DOCX, and XML. We have conducted comprehensive experiments using real data sets from various scientific repositories to validate the proposed approach, and we have also compared it with various baseline techniques. The contributions of this paper can be summarized as follows:
• An adaptive and robust ontological framework for information extraction (OFIE) using a fuzzy rule-based system (FRBS), plain and fuzzy regular expressions, and word sense disambiguation (WSD) [23] based on the word2vec approach [24].
• An extensive experimental analysis of the proposed approach with real data, exhibiting significant variations, from various repositories, mainly IEEE, ACM, Springer, and Elsevier, among others.
• A comparison of the proposed approach using quantitative (in terms of numbers and experimental outputs) and qualitative (in terms of the features and dimensions of the extraction fields) methods, showing that it outperforms existing techniques on several performance metrics.

The rest of the paper is organized as follows. Section 2 discusses the literature review, Section 3 presents the proposed approach, Section 4 discusses the results, and Section 5 concludes the paper.

II. PROBLEM OVERVIEW AND LITERATURE REVIEW
In this section, we present an overview of the problem and a comprehensive literature review with emphasis on the work related to the problem addressed in this paper.

A. PROBLEM OVERVIEW
Scientific repositories such as IEEE, Springer, and ACM play an increasingly important role in modern research. These repositories maintain large numbers of documents describing research outcomes. Researchers produce massive amounts of information, with the number of published scholarly articles increasing by 8% to 9% annually [25]. This also implies that researchers consume massive amounts of information from many different scientific sources. As modern research relies heavily on already published results, effective access to the huge body of papers published by the different publishers is crucial [2]. However, extracting information on a specific subject from these documents is inefficient due to their massive volume and their structural differences, which results in poor indexing and inefficient information retrieval over the web [14], [26]. It is not easy to extract information from these documents efficiently, and therefore the automatic retrieval of content and metadata from scientific repositories remains a challenge.

Although various techniques have been proposed to address these challenges [15], [16], [27], there are several gaps in the existing work. Existing solutions are mainly developed to address only a narrow domain, with very specific rules that are applicable to a limited set of formats. Moreover, when the existing techniques are tested on slightly modified document formats, their performance is considerably compromised [17]. Furthermore, an ontology-based dynamic information extraction framework for PDF and MS Word documents that recognizes a wide variety of document sources published in the scientific community and extracts the complete structural information from them has not been investigated so far. Although some of these concepts have been partially investigated, a proper hybridization of more than one technique into a single framework for information extraction can be promising in terms of accuracy and scope [28], [29]. A recent comprehensive literature review of information extraction techniques for unstructured and semi-structured scientific resources is given in [9]; the authors concluded that there is a dire need for a scheme that can comprehend the diverse formats of scientific documents from the various society repositories. These unsolved challenges have served as the prime motivation behind the current study. To fill these research gaps, a new approach that can comprehensively, efficiently, and precisely extract all pertinent information from the entire spectrum of publications is paramount.
B. RELATED WORK
A variety of techniques have been proposed to automate the extraction of information from documents in scientific repositories. In this section, we present some of the most relevant state-of-the-art approaches.

A rule-based method for information extraction from scientific documents in the form of XML or simple text is discussed in [13]. The authors built an ontology and applied a rule-based approach after crafting the rules by observing the given dataset. The empirical tests were performed on XML files produced with the PDFx online tool, and an accuracy of 77.5% was observed. The major limitation of the technique is that it was designed for papers in a single, specific conference format. Various methods for extracting information from XML documents [30]–[34] and from plain-text documents [35]–[38] have also been developed. However, none of these approaches exploits the patterns in both the XML and the text format to identify the desired information in published research articles.

In [39], the authors used the PDFBox tool to obtain the pure text and the formatting values of the target document. The authors further developed their own tool, called PAXAT, which works on the rich text features (RTF) of documents taken from published ACM, IEEE, Springer, and arXiv articles, for example formatting values such as line height, font type, font size, and the alignment of metadata items including the title, authors, and affiliations (with help from a given template). It also handles redundant text, such as dates that appear in both the header and the footer. The technique, however, was confined to extracting the paper title, the authors, and their affiliations only.

An approach called CERMINE (Content ExtRactor and MINEr) is proposed in [28] to retrieve the different parts of an article. CERMINE implements different algorithms for extracting the different parts: the K-means algorithm is used for clustering lines, the Docstrum algorithm for page segmentation, and the support vector machine (SVM) algorithm for classification. CERMINE performs PDF-to-Word conversion rather than just information extraction.

In [40], the authors investigated PDFBox, TET (Text Extraction Toolkit), PDF2TET, and the TableSeer algorithm. They presented a tabular ontology for extraction based on the semantic relationships of columns and rows, covering cells, headers, bodies, and associated text regions, and they combined predefined layout approaches, border lines, and statistical and heuristic approaches, with a rule-based approach playing an important role in the table data extraction. The technique was designed for tabular data extraction only.

In [41], the authors investigated PDFBox and the ParsCit algorithm. The ParsCit algorithm performs reference string parsing, sometimes also called citation parsing or citation extraction from a given set of references; parsing here means resolving a sentence into its components and describing their roles according to the context. The technique is mainly focused on the extraction of citations.

In [42], the authors investigated conditional random field (CRF) and hidden Markov model (HMM) techniques whose estimation is improved and trained using particle swarm optimization (PSO): PSO searches for the optimal value between the CRF and the HMM and finds the optimized answer. The purpose of this technique was to generate the citation of a given paper, including its digital object identifier (DOI), website, journal name, pages, date of publication, and volume, information that is mainly available on the first page.

In [43], the authors use a CRF chunker, because author and organization names sometimes create ambiguities; they therefore build chunks of patterns for extraction. Similarly, page counts, days, and volume numbers create ambiguity, and location names may be confused with organization names. The scheme extracts reference metadata from the last pages of research papers, including the author, title, date, pages, location, organization, journal, book title, publisher, and website.

In [44], the authors used the data mining tool RapidMiner and an SVM for classification. The technique extracts metadata from research articles including the title, abstract, keywords, author names, affiliations, emails, and addresses. The information was extracted from the first page of a research paper only, and the scheme is limited to a dataset of papers from two departments.

In [45], a CRF-based method for the extraction of citations from biomedical papers, covering the title, author, source, year, and volume, is discussed; the technique, however, focuses on a limited set of fields. In [46], a structural SVM classification technique is used to extract reference metadata, including the citation number, authors, title, journal, volume, year, and page numbers, from MEDLINE. Experimental results showed that the approach outperformed the plain SVM- and CRF-based techniques in terms of extraction accuracy. In [47], the authors investigated a hidden Markov model (HMM) with the Viterbi algorithm, but the scheme is limited to reference metadata extraction. In [48], the authors proposed a long short-term memory (LSTM) based deep learning approach for extracting references from articles.

Table 1 summarizes the above state-of-the-art techniques for information extraction, including their limitations, the approaches used, and dataset information. The observed technical limitations of the existing schemes can be summarized as follows:
a. Their effectiveness depends on the type of information being extracted (simple, like the title, or complex, like the authors' affiliations).
b. It also depends on the amount of information being extracted (less information, fewer issues, and vice versa); most schemes focus on limited information only, such as that available on the first or last page of the paper.
c. Schemes are tied to the targeted corpus/structure: they perform well on documents with a coherent structure and fail on diverse structures and/or missing/modified information.
d. They handle few variations in the input document format (text, Word, XML, etc.); most schemes target XML as the input document type.
e. They lack post-correction of the extracted information; most schemes are confined to the extracted information as-is and do not post-process or correct it.

Based on the discussion above and the comprehensive literature review, this study focuses on a novel ontology-based framework for information extraction (OFIE) from scientific documents available in PDF and DOC/DOCX formats, equipped with FRBS and WSD, whose accuracy is investigated against existing techniques on the basis of empirical results.

Table 2 contrasts the proposed scheme with the state-of-the-art schemes (selected for comparison in the results section) in terms of:
• the type of information (structural components) being extracted;
• the type of data set;
• the type of input documents (online published articles);
• the format of the input documents (XML/DOC converted PDF documents).
In Phase A, the source documents (articles) are converted to XML, plain text, and DOC/DOCX formats. The purpose of converting the source documents into more than one format is that each conversion has its own pros and cons, and the experiments show that a given piece of information may be better extracted from XML, from plain text, or from DOCX; to complement one another, all three formats are used. Mainly, XML provides better extraction support due to its tag-based nature; DOC/DOCX provides rich text format (RTF) features such as fonts, headings, and numbering for a better understanding of the layout; and the plain-text version better preserves running text such as the abstract and the acknowledgements.
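As a minimal sketch of this conversion step (the paper does not tie the phase to any particular library; pdfminer.six is assumed here purely as an illustrative stand-in for the PDF-to-text leg, with the XML and DOC/DOCX legs produced by analogous converters):

```python
# Illustrative sketch of Phase A. pdfminer.six stands in for the
# PDF-to-text conversion only; the XML and DOC/DOCX renderings would
# come from analogous converters in the full pipeline.
from pdfminer.high_level import extract_text

def convert(pdf_path: str) -> dict:
    """Return the working renderings of one article. Only the plain-text
    leg is shown; the "xml" and "docx" keys are produced elsewhere."""
    return {"text": extract_text(pdf_path)}

streams = convert("sample_paper.pdf")  # hypothetical input file
print(streams["text"][:200])
```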
B. STRUCTURAL INFORMATION EXTRACTION
In this phase, the structural information is extracted from the documents converted in Phase A. This task is mainly carried out by carving the rules (patterns) in the form of regular expressions (REGEX) that detect the desired token in the document text and extract the relevant information. However, a REGEX looks for an exact match between the pattern and the text token; in the case of a mismatch, the document crawler is unable to extract the desired information. To overcome this issue, the REGEX rules are converted into fuzzy regular expressions (FREGEX). A FREGEX looks for an approximate match between the pattern and the text token, where the approximation error is calculated by the Levenshtein distance, which equates two strings even in the case of deletion, insertion, and substitution of characters in the text token [49]. The resulting error is roughly equal to the number of mistakes divided by the average of the lengths of the pattern and of the text token. Table 3 shows an example of fuzzy matching between the pattern ''abstract'' and some erroneous text tokens containing deleted, inserted, and modified characters or substrings [50].

TABLE 3. Error measure in FREJ.
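To make the error measure concrete, the following minimal sketch computes the Levenshtein distance and the normalized error for some hypothetical erroneous tokens of the kind listed in Table 3; the production rules are written in FREJ [50], not in Python:

```python
# Concrete rendering of the error measure: edit distance divided by the
# average of the two lengths. The tokens below are hypothetical examples
# of the deletion/insertion/substitution cases illustrated in Table 3.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit-cost
    deletion, insertion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_error(pattern: str, token: str) -> float:
    """Approximate-match error: mistakes over the average length."""
    return levenshtein(pattern, token) / ((len(pattern) + len(token)) / 2)

for token in ("abstrct", "abbstract", "abstruct"):  # deleted / inserted / modified
    print(token, round(fuzzy_error("abstract", token), 3))
# A token matches the FREGEX when its error stays within the tolerance T.
```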
1) FUZZY RULE BASED SYSTEM
Generally, the documents in the dataset repository have varying formats, as described in the dataset subsection. By closely observing the structure of these documents (belonging to the various societies), REGEX are carefully carved and converted into equivalent FREGEX. In this regard, it is very important to figure out the appropriate FREGEX (index) upon detecting the format. Similarly, several types of information are extracted, from the title to the bibliography; the structural index (SI) specifies whether a given text token is the title, the abstract, a keyword, etc. The third parameter is the tolerance (T), which specifies the extent to which the distance between the pattern and the text token can be tolerated. It is worth mentioning here that a high tolerance does not always mean that errors are avoided; in some cases, more tolerance can result in poorer detection and/or accuracy.

Based on these experiments, a bank of FREGEX for the XML and text/docx patterns against the various formats is created. To automate the process and obtain the appropriate tolerance for a given society and SI, the FRBS is designed in MATLAB. There are two input variables, the society (Soc) and the structural index (SI), and one output variable, the tolerance (T); this relationship is shown in Figure 2. This design reflects the fact that information extraction from different societies (IEEE, ACM, etc.) exhibits different numbers of errors due to variations in the structural components (title, name, etc.), as observed by experimentation.

FIGURE 2. Schematic of fuzzy system.

After several experiments on extracting information from several societies with several formats, the following observations (heuristics) were made:
1. In one society, the information for a certain component was extracted without any error or with a single error.
2. In a second society, the information for the same component was extracted with a greater number of errors.
3. After several experiments, these errors were obtained against each society and its structural components and consequently averaged. The average error can be written mathematically as

$E_{avg} = \frac{1}{S \cdot I} \sum_{s=1}^{S} \sum_{i=1}^{I} e_{s,i}$  (1)

where S is the total number of societies, I is the total number of structural indices in a paper, and $e_{s,i}$ is the error for a particular society and structural index.

Two further observations guide the design of the FRBS:
1) Heuristically, while investigating plain regular expressions, it was observed that the average error ($E_{avg}$) mainly falls in the range 0 to 4.
2) That error is used as the tolerance variable, determined by the FRBS, for the fuzzy regular expression applied to the given society and structural index.
To fine-tune the performance by tolerating the average error, an FRBS is designed to estimate the exact tolerance to be used by the fuzzy regular expression. A sample rule can be expressed as:
IF (SOC = 'index' AND SI = 'index') THEN (T = V.High)
An FRBS has four main components: the fuzzifier, the defuzzifier, the inference engine, and the rule base. Here, we have used a triangular fuzzifier, a center average defuzzifier (CAD), and the Mamdani inference engine (MIE) [51].
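A minimal sketch of this arrangement is given below; the actual rule bank and membership functions were built in MATLAB and are not reproduced here, so the rule base, set centres, and half-width are illustrative assumptions:

```python
# Sketch of the tolerance FRBS. The consequent sets for T are triangular
# over the observed 0..4 error range; the centres and rules below are
# illustrative stand-ins, not the paper's actual design. With crisp
# (categorical) antecedents, Mamdani inference degenerates to firing
# whole rules, and the centre average defuzzifier reduces to the centre
# of the fired rule's consequent set.

def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Centres (peaks) of the triangular consequent sets for tolerance T.
T_CENTRES = {"low": 0.5, "medium": 1.5, "high": 2.5, "v_high": 3.5}

# Hypothetical rule base: (society, structural index) -> linguistic T,
# e.g. IF (SOC = 'ACM' AND SI = 'abstract') THEN (T = medium).
RULES = {
    ("IEEE", "abstract"): "low",
    ("ACM", "abstract"): "medium",
    ("Springer", "references"): "high",
    ("Elsevier", "authors"): "v_high",
}

def tolerance(society: str, si: str, default: str = "medium") -> float:
    """Centre-average defuzzification over the fired rules; with one
    crisp match this is the centre of that rule's consequent."""
    return T_CENTRES[RULES.get((society, si), default)]

t = tolerance("ACM", "abstract")      # -> 1.5
err = 1.2                             # hypothetical measured average error
print(t, tri(err, t - 1.0, t, t + 1.0))  # centre, membership of the error
```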
C. WORD SENSE DISAMBIGUATION
As shown in Figure 1, this module is responsible for synthesizing and fine-tuning the information extracted in Phase B from the XML and text/docx converted versions. Among the primary challenges in structural information extraction is the information loss that occurs while converting the input document from one format to another: many PDF-to-text, -XML, and -Word conversion libraries introduce errors during the conversion phase, and these errors tend to degrade the performance of the extraction task. To mitigate such errors, WSD is a necessity and a value addition in the proposed technique, improving the accuracy of the overall information extraction process. In the proposed scheme, WSD is performed using the word-to-vector (word2vec) approach, a neural-network-based similarity model in which words or concepts are represented as n-dimensional vectors in a huge vector space [24]. The model performs auto-correction of misspelled words, word-sense correction, and sentence sense/sequence correction by augmenting the two streams extracted from the same document converted into the XML and text/docx formats. The process is shown in Figure 3: upon receiving both streams, the module creates the word-order/semantic vectors with the help of a lexical database and a corpus; it then computes word-order and semantic similarity and, after applying sentence similarity, generates the final synthesized output stream.
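The following minimal sketch illustrates the stream-synthesis idea, assuming gensim's word2vec implementation and a toy corpus; the actual training corpus, model parameters, and the paper's full augmentation logic are not reproduced here:

```python
# Sketch: where the XML-derived and text-derived streams disagree on a
# token, keep the variant whose word2vec embedding better fits the
# surrounding context. gensim is used as an illustrative word2vec
# implementation; corpus and parameters are toy values.
from gensim.models import Word2Vec

corpus = [["information", "extraction", "from", "scientific", "documents"],
          ["abstract", "keywords", "authors", "affiliation"]]
model = Word2Vec(corpus, vector_size=50, min_count=1, window=3)

def pick_token(xml_tok: str, txt_tok: str, context: list[str]) -> str:
    """Score each candidate by mean similarity to in-vocabulary context
    words; out-of-vocabulary candidates (likely conversion damage) lose."""
    def score(tok: str) -> float:
        if tok not in model.wv:
            return -1.0
        sims = [model.wv.similarity(tok, c) for c in context if c in model.wv]
        return sum(sims) / len(sims) if sims else 0.0
    return xml_tok if score(xml_tok) >= score(txt_tok) else txt_tok

# "extractlon" mimics a conversion error in one stream:
print(pick_token("extraction", "extractlon", ["information", "scientific"]))
```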
The extracted information is then passed to a Resource Description Framework (RDF) generator block [25]. The block generates the subject, the predicate (property), and the object; the relationship of the three is called a triple. All the triples are stored in a triple store, where SPARQL queries can be applied to search the results. As an alternative approach, the extracted information can be inserted into a relational database and searched/retrieved with SQL queries instead; moreover, it can be exported as comma-separated values (CSV) and accessed in MS Excel. Consequently, the ontology is mapped to the digital library for utilization of the extracted information [18], [19].

TABLE 4. Implementation detail.
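A minimal sketch of the triple-generation and SPARQL-query steps, assuming the rdflib library and an illustrative namespace (the actual ontology vocabulary is not reproduced here):

```python
# Sketch: extracted fields become subject-predicate-object triples in a
# store that SPARQL can query. The namespace and property names are
# illustrative, not the paper's actual ontology vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/ofie/")
g = Graph()
paper = URIRef(EX["paper1"])
g.add((paper, EX.title, Literal("An Ontological Framework for Information Extraction")))
g.add((paper, EX.author, Literal("G. Zaman")))

# SPARQL over the triple store: fetch every (paper, title) pair.
q = """
PREFIX ex: <http://example.org/ofie/>
SELECT ?s ?title WHERE { ?s ex:title ?title . }
"""
for row in g.query(q):
    print(row.s, row.title)
```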
In the current view, the extracted title and author names are displayed. Here, the fuzzy REGEX (regular expressions), augmented with word sense disambiguation, help in precisely identifying the author names. As different societies write author names in different ways (first and last name; first, second, and last name; and so on), an approach was direly needed to disambiguate them. The view further shows the set of keywords extracted from the same published document. The fuzzy REGEX rules capture all the variations of keyword style associated with the various publishing societies, which use different separator characters such as commas, semi-colons, or line breaks; in the example shown, comma separation is detected. Similarly, the view shows the abstract extracted from the sample paper.

Next, all the extracted headings of the research paper, including main and sub-sections, are depicted. This information can be used to generate a table of contents for the document, enabling better indexing, search, and retrieval. Here, the fuzzy regular expressions were helpful in identifying main and sub-sections of any depth, such as section number 2.1.2.5.6. The view also shows the extracted list of figure captions in the input scientific document. A document may contain an arbitrary number of figures, and the proposed scheme can extract all the figure captions precisely. Again, the various societies present figure captions in different ways; the proposed scheme not only extracts the figure captions for the four underlying societies (ACM, IEEE, Elsevier, and Springer), but papers from other societies can also be treated with significant accuracy. Here, the fuzzy REGEX can precisely unify tokens such as ''Figure'' or ''Fig'' followed by a '.', a ':', and/or a space.

A similar approach is used for the table captions (next tab): the view shows the extracted captions of the tables in the paper. Table caption formats vary from society to society, in the keyword used ('table' or 'tab'), in single-line versus multi-line caption text, and in the table numbering styles. Likewise, the view shows the extracted acknowledgement section of the paper, which is normally an optional part in many societies. It appears under different headings such as 'special thanks', 'funding agency', and 'acknowledgement', and sometimes 'funding agency and acknowledgement' together. This section carries the authors' acknowledgements to a person, an organization, and/or the funding agency; in case the paper does not contain this information, a null string is returned. The length of the acknowledgement section can be arbitrary in this experiment; most of the societies provide this section, and it is then up to the authors whether they use it or not. Finally, the view shows the references extracted from the paper. It is a well-known fact that there are several referencing styles; we have addressed the well-known as well as the customized reference styles in FREJ, such as APA, IEEE, etc., and the scheme is robust against several standard reference styles used in the target societies as well as many others in the literature. Figure 6 shows the extracted information from a sample document in a summarized form, duly collected from the tabs shown in Figure 5.
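The figure-caption rule described above can be illustrated with its plain-REGEX counterpart; the deployed rules are the fuzzy (FREJ-style) versions of such patterns, so this sketch shows only the crisp case:

```python
# Plain-REGEX counterpart of the caption rule: unify "Figure"/"Fig"
# followed by '.', ':' and/or a space, then the number and caption text.
import re

CAPTION = re.compile(r"\b(?:Figure|Fig)\s*[.:]?\s*(\d+)\s*[.:]?\s*(.+)",
                     re.IGNORECASE)

for line in ("Figure 3. Overall architecture of OFIE.",
             "Fig. 3: Overall architecture of OFIE.",
             "FIGURE 3 Overall architecture of OFIE."):
    m = CAPTION.search(line)
    if m:
        print(m.group(1), "->", m.group(2))
```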
B. EVALUATION METRICS
The performance is measured in terms of three metrics: precision (P), recall (R), and F-measure (F). The same performance metrics have been used in previous works [10], [13], [28]. They are defined as follows [13], [28]:

$R = \frac{TP}{TP + FN}$  (2)

$P = \frac{TP}{TP + FP}$  (3)

$F\text{-}measure = \frac{2PR}{P + R}$  (4)

where TP refers to the true positives, i.e., the number of rows to which the scheme correctly assigned the category it recognizes (title, authors, etc.). When such a category is incorrectly assigned, a false positive (FP) is generated, while the false negatives (FN) represent the number of rows for which the scheme was not able to recognize the category it was constructed for.
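Equations (2)-(4) translate directly into code; the counts in the example below are hypothetical:

```python
# Direct transcription of Eqs. (2)-(4): per-category precision, recall,
# and F-measure from true/false positive and false negative counts.
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts for one category (e.g., "Title"):
print(prf(tp=90, fp=10, fn=12))
```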
C. DATASET
For the evaluation of the proposed scheme, we used 500 published papers (journal articles, conference papers, etc.) sourced from several scientific repositories. Table 5 shows the dataset along with the various sources and their distribution; for instance, the variations among the 12 ACM data sources are given in Figure 7. The data set was divided into training and testing parts with a 70%/30% split. The CORA dataset is also used.

TABLE 5. Dataset used in the evaluation.

D. COMPARISON
In this section, the proposed approach is compared with similar approaches in the literature. Two types of comparisons are made: quantitative (in terms of numbers and experimental outputs) and qualitative (in terms of the features and dimensions of the extraction fields).

To make the comparison fair, we evaluated the proposed scheme and the other schemes using the datasets used in [10], [13], [28]. Table 6 shows the precision (in percentage) in the training and testing phases for the various sections of the paper separately. At the end, the aggregate averages over all the sections are calculated, which are 89.14% and 91.21% for the testing and training phases, respectively.

TABLE 6. Section based precision in testing and training phases.

1) QUANTITATIVE COMPARISON (EMPIRICAL)
The comparison is made in terms of precision, F-score, and recall, and is given in Table 7. The proposed scheme is better in terms of precision, recall, and F-score. The scheme in [10] has the performance closest to the proposed scheme, but it is worth mentioning that it focuses only on tabular information extraction rather than the whole paper's information. Nonetheless, the schemes in [13] and CERMINE [28] are like the proposed one in terms of the type of information to be extracted.

TABLE 8. Comparison with [28] for sub-sections.
[6] X. Ma, H. Qin, N. Sulaiman, T. Herawan, and J. H. Abawajy, "The parameter reduction of the interval-valued fuzzy soft sets and its related algorithms," IEEE Trans. Fuzzy Syst., vol. 22, no. 1, pp. 57–71, Feb. 2014.
[7] J. Azimjonov and J. Alikhanov, "Rule based metadata extraction framework from academic articles," 2018, arXiv:1807.09009. [Online]. Available: http://arxiv.org/abs/1807.09009
[8] M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp, "Evaluation of header metadata extraction approaches and tools for scientific PDF documents," in Proc. 13th ACM/IEEE-CS Joint Conf. Digit. Libraries (JCDL), 2013, pp. 385–386.
[9] G. Zaman, H. Mahdin, and K. Hussain, "Information extraction from semi and unstructured data sources: A systematic literature review," ICIC Express Lett., vol. 14, no. 6, pp. 593–603, Jun. 2020.
[10] S. T. R. Rizvi, D. Mercier, S. Agne, S. Erkel, A. Dengel, and S. Ahmed, "Ontology-based information extraction from technical documents," in Proc. 10th Int. Conf. Agents Artif. Intell., 2018, pp. 493–500.
[11] Atta-ur-Rahman, I. M. Qureshi, A. N. Malik, and M. T. Naseem, "QoS and rate enhancement in DVB-S2 using fuzzy rule based system," J. Intell. Fuzzy Syst., vol. 30, no. 2, pp. 801–810, Feb. 2016.
[12] Atta-ur-Rahman, I. M. Qureshi, A. N. Malik, and M. T. Naseem, "Dynamic resource allocation in OFDM systems using DE and FRBS," J. Intell. Fuzzy Syst., vol. 26, no. 4, pp. 2035–2046, 2014.
[13] R. Ahmad, M. T. Afzal, and M. A. Qadir, "Information extraction from PDF sources based on rule-based system using integrated formats," in Semantic Web Challenges, SemWebEval (Communications in Computer and Information Science), vol. 641, H. Sack, S. Dietze, A. Tordai, and C. Lange, Eds. Cham, Switzerland: Springer, 2016. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-46565-4_23, doi: 10.1007/978-3-319-46565-4_23.
[14] K. Jayaram and K. Sangeeta, "A review: Information extraction techniques from research papers," in Proc. Int. Conf. Innov. Mech. Ind. Appl. (ICIMIA), Feb. 2017, pp. 56–59.
[15] Z. Bodó and L. Csató, "A hybrid approach for scholarly information extraction," Studia Univ. Babes-Bolyai, Inform., vol. 62, no. 2, pp. 5–16, 2017.
[16] P. Groth, M. Lauruhn, A. Scerri, and R. Daniel, "Open information extraction on scientific text: An evaluation," 2018, arXiv:1802.05574. [Online]. Available: http://arxiv.org/abs/1802.05574
[17] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches to Analyzing Unstructured Data, vol. 34, no. 1. Cambridge, U.K.: Cambridge Univ. Press, 2007, pp. xii and 410.
[18] M. Ahmad and J. H. Abawajy, "Digital library service quality assessment model," Procedia Social Behav. Sci., vol. 129, pp. 571–580, May 2014.
[19] M. Safar, "Digital library of online PDF sources: An ETL approach," IJCSNS, vol. 20, no. 11, p. 173, 2020.
[20] M.-T. Luong, T. D. Nguyen, and M.-Y. Kan, "Logical structure recovery in scholarly articles with rich document features," in Multimedia Storage and Retrieval Innovations for Digital Library Systems. Hershey, PA, USA: IGI Global, 2012, pp. 270–292.
[21] H. H. N. Do, M. K. Chandrasekaran, P. S. Cho, and M. Y. Kan, "Extracting and matching authors and affiliations in scholarly documents," in Proc. 13th ACM/IEEE-CS Joint Conf. Digit. Libraries (JCDL), 2013, pp. 219–228.
[22] S. Kim, Y. Cho, and K. Ahn, "Semi-automatic metadata extraction from scientific journal article for full-text XML conversion," in Proc. Int. Conf. Data Sci. (ICDATA), 2014, p. 1.
[23] A. Abd-Rashid, S. Abdul-Rahman, N. N. Yusof, and A. Mohamed, "Word sense disambiguation using fuzzy semantic-based string similarity model," Malaysian J. Comput., vol. 3, no. 2, pp. 154–161, 2018.
[24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[25] D. Allemang and J. Hendler, Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL. Amsterdam, The Netherlands: Elsevier, 2011.
[26] J. Chen, C. Zhang, and Z. Niu, "A two-step resume information extraction algorithm," Math. Problems Eng., vol. 2018, pp. 1–8, May 2018, doi: 10.1155/2018/5761287.
[27] R. Shah and S. Jain, "Ontology-based information extraction: An overview and a study of different approaches," Int. J. Comput. Appl., vol. 87, no. 4, pp. 6–8, Feb. 2014, doi: 10.5120/15194-3574.
[28] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, and Ł. Bolikowski, "CERMINE: Automatic extraction of structured metadata from scientific literature," Int. J. Document Anal. Recognit., vol. 18, no. 4, pp. 317–335, Dec. 2015.
[29] T. M. Dieb, M. Yoshioka, S. Hara, and M. C. Newton, "Framework for automatic information extraction from research papers on nanocrystal devices," Beilstein J. Nanotechnol., vol. 6, no. 1, pp. 1872–1882, 2015.
[30] X. Li, "The comparison of QlikView and tableau: A theoretical approach combined with practical experiencies," M.S. thesis, Dept. Bus. Econ., Masters Manage., Univ. Hasselt, Hasselt, Belgium, 2014.
[31] M.-S. Chen, J. Han, and P. S. Yu, "Data mining: An overview from a database perspective," IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 866–883, Dec. 1996.
[32] S. Jebbara and P. Cimiano, "Aspect-based sentiment analysis using a two-step neural network architecture," in Semantic Web Challenges, SemWebEval (Communications in Computer and Information Science), vol. 641, H. Sack, S. Dietze, A. Tordai, and C. Lange, Eds. Cham, Switzerland: Springer, 2016. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-46565-4_12, doi: 10.1007/978-3-319-46565-4_12.
[33] S.-T. Kousta, D. P. Vinson, and G. Vigliocco, "Emotion words, regardless of polarity, have a processing advantage over neutral words," Cognition, vol. 112, no. 3, pp. 473–481, Sep. 2009.
[34] S. Sun, G. Kong, and C. Zhao, "Polarity words distance-weight count for opinion analysis of online news comments," Procedia Eng., vol. 15, pp. 1916–1920, Dec. 2011. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877705811018583?via%3Dihub
[35] A. Agarwal, F. Biadsy, and K. R. Mckeown, "Contextual phrase-level polarity analysis using lexical affect scoring and syntactic N-grams," in Proc. 12th Conf. Eur. Chapter Assoc. Comput. Linguistics (EACL), 2009, pp. 24–32.
[36] A. C. E. S. Lima, L. N. D. Castro, and J. M. Corchado, "A polarity analysis framework for Twitter messages," Appl. Math. Comput., vol. 270, pp. 756–767, Nov. 2015.
[37] M. Hu and B. Liu, "Mining and summarizing customer reviews," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2004, pp. 168–177.
[38] A. Krouska, C. Troussas, and M. Virvou, "The effect of preprocessing techniques on Twitter sentiment analysis," in Proc. 7th Int. Conf. Inf., Intell., Syst. Appl. (IISA), Jul. 2016, pp. 1–5.
[39] C. Jiang, J. Liu, D. Ou, Y. Wang, and L. Yu, "Implicit semantics based metadata extraction and matching of scholarly documents," J. Database Manage., vol. 29, no. 2, pp. 1–22, Apr. 2018.
[40] S. Khusro, A. Latif, and I. Ullah, "On methods and tools of table detection, extraction and annotation in PDF documents," J. Inf. Sci., vol. 41, no. 1, pp. 41–57, Feb. 2015.
[41] R. Kern and S. Klampfl, "Extraction of references using layout and formatting information from scientific articles," D-Lib Mag., vol. 19, no. 9/10, Sep./Oct. 2013. [Online]. Available: http://www.dlib.org/dlib/september13/kern/09kern.html, doi: 10.1045/september2013-kern.
[42] Z. Shuxin, X. Zhonghong, and C. Yuehong, "Information extraction from research papers based on conditional random field model," Telkomnika Indonesian J. Electr. Eng., vol. 11, no. 3, pp. 1213–1220, Mar. 2013.
[43] T. Groza, A. Grimnes, and S. Handschuh, "Reference information extraction and processing using random conditional fields," Inf. Technol. Library, vol. 31, no. 2, pp. 6–20, 2012.
[44] A. Kovačević, D. Ivanović, B. Milosavljević, Z. Konjović, and D. Surla, "Automatic extraction of metadata from scientific publications for CRIS systems," Program, vol. 45, no. 4, pp. 376–396, Sep. 2011.
[45] Q. Zhang, Y.-G. Cao, and H. Yu, "Parsing citations in biomedical articles using conditional random fields," Comput. Biol. Med., vol. 41, no. 4, pp. 190–194, Apr. 2011.
[46] X. Zhang, J. Zou, D. X. Le, and G. R. Thoma, "A structural SVM approach for reference parsing," in Proc. 9th Int. Conf. Mach. Learn. Appl., Dec. 2010, pp. 479–484.
[47] B. Ojokoh, M. Zhang, and J. Tang, "A trigram hidden Markov model for metadata extraction from heterogeneous references," Inf. Sci., vol. 181, no. 9, pp. 1538–1551, May 2011.
[48] A. Prasad, M. Kaur, and M.-Y. Kan, "Neural ParsCit: A deep learning-based reference string parser," Int. J. Digit. Libraries, vol. 19, no. 4, pp. 323–337, Nov. 2018.
[49] K. U. Schulz and S. Mihov, "Fast string correction with Levenshtein automata," Int. J. Document Anal. Recognit., vol. 5, no. 1, pp. 67–85, Nov. 2002.
[50] Fuzzy Regular Expressions for Java (FREJ). Accessed: Dec. 25, 2019. [Online]. Available: http://frej.sourceforge.net/javadocs/index.html
[51] Atta-ur-Rahman, S. Dash, A. K. Luhach, N. Chilamkurti, S. Baek, and Y. Nam, "A neuro-fuzzy approach for user behaviour classification and prediction," J. Cloud Comput., vol. 8, no. 1, p. 17, Dec. 2019.