Predatory Journal Classification Using Machine Learning (2020 ICKII)
4 authors, including Chia-Hung Liao and Shyan-Ming Yuan (National Yang Ming Chiao Tung University)
development. For scholars to publish papers more effectively and to avoid for-profit predatory publishers, this research used a machine learning method to identify predatory journals. Features such as the text content and keywords of the collected journals' websites were extracted from mainstream predatory journal websites and normal journal websites. This research proposed a predatory journal classification system based on a new model. The results show that our model's recall rate exceeds 90%, helping ensure that the journals researchers submit to are not predatory.

Keywords: predatory journals, classifier, machine learning

Introduction

Predatory journals not only corrupt open access but also undermine scientific research and academic development in the long term [1]. Through academic journal publication, scholars grasp disciplinary development trends and the latest research topics in the shortest time. They also communicate with other researchers by reviewing manuscripts and providing valuable comments. However, the operation and editorial quality of academic journals require substantial manpower and funds to maintain. These costs make academic journal subscriptions increase year by year, and journals have gradually become monopolized by large publishers. Since the 2000s, open access (OA) journals have encouraged scholars to publish their research papers by distributing them online for free [2]. Early open access publishing changed the way academic publishing works and reduced costs, expanding access to the latest research across the global academic community [1]. With the popularization of open access, journal fees have shifted from readers to authors. Some unscrupulous publishers exploit this model to deceive scholars who urgently need to publish papers. Because predatory journals lack rigorous peer review, the quality of published articles gradually declines. Additionally, as more and more predatory journals appear, many scholars' papers are plagiarized and republished in other journals, causing original, valuable papers to be regarded as worthless.

Resources that help scholars distinguish predatory journals from normal ones include Beall's list and the Directory of Open Access Journals (DOAJ) [3]. Beall's list provides a list of probable predatory journals together with the criteria used to inspect them. The criteria include accepting questionable scientific papers, conducting false peer reviews, and charging high publication fees without providing reasonable editing services [1,4]. DOAJ contains 14,299 journals that follow the principles of open access, and scholars can consult these lists to check whether a journal to be submitted to is predatory.

This study uses machine learning classification to design a convenient method for predatory journal classification. Although some websites provide blacklists for people to check whether a journal is predatory, predatory journals change their names or URLs to evade the blacklists. Machine learning classification is an effective supervised method in which a computer model learns from training data and makes new classifications. First, data mining was used to extract the content of journal websites. Second, TF-IDF identified the keywords distinguishing normal journals from predatory journals, and the bag-of-words (BoW) method was used to represent the content of the journal websites [5]. Finally, this research built several models that distinguish predatory journal websites from legitimate ones using machine learning methods.

Related Work

Predatory journal websites present convincing imitations of the policies and practices of normal journals. The websites convince scholars of their peer review processes, a seemingly academic editorial board, and a high impact factor [1,6]. Predatory journals use similar journal titles, such as "International Journal of Clinical and Experimental Ophthalmology," and pseudo scholar information [4]. Librarians pay attention to predatory publishing, and researchers should avoid submitting papers to predatory journals. Beall was the first librarian to describe predatory journals; he developed "Beall's List" based on 52 criteria he established for researchers to consult [1, 7]. The list helps scholars identify and share information about publishers that are deceptive, lack transparency, or do not follow legitimate publishing standards.

Some studies conduct semantic analysis of website content, because predatory journal websites or publishers use specific descriptions to attract scholars' attention. The Linguistic Inquiry and Word Count (LIWC) system was used to classify text into 81 categories covering cognitive and emotional psychology [6]. Statistical classification with LIWC analyzed how well these language features predict whether a piece of text is predatory or normal. The results showed that predatory publishers' websites used not only more auxiliary verbs (e.g., "should," "would") and positive emotions (e.g., "congratulation"), but also fewer function words such as quantifiers, articles, and adverbial conjunctions. Other researchers compared the blacklists of Beall's list and Cabell's list with whitelists such as DOAJ, the Open Access Scholarly Publishers Association (OASPA), and PubMed.
Then, they investigated features of predatory journals such as the journal title, publisher address, and contact email addresses [8]. For example, predatory journals use similar journal titles to deceive scholars, provide forged addresses in the U.S. or U.K., and use open, free email services such as Gmail and Yahoo. The content and semantics of predatory journal websites are therefore worth analyzing for signs that a journal is predatory. However, these features are not absolute criteria.

Classification in machine learning builds a model from known data and its label attributes [9-11]. With this model, we can understand the features of data belonging to various labels and predict which label new data belongs to. A well-known method of detecting phishing uses feature representations of the website and machine learning to classify the website [12]. Phishing activities imitate legitimate science journal websites (e.g., Wulfenia and Archives des Sciences) to cheat researchers into paying author fees. The researchers used the TF-IDF algorithm to characterize the URLs and the website content. They applied various classification algorithms, such as naive Bayes (NB), support vector machine (SVM), and random forest (RF), to detect whether websites are phishing. The classifier system achieved 98.8% accuracy in detecting phishing. Another study used heuristics-based and text-based feature representations to classify predatory journals [13]; it found that text-based features were easier to extract than heuristics-based ones. Based on these studies, we considered that the content of predatory journal websites has characteristics similar to phishing websites. In this study, we used mainstream machine learning methods to classify predatory journal websites.

Methodology

A. Data Preprocessing
This research collected the uniform resource locators (URLs) of normal journal and predatory journal websites. The blacklists (predatory journals) were taken from credible lists such as 'Beall's list' and 'Stop Predatory Journals' [7, 14]. Through long-term monitoring, review by librarians worldwide, and different certification criteria, these were confirmed as potential predatory journals. The whitelist, collected from BIH QUEST, contains normal open access journals listed on DOAJ and PubMed Central [3].

Web crawling was used to get the links of predatory and normal journals [12, 15]. First, manual review was adopted to confirm whether these links are legitimate. After link processing, we obtained two sets: 833 predatory journals and 1,213 normal journals. Each journal set was split into 80% for the training set and 20% for the testing set.

To extract text features for machine learning training, several text mining steps are needed for data preprocessing. First, fetched information such as scripts, HTML tags, labels, and CSS is filtered out. Stop words are removed in natural language preprocessing, and this should happen before any feature extraction operation; for example, the words "the," "is," "at," "which," and "on" are meaningless in a sentence. Then, punctuation is also removed from sentences. Finally, stemming reduces words to their roots: pass, passed, and passing are all stemmed to the word "pass," so that stemmed words match features correctly [16]. To reduce the impact on data training, we converted all text to lowercase.

B. Features Extraction
The Bag-of-Words (BoW) model is widely used to find document and text patterns in the information retrieval field. BoW only considers whether words exist in the file; word order and grammar are not considered. The model uses a dictionary of word sets to convert the comparison result into a number. For example, for the sentence "Ken and Bob all like to travel.", the dictionary maps each word as 1: "Ken," 2: "and," 3: "Bob," 4: "all," 5: "like," 6: "to," and 7: "travel". The sentence is converted into (1,1,1,1,1,1,1). After data pre-processing, this study collected each word from predatory journals and normal journals into word sets. Figure 1 shows how the word sets are collected.

Fig. 1 Example of word set

Term frequency - inverse document frequency (TF-IDF) is used to evaluate word weighting in the dataset. Term frequency is the rate of a word within a document:

    tf_{i,j} = q_{i,j} / \sum_k q_{k,j},    (1)

where q_{i,j} is the quantity of word i in document j (a whole predatory/normal journal), and \sum_k q_{k,j} is the total quantity of words in document j.

Inverse document frequency (IDF) is a measure of how much information a word provides:

    idf_t = log(D / d_t),    (2)

where D is the number of all documents and d_t is the number of documents that contain word t. TF-IDF then scores each word for each document through TF and IDF. The score is defined as

    score_{t,j} = tf_{t,j} * idf_t.    (3)

After we obtained the score of word t in the predatory journals and the normal journals, we defined the difference Diff_t as

    Diff_t = score_{t, predatory} - score_{t, normal}.    (4)

The top 20 words, ranked by their Diff_t scores, became the feature words. The subset is shown in Table 1.
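The scoring scheme of Eqs. (1)-(4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each class is merged into a single tokenized document, as the "whole predatory/normal journal" wording for Eq. (1) suggests, and all function names are ours.

```python
import math
from collections import Counter

def tf(word, doc):
    # Eq. (1): frequency of `word` in a tokenized document `doc`
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    # Eq. (2): log(D / d_t), where d_t counts documents containing `word`
    d_t = sum(1 for d in docs if word in d)
    return math.log(len(docs) / d_t)

def diff_scores(predatory_docs, normal_docs):
    # Eqs. (3)-(4): TF-IDF score each word in both merged corpora and
    # rank vocabulary by the predatory-minus-normal score difference.
    pred_doc = [w for d in predatory_docs for w in d]
    norm_doc = [w for d in normal_docs for w in d]
    docs = [pred_doc, norm_doc]
    vocab = set(pred_doc) | set(norm_doc)
    diff = {
        w: tf(w, pred_doc) * idf(w, docs) - tf(w, norm_doc) * idf(w, docs)
        for w in vocab
    }
    # Highest Diff_t first; the paper keeps the top 20 as feature words
    return sorted(diff.items(), key=lambda kv: kv[1], reverse=True)
```

With only two merged documents, words common to both classes get idf = log(1) = 0 and thus a zero difference, while class-specific words rise to the top of the ranking.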
After we obtained the Diff_t scores as our features, we converted each journal's text into a vector. This research created a 1 x N vector for each journal: if word i of the top N words appears in a journal, vector[i] is set to 1. For example, if the top 5 words are "apple, car, pen, hand, banana" and the content of journal A is "love, apple, banana, sleep, cook," then journal A's word vector is [1,0,0,0,1]. Finally, we obtain the word vector of each journal and can start model training.

TABLE 1
TOP 20 FEATURE WORDS BASED ON Diff_t SCORE

Rank  Feature        Rank  Feature
1     journal        11    biochemistry
2     issue          12    engineering
3     international  13    index
4     volume         14    review
5     paper          15    doi
6     research       16    biology
7     science        17    molecular
8     factor         18    peer
9     impact         19    submission
10    publication    20    field

C. Classifier
Classification algorithms are vital for evaluating the predictive ability of the features, and each classifier approaches a dataset differently. In this research, we used four of the most influential classification algorithms to analyze the predatory journals: Naïve Bayes, K-Nearest Neighbor (KNN), Random Forest, and Support Vector Machine (SVM) [9-11, 17]. Naïve Bayes is a simple technique for constructing classifiers. It assumes that every feature is independent and assigns class labels to instances represented by their feature vectors; this research used the Gaussian Naïve Bayes method for the experiment. SVM is a supervised learning method that uses the principle of structural risk minimization to estimate a classifying hyperplane. SVM finds the decision boundary that maximizes the margin between the two types (labels) so that the two labels can be distinguished correctly. Taking two features, 'weight' and 'refractometer,' as an example, the words are set as the x-axis and y-axis, and a line is found that separates the points into two areas, judging whether a journal is normal or predatory. KNN calculates the distance between the target data point and every other data point, considers the K nearest points, and counts the labels they belong to; finally, it predicts the target as the majority label among those K points. The Random Forest method combines multiple decision trees trained on randomly assigned training data to significantly improve the final result. This ensemble learning combines various weak models into a stable, robust model that is neither biased nor overfitting, and the decision trees vote to decide which label the data belongs to.

Results

Parameter optimization is needed for each classifier. This research performed data pre-processing and feature extraction. Then, we randomly selected 80% of the journals as training data (666 predatory journals and 970 normal journals). The remaining 20% of the journals were used as test materials. Different numbers of feature words (50-9000 words) were used with the four models.

The research tested various values (in the interval 5-9000) to decide the optimal subset size for Gaussian Naïve Bayes, KNN, Random Forest, and SVM. In the Gaussian Naïve Bayes method, when the number of word features (NUM) was 8450, the recall rate was 0.89; the F1-score was 0.75 when NUM was 3700. In the Random Forest approach, we obtained the highest recall rate, 0.982, with a NUM of 850, and an F1-score of 0.98 when NUM was 1200. After training the SVM, we obtained the highest recall, 0.952, when NUM was 350, and an F1-score of 0.934 when NUM was 2400. KNN takes the K nearest data points and counts the labels they belong to; if K equals the number of training instances, the prediction is simply the most frequent label overall. K = 4, which gave the lowest error rate, 0.065, was used as KNN's number of neighbors. This classifier obtained the highest recall, 0.96, when NUM was 3000, and the highest F1-score, 0.93, when NUM was 500. The classifiers' results are shown in Fig. 2.

Fig. 2 Feature selection process of the optimized BoW: (a) Gaussian Naïve Bayes, (b) KNN, (c) Random Forest, (d) SVM
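The binary word-vector construction and the KNN vote described above can be sketched in pure Python. This is an illustrative sketch under our own naming, not the paper's implementation; in particular, Hamming distance over the binary vectors is our assumption, since the paper does not state its distance metric.

```python
from collections import Counter

def to_vector(journal_words, top_words):
    # 1 x N binary vector: vector[i] = 1 if top_words[i] appears in the journal
    present = set(journal_words)
    return [1 if w in present else 0 for w in top_words]

def knn_predict(train_vectors, train_labels, target, k=4):
    # Hamming distance between binary word vectors (our assumption), then a
    # majority vote among the k nearest training journals (K = 4 in the paper).
    dists = sorted(
        (sum(a != b for a, b in zip(vec, target)), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# The paper's example: top 5 words vs. the content of journal A
top5 = ["apple", "car", "pen", "hand", "banana"]
print(to_vector(["love", "apple", "banana", "sleep", "cook"], top5))
# -> [1, 0, 0, 0, 1]
```

A new journal's vector would then be classified by `knn_predict` against the vectors of the 1,636 training journals.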
References