Predatory Journal Classification Using Machine Learning (2020 ICKII)
4 authors, including Chia-Hung Liao and Shyan-Ming Yuan (National Yang Ming Chiao Tung University)
development. For scholars to publish papers more effectively and to avoid for-profit predatory publishers, this research used a machine learning method to identify predatory journals. Features such as the text content and keywords of the collected journals' websites were extracted from mainstream predatory journal websites and normal journal websites. This research proposed a predatory journal classification system based on a new model. The results show that our model's recall rate exceeds 90%, helping ensure that the journals researchers submit to are not predatory.

Keywords: predatory journals, classifier, machine learning

Introduction

Predatory journals not only corrupt open access but also undermine scientific research and academic development in the long term [1]. Through academic journal publication, scholars grasp disciplinary development trends and the latest research topics in the shortest time. They also communicate with other researchers by reviewing manuscripts and providing valuable comments. However, the operation and editorial quality of academic journals require substantial manpower and funds to maintain. These costs make academic journal subscriptions increase year by year, and journals have gradually become monopolized by large publishers. Since the 2000s, open access (OA) journals have encouraged scholars to publish their research papers by distributing them online for free [2]. Early open access publishing changed the way academic publishing works and reduced costs, expanding access to the latest research across the global academic community [1]. With the popularization of open access, journal fees have shifted from readers to authors. Some unscrupulous publishers exploit this model to deceive scholars who urgently need to publish papers. Because predatory journals lack rigorous peer review, the quality of published articles gradually declines. Additionally, as more and more predatory journals appear, many scholars' papers are plagiarized and republished in other journals, causing original, valuable papers to be regarded as worthless.

Resources that help scholars distinguish predatory journals from normal ones include Beall's list and the Directory of Open Access Journals (DOAJ) [3]. Beall's list provides a list of probable predatory journals together with the criteria used to inspect them. The criteria include accepting questionable scientific papers, conducting false peer reviews, and charging high publication fees without providing reasonable editing services [1,4]. DOAJ contains 14,299 journals that follow the principles of open access, and scholars can consult these lists to check whether a journal to be submitted to is predatory.

This study uses machine learning classification to design a convenient method for predatory journal classification. Although some websites provide blacklists for people to check whether a journal is predatory, predatory journals change their names or URLs to evade the blacklists. Machine learning classification is an effective supervised method in which a computer model learns from training data and makes new classifications. First, data mining was used to extract the content of journal websites. Second, TF-IDF identified the keywords distinguishing normal journals from predatory journals, and the bag-of-words (BoW) method was used to represent the content of the journal websites [5]. Finally, this research built several models that distinguish predatory journal websites from legitimate ones using machine learning methods.

Related Work

Predatory journal websites present convincing imitations of the policies and practices of normal journals. The websites convince scholars of their peer review processes, a seemingly academic editorial board, and a high impact factor [1,6]. Predatory journals use similar journal titles, such as "International Journal of Clinical and Experimental Ophthalmology," and pseudo scholar information [4]. Librarians pay attention to predatory publishing, and researchers should avoid submitting papers to predatory journals. Beall was the first librarian to describe predatory journals; he developed "Beall's List" based on 52 criteria he established for researchers to consult [1, 7]. The list helps scholars identify and share information about publishers that are deceptive, lack transparency, or do not follow legitimate publishing standards.

Some studies conduct semantic analysis of website content, because predatory journal websites or publishers use specific descriptions to attract scholars' attention. The Linguistic Inquiry and Word Count (LIWC) system was used to classify text into 81 categories covering cognitive and emotional psychology [6]. Statistical classification with LIWC analyzed how well these language features predict whether a piece of text is predatory or normal. The results showed that predatory publishers' websites used not only more auxiliary verbs (e.g., "should," "would") and positive emotions (e.g., "congratulation"), but also fewer function words such as quantifiers, articles, and adverbial conjunctions. Other researchers compared the blacklists of Beall's list and Cabell's list with whitelists such as DOAJ, the Open Access Scholarly Publishers Association (OASPA), and PubMed.
Then, they investigated features of predatory journals such as the journal title, publisher address, and contact email addresses [8]. For example, predatory journals use similar journal titles to deceive scholars, provide forged addresses in the U.S. or U.K., and use open, free email services such as Gmail and Yahoo. The content and semantics of predatory journal websites are therefore worth analyzing for signs that a journal is predatory. However, these features are not absolute criteria.

Classification in machine learning builds a model from known data and its label attributes [9-11]. With this model, we can understand the features of data belonging to various labels and predict which label new data belongs to. A well-known method of detecting phishing uses feature representations of the website and machine learning to classify the website [12]. Phishing activities imitate legitimate science journal websites (e.g., Wulfenia and Archives des Sciences) to cheat researchers into paying author fees. The researchers used the TF-IDF algorithm to characterize the URLs and the website content. They applied various classification algorithms, such as naive Bayes (NB), support vector machine (SVM), and random forest (RF), to detect whether websites are phishing. The classifier system achieved 98.8% accuracy in detecting phishing. Another study used heuristics-based and text-based feature representations to classify predatory journals [13]; it found that text-based features were easier to extract than heuristics-based ones. Based on these studies, we considered that the content of predatory journal websites has characteristics similar to phishing websites. In this study, we used mainstream machine learning methods to classify predatory journal websites.

Methodology

A. Data Preprocessing
This research collected the uniform resource locators (URLs) of normal journal and predatory journal websites. The blacklists (predatory journals) were taken from credible lists such as 'Beall's list' and 'Stop Predatory Journals' [7, 14]. Through long-term monitoring, review by librarians worldwide, and different certification criteria, these were confirmed as potential predatory journals. The whitelist, collected from BIH QUEST, contains normal open access journals listed on DOAJ and PubMed Central [3].

Web crawling was used to get the links of predatory and normal journals [12, 15]. First, manual review was adopted to confirm whether these links are legitimate. After link processing, we obtained two sets: 833 predatory journals and 1,213 normal journals. Each journal set was split into 80% for the training set and 20% for the testing set.

To extract text features for machine learning training, several text mining steps are needed for data preprocessing. First, fetched information such as scripts, HTML tags, labels, and CSS is filtered out. Stop words are removed in natural language preprocessing, and this should happen before any feature extraction operation; for example, the words "the," "is," "at," "which," and "on" are meaningless in a sentence. Then, punctuation is also removed from sentences. Finally, stemming reduces words to their roots: pass, passed, and passing are all stemmed to the word "pass," so that stemmed words match features correctly [16]. To reduce the impact on data training, we converted all text to lowercase.

B. Features Extraction
The Bag-of-Words (BoW) model is widely used to find document and text patterns in the information retrieval field. BoW only considers whether words exist in the file; word order and grammar are not considered. The model uses a dictionary of word sets to convert the comparison result into a number. For example, for the sentence "Ken and Bob all like to travel.", the dictionary maps each word as 1: "Ken," 2: "and," 3: "Bob," 4: "all," 5: "like," 6: "to," and 7: "travel". The sentence is converted into (1,1,1,1,1,1,1). After data pre-processing, this study collected each word from predatory journals and normal journals into word sets. Figure 1 shows how the word sets are collected.

Fig. 1 Example of word set

Term frequency - inverse document frequency (TF-IDF) is used to evaluate word weighting in the dataset. Term frequency is the rate of a word within a document:

    tf_{i,j} = q_{i,j} / \sum_k q_{k,j},    (1)

where q_{i,j} is the quantity of word i in document j (a whole predatory/normal journal), and \sum_k q_{k,j} is the total quantity of words in document j.

Inverse document frequency (IDF) is a measure of how much information a word provides:

    idf_t = log(D / d_t),    (2)

where D is the number of all documents and d_t is the number of documents that contain word t. TF-IDF then scores each word for each document through TF and IDF. The score is defined as

    score_{t,j} = tf_{t,j} * idf_t.    (3)

After we obtained the score of word t in the predatory journals and the normal journals, we defined the difference Diff_t as

    Diff_t = score_{t, predatory} - score_{t, normal}.    (4)

The top 20 words, ranked by their Diff_t scores, became the feature words. The subset is shown in Table 1.
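The scoring scheme of Eqs. (1)-(4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each class is merged into a single tokenized document, as the "whole predatory/normal journal" wording for Eq. (1) suggests, and all function names are ours.

```python
import math
from collections import Counter

def tf(word, doc):
    # Eq. (1): frequency of `word` in a tokenized document `doc`
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    # Eq. (2): log(D / d_t), where d_t counts documents containing `word`
    d_t = sum(1 for d in docs if word in d)
    return math.log(len(docs) / d_t)

def diff_scores(predatory_docs, normal_docs):
    # Eqs. (3)-(4): TF-IDF score each word in both merged corpora and
    # rank vocabulary by the predatory-minus-normal score difference.
    pred_doc = [w for d in predatory_docs for w in d]
    norm_doc = [w for d in normal_docs for w in d]
    docs = [pred_doc, norm_doc]
    vocab = set(pred_doc) | set(norm_doc)
    diff = {
        w: tf(w, pred_doc) * idf(w, docs) - tf(w, norm_doc) * idf(w, docs)
        for w in vocab
    }
    # Highest Diff_t first; the paper keeps the top 20 as feature words
    return sorted(diff.items(), key=lambda kv: kv[1], reverse=True)
```

With only two merged documents, words common to both classes get idf = log(1) = 0 and thus a zero difference, while class-specific words rise to the top of the ranking.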
After we obtained the Diff_t scores as our features, we converted each journal's text into a vector. This research created a 1 x N vector for each journal: if word i of the top N words appears in a journal, vector[i] is set to 1. For example, if the top 5 words are "apple, car, pen, hand, banana" and the content of journal A is "love, apple, banana, sleep, cook," then journal A's word vector is [1,0,0,0,1]. Finally, we obtain the word vector of each journal and can start model training.

TABLE 1
TOP 20 FEATURE WORDS BASED ON Diff_t SCORE

Rank  Feature        Rank  Feature
1     journal        11    biochemistry
2     issue          12    engineering
3     international  13    index
4     volume         14    review
5     paper          15    doi
6     research       16    biology
7     science        17    molecular
8     factor         18    peer
9     impact         19    submission
10    publication    20    field

C. Classifier
Classification algorithms are vital for evaluating the predictive ability of the features, and each classifier approaches a dataset differently. In this research, we used four of the most influential classification algorithms to analyze the predatory journals: Naïve Bayes, K-Nearest Neighbor (KNN), Random Forest, and Support Vector Machine (SVM) [9-11, 17]. Naïve Bayes is a simple technique for constructing classifiers. It assumes that every feature is independent and assigns class labels to instances represented by their feature vectors; this research used the Gaussian Naïve Bayes method for the experiment. SVM is a supervised learning method that uses the principle of structural risk minimization to estimate a classifying hyperplane. SVM finds the decision boundary that maximizes the margin between the two types (labels) so that the two labels can be distinguished correctly. Taking two features, 'weight' and 'refractometer,' as an example, the words are set as the x-axis and y-axis, and a line is found that separates the points into two areas, judging whether a journal is normal or predatory. KNN calculates the distance between the target data point and every other data point, considers the K nearest points, and counts the labels they belong to; finally, it predicts the target as the majority label among those K points. The Random Forest method combines multiple decision trees trained on randomly assigned training data to significantly improve the final result. This ensemble learning combines various weak models into a stable, robust model that is neither biased nor overfitting, and the decision trees vote to decide which label the data belongs to.

Results

Parameter optimization is needed for each classifier. This research performed data pre-processing and feature extraction. Then, we randomly selected 80% of the journals as training data (666 predatory journals and 970 normal journals). The remaining 20% of the journals were used as test materials. Different numbers of feature words (50-9000 words) were used with the four models.

The research tested various values (in the interval 5-9000) to decide the optimal subset size for Gaussian Naïve Bayes, KNN, Random Forest, and SVM. In the Gaussian Naïve Bayes method, when the number of word features (NUM) was 8450, the recall rate was 0.89; the F1-score was 0.75 when NUM was 3700. In the Random Forest approach, we obtained the highest recall rate, 0.982, with a NUM of 850, and an F1-score of 0.98 when NUM was 1200. After training the SVM, we obtained the highest recall, 0.952, when NUM was 350, and an F1-score of 0.934 when NUM was 2400. KNN takes the K nearest data points and counts the labels they belong to; if K equals the number of training instances, the prediction is simply the most frequent label overall. K = 4, which gave the lowest error rate, 0.065, was used as KNN's number of neighbors. This classifier obtained the highest recall, 0.96, when NUM was 3000, and the highest F1-score, 0.93, when NUM was 500. The classifiers' results are shown in Fig. 2.

Fig. 2 Feature selection process of the optimized BoW: (a) Gaussian Naïve Bayes, (b) KNN, (c) Random Forest, (d) SVM
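The binary word-vector construction and the KNN vote described above can be sketched in pure Python. This is an illustrative sketch under our own naming, not the paper's implementation; in particular, Hamming distance over the binary vectors is our assumption, since the paper does not state its distance metric.

```python
from collections import Counter

def to_vector(journal_words, top_words):
    # 1 x N binary vector: vector[i] = 1 if top_words[i] appears in the journal
    present = set(journal_words)
    return [1 if w in present else 0 for w in top_words]

def knn_predict(train_vectors, train_labels, target, k=4):
    # Hamming distance between binary word vectors (our assumption), then a
    # majority vote among the k nearest training journals (K = 4 in the paper).
    dists = sorted(
        (sum(a != b for a, b in zip(vec, target)), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# The paper's example: top 5 words vs. the content of journal A
top5 = ["apple", "car", "pen", "hand", "banana"]
print(to_vector(["love", "apple", "banana", "sleep", "cook"], top5))
# -> [1, 0, 0, 0, 1]
```

A new journal's vector would then be classified by `knn_predict` against the vectors of the 1,636 training journals.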
References