Analysis of Anemia Using Data Mining Techniques With Risk Factors Specification
Abstract— A deficiency of healthy Red Blood Cells (RBC), known as anemia, leads to
insufficient oxygen being carried to the body's tissues. It has many causes, such as iron
or vitamin deficiency. Pregnant women, children under the age of 6, people with a
low-vitamin diet and people who lose blood due to surgery or injury are at risk of
developing anemia. The disease can be diagnosed by a blood test called the Complete Blood
Count (CBC), which evaluates the hemoglobin level of the patient's blood. Left undiagnosed
or untreated, anemia can cause health problems such as severe fatigue and pregnancy
complications. Different types of anemia, especially those associated with iron or vitamin
deficiency, can be ameliorated, particularly when detected at an early stage. In this
paper, four techniques, Bayesian Network (BN), Naive Bayes (NB), Logistic Regression (LR)
and Multilayer Perceptron (MLP), have been applied to predict anemia based on 539 records
with 10 attributes collected from laboratories. The LR has given better results than the
other considered techniques. In addition, attribute evaluators such as information gain
have been applied to demonstrate the high performance of the system with a minimum number
of attributes.

Keywords— Anemia, Bayes Network, Naive Bayes, Logistic Regression, Multi-Layer Perceptron

I. INTRODUCTION
Recently, the most difficult challenge facing healthcare and health institutes has been
the early detection of dangerous disorders that lead to complicated health problems.
Medical data can be assembled from different sources such as images, laboratories or other
source types. Working with such unstructured data requires techniques that mine it
carefully and extract patterns, which is an essential part of gathering knowledge. These
patterns are difficult to analyze or even discover by humans alone. Data mining techniques
try to build the model that is closest to the actual patterns underlying the data under
examination. Detection of RBC shape/count disorders is important when data or even images
are available. Changes in RBC shapes, which occur for different reasons, can help
physicians detect anemic patients if related images are supplied, as in [1]. A
programmable architecture was proposed [2] to build a simple anemia predictor based on the
color of RBCs. In addition, an automatic design was proposed in [3] to detect anemia
patients. In that article, two techniques were used to recognize the differences between
normal and abnormal forms of RBC, and a framework was used to compare them with the
testing samples using the Euclidean distance as a classification metric. In the least
developed countries, such as Malawi, anemia is widespread, especially in children and
pregnant women. Because of the need for an anemia prediction system and the cost of such a
common system, researchers suggested a low-cost prediction method [4]. The testing cost
was $1.00 per patient, but in that paper the researchers demonstrate a spectra-based
method that minimized the price per patient. Such disease prediction devices attracted
researchers' interest in designing an early detector [5]. Using impedance analysis and
relying on hematocrit analysis, this device works with total patient samples. Numerical
techniques based on radial basis functions are presented as a solution for anemia
treatment [6].

Rare shapes and counts of some red blood cells are essential for recognizing iron
deficiency anemia, which was done with three different classifiers [7]. According to some
researchers [8], data mining approaches are able to classify two types of anemia but are
unable to predict their cause. Artificial Neural Networks (ANN) have also been used in
anemia prediction to classify RBC image samples [9], [10] and [11]. Similar approaches
with different processing techniques were also suggested [12], using Laplacian of Gaussian
filters to identify anemia at an early stage, again depending on counts and shapes. A
lower hemoglobin intensity in blood samples is one sign of anemia. Three classes of eye
and tongue image samples are analyzed to present a classification model based on image
feature extraction [13]. Anemia may go unnoticed without measuring the RBC count in blood
samples. Its occurrence is difficult to establish from a single parameter; therefore,
studies such as [14] depend on parameters other than the RBC count, such as MCV, WBC and
TIBC. In that paper, a fuzzy logic system is presented to extract the pattern among nine
different parameters for anemia prediction.

Depending on the various levels of anemia development, a decision system was suggested to
classify its activity [15]. In order to extract the patterns among different RBC-related
parameters for undernourishment conditions such as iron deficiency, which causes anemia in
pregnant women, researchers presented an algorithm in [16]. A dataset of 539 patients was
collected in Iraq for different parameters to map the cause of iron deficiency in this
region. Such a medical field should be carefully examined for serious points in order to
detect the disease at an early stage. With the help of this dataset, and especially for
such unsupervised regions, the pattern among the different parameters can provide a clear
view of the main reason for these types of
disorders. This article is expected to help guide health institutes and related disease
activities in society in dealing appropriately with the prediction of anemia types.

In this article, four algorithms have been used to predict anemic patients based on 10
attributes for 6 different classes. In this section, a brief description of these
algorithms, their principles and the reasons for their selection is given. The proposed
structure used for this paper is shown in Fig. 1. As shown in the figure, the data set has
been prepared to be applied to four different mining techniques with and without an
attribute evaluator. In addition, the techniques are compared after applying the feature
selector to examine which parameters affect the overall prediction system. This paper also
demonstrates the limitations of these techniques for the dataset with different attribute
values.

A. Naive Bayes
Based on Bayes' theorem, the Naive Bayes classifier is a family of algorithms that share
the same principle: the features are considered to be independent of each other and to
contribute equally to the classes. The main algorithm can be derived as

$$P(C \mid F) = \frac{P(F \mid C)\, P(C)}{P(F)} \tag{1}$$

where C is the anemia class, F is the set of dataset features to be classified as C,
P(C|F) is the probability of class C given the features F, and P(F|C) is the probability
of the features F given the class C.

Under the assumption of independence between the dataset features, the expression
separates into parts as

$$P(C \mid f_1, f_2, \ldots, f_n) = \frac{P(f_1 \mid C)\, P(f_2 \mid C) \cdots P(f_n \mid C)\, P(C)}{P(f_1)\, P(f_2) \cdots P(f_n)} \tag{2}$$

The final classification is obtained by selecting the output with the maximum value, as
expressed by

$$C = \arg\max_{C}\; P(C) \prod_{i=1}^{n} P(f_i \mid C) \tag{3}$$

where P(C) is the probability of the data class and P(fi|C) is the probability of the
conditionally independent feature fi given the class.

Use of the Naive Bayes for unbalanced data is possible, as in many articles [17]. In that
article, the NB is provided with an ANN dataset to solve the problem of unbalanced data,
which arises when the dataset is a collection of different types and some classes are
significantly larger in count than others. Various applications of this model have been
presented in the literature [18], based on text categorization.

Researchers [19] suggested a model using the KNN and NB to detect text/image email spam.
In another article [20], the NB is used for disease prediction, specifically asthma
prediction. Business information is used in research focusing on predicting earnings for
business intelligence, as seen in [21]. A new prediction system was introduced [22] to
detect hypervisor attacks, which mainly occur when hosts are connected to resources to
increase capacity and machine efficiency. A data mining technique based on the NB was used
to evaluate data in business improvement [23]. The main structure of the Naive Bayes,
representing each posterior and likelihood probability, is shown in Fig. 1.

Fig. 1. The main structure of Naive Bayes algorithm applied to the anemia data

B. Bayes Network
The Bayesian Network is a conditional graphical method that uses Bayesian inference for
probability assessment. In this method, the conditional dependence between features,
represented by edges, is a factor in the joint probability. Link-based dataset
classification was analyzed and predicted using a multi-relational BN, as seen in [24]. A
crude Fourier approach for image classification was applied using different data mining
techniques [25]. A suitable application of the BN for human activity recognition based on
a large amount of data was also carried out, as described in [26]. A new recognition
system based on the BN for abnormal human activity in monitoring video was presented in
[27].

C. Logistic Regression
Logistic regression (LR) is also a machine learning approach based on the basics of
probability; it classifies different types of data using a sigmoid function instead of a
linear one. A protein structure prediction system was introduced using the LR [28,29]. By
modeling phase distribution parameters, the LR method was presented to predict the
extracted theoretical function with those distributions [30]. Various applications of the
LR were also presented, such as a placement prediction system, risk prediction of mobile
users and customer churn prediction, as given in [31], [32] and [33].

III. COLLECTED DATASET
Data were collected for 539 patients with 10 attributes and one column for the 6 different
class types. Fig. 2 shows some of the information collected about anemia in the
laboratory, as described in detail in [34].
Fig. 2. Some features of the anemia data (a) Hemoglobin (HB), (b) Mean
Corpuscular Hemoglobin (MCH) and (c) Hematocrit (HCT).
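
The Naive Bayes decision rule of Eq. (3) can be illustrated with a minimal sketch. It is not the implementation used in this paper: the toy records, the discretized feature values ("low", "normal") and the Laplace smoothing term alpha are assumptions introduced only for illustration. Log-probabilities are summed instead of multiplying probabilities, which gives the same argmax.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Count-based estimates of P(C) and P(f_i | C) for discrete features."""
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)            # feature index -> observed values
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
            values[i].add(v)
    return priors, value_counts, values, class_counts


def predict(model, row, alpha=1.0):
    """Return argmax_C of log P(C) + sum_i log P(f_i | C), as in Eq. (3)."""
    priors, value_counts, values, class_counts = model
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for i, v in enumerate(row):
            # Laplace smoothing (an assumption) avoids zero probabilities.
            numerator = value_counts[(i, c)][v] + alpha
            denominator = class_counts[c] + alpha * len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy records: (HB level, MCH level) -> class label.
rows = [("low", "low"), ("low", "normal"), ("low", "low"),
        ("normal", "normal"), ("normal", "normal"), ("normal", "low")]
labels = ["anemic", "anemic", "anemic", "normal", "normal", "normal"]

model = fit_naive_bayes(rows, labels)
print(predict(model, ("low", "low")))
print(predict(model, ("normal", "normal")))
```

The denominator of Eq. (2) is omitted because it is the same for every class and therefore does not change which class attains the maximum.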
Technique   Execution Time (Sec.)   Accuracy (%)   Mean Absolute Error   Mean Square Error
BN          0.06                    85.1           0.056                 0.198
NB          0.01                    83.6           0.064                 0.209
LR          0.33                    87.3           0.062                 0.183
MLP         1.61                    87.1           0.054                 0.19
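
The table above reports accuracy together with a mean absolute error and a mean square column. The paper does not state how the error columns are computed, so the following sketch shows only one common convention: errors are taken between the predicted class probabilities and a 0/1 indicator of the true class. The function name, the probability values and the class labels are hypothetical. Since each square-column value in the table exceeds the corresponding mean absolute error, those values are consistent with a root of the mean squared error, which is what the sketch returns as "rmse".

```python
import math


def evaluate(probabilities, true_labels, classes):
    """Accuracy (%), mean absolute error and root mean squared error.

    Errors compare each predicted class probability with a 0/1 target
    (an assumed convention, not taken from the paper).
    """
    correct = 0
    abs_error = 0.0
    squared_error = 0.0
    for probs, truth in zip(probabilities, true_labels):
        predicted = max(classes, key=lambda c: probs[c])
        correct += int(predicted == truth)
        for c in classes:
            target = 1.0 if c == truth else 0.0
            diff = probs[c] - target
            abs_error += abs(diff)
            squared_error += diff * diff
    n_terms = len(true_labels) * len(classes)
    return {
        "accuracy": 100.0 * correct / len(true_labels),
        "mae": abs_error / n_terms,
        "rmse": math.sqrt(squared_error / n_terms),
    }


# Hypothetical predictions for three samples and two classes.
classes = ["a", "b"]
probabilities = [{"a": 0.8, "b": 0.2}, {"a": 0.4, "b": 0.6}, {"a": 0.7, "b": 0.3}]
true_labels = ["a", "b", "b"]

metrics = evaluate(probabilities, true_labels, classes)
print(metrics)
```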
various assumptions. The LR provides not only a measure of the different classes in terms
of the related properties, but also a guide to the relationship between features and
classes. Table 3 shows the confusion matrix generated using the LR based on the collected
anemia data.

Table 3. Confusion matrix generated using the LR based on the anemia data

Class No.     1     2     3     4     5     6
1           198     8     0     4     0     1
2             5    68     0    10     0     0
3             1     0     3     5     0     0
4             9     5     0   200     1     2
5             0     3     0     5     2     0
6             0     5     1     3     0     0

Technique   Execution Time (Sec.)   Accuracy (%)   Mean Absolute Error   Mean Square Error
BN          0.02                    85.3           0.057                 0.205
NB          0.01                    84.6           0.062                 0.199
LR          0.17                    86.1           0.068                 0.183
MLP         0.84                    86.1           0.068                 0.189

According to Table 5, the results showed that the NB and BN have a better accuracy after
using the attribute reduction, due to the robust assumption about the data distribution.
Moreover, training using the NB or BN is faster because it does not have to deal with the
whole dataset at once, and does not even have to store it in memory. The NB and BN have a
fast execution time after combining selectors with mining techniques.

By contrast, the accuracies of the LR and MLP methods decrease after attribute reduction
(from 87.3% and 87.1% to 86.1% each), whereas those of the other techniques improve. The
main reason for this is the linearity assumption between dependent and independent
features for the LR. On the other hand, the MLP has a training limitation: it can get
stuck in a local minimum without reaching the global one. The MLP also needs to be trained
several times from various starting points, which is a major reason for its high execution
time. Also, the MLP has underfitting and overfitting issues when the number of hidden
layers is selected manually. Fig. 5 shows the system performances before and after
applying attribute evaluators on the same prediction system.

Fig. 5. Anemia prediction system performances (a) before and (b) after applying attribute
evaluators

V. CONCLUSION
A reduction of Red Blood Cells (RBC) causes an insufficiency of oxygen, which forces the
human body to collapse if it is left untreated or not detected at an early stage. In this
paper, four different methods (BN, NB, LR and MLP) have been applied to detect the anemia
types using 10 attributes over 539 samples. The LR and MLP showed better performances than
the other proposed techniques, with 87.3% and 87.1%, respectively. Then, four different
attribute evaluators have been utilized to find the risk factors that affect the
prediction system the most. It has been concluded that the LR and MLP have difficulty
keeping their high performance, but still have the best results compared to the other
proposed algorithms. It has been found that the linearity assumption between dependent and
independent features leads to low performance for the LR. On the other hand, it has been
discovered that the MLP needs several training processes in order not to remain in a local
minimum and to maintain its high performance. The robust assumption and fast training have
been seen as the main reasons for the improved performance of the Bayesian methods. As a
further study, an optimization technique can be applied to this data, taking into account
the relevant features, to improve the accuracy of this prediction system.
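
As a cross-check on the reported LR result, the overall accuracy can be recomputed from the confusion matrix in Table 3. The sketch below copies the matrix values from that table; the helper names are hypothetical, and reading rows as actual classes and columns as predicted classes is an assumption (the overall accuracy does not depend on it, but the per-class values do).

```python
# Confusion matrix values copied from Table 3 (LR on the anemia data).
confusion = [
    [198, 8, 0, 4, 0, 1],
    [5, 68, 0, 10, 0, 0],
    [1, 0, 3, 5, 0, 0],
    [9, 5, 0, 200, 1, 2],
    [0, 3, 0, 5, 2, 0],
    [0, 5, 1, 3, 0, 0],
]


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_recall(matrix):
    """Diagonal entry divided by row sum, assuming rows are actual classes."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


total_samples = sum(sum(row) for row in confusion)
accuracy = overall_accuracy(confusion)
recalls = per_class_recall(confusion)

print(total_samples, round(accuracy, 4))
print([round(r, 3) for r in recalls])
```

The matrix sums to 539 samples with 471 on the diagonal, an accuracy of about 0.874, in line with the 87.3% reported for the LR. Under the row-as-actual assumption, the small classes 3, 5 and 6 are recognized far less reliably than classes 1 and 4, which the single accuracy figure hides.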
