E-Mail Classification Using Genetic Algorithm With Heuristic Fitness Function
The basic idea is to separate SPAM and HAM mails from the
mails arriving in the mailbox. The fitness function is itself
problem dependent and cannot be fixed in advance for SPAM
email filtering. To derive the fitness function we carried out
experiments on 500 mails, a pool of 300 SPAM and 200 HAM
mails, and found that the minimum score point for the correct
identification of emails was 3. Hence, we defined our fitness
function as

$$F = \begin{cases} 1, & \text{score point} \geq 3 \\ 0, & \text{score point} < 3 \end{cases}$$
5.3.1 Procedure:
An email consists of a header and a message, or body. The
header contains the fields From, To, CC (carbon copy), BCC
(blind carbon copy) and Subject. In our genetic algorithm the
header is ignored and only the body is taken into
consideration. Words are extracted from the body of the mail.
During extraction, articles and similar common words such as
"a", "an", "the" and "for", as well as numbers, are discarded.
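The extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_words` and the `STOP_WORDS` set are hypothetical, and the stop list contains only the four words named in the text, whereas a real filter would likely use a longer list.

```python
import re

# Hypothetical stop list: only the words named in the text.
STOP_WORDS = {"a", "an", "the", "for"}

def extract_words(body: str) -> list[str]:
    """Return lower-cased words from an e-mail body.

    Stop words and purely numeric tokens are discarded; the header
    is assumed to have been stripped off already.
    """
    tokens = re.findall(r"[a-z0-9]+", body.lower())
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

words = extract_words("Win a FREE game for 100 dollars, the best!")
print(words)
```

The remaining words are the ones that would be looked up in the spam data dictionary.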
In the genetic algorithm, a database is first created which is
used to classify spam and ham emails; the database can be
divided into as many categories as desired. It must be
remembered that as the size of the database increases, the
number of words in the data dictionary also increases. The
selection of categories depends on the classification of the
emails. If fewer categories are defined, an email can still be
identified as spam, but the chances of false positives and
false negatives increase. In our experiment we considered a
database of 2448 emails, of which 1346 are SPAM mails and the
remaining 1102 are HAM mails. The data dictionary contains
421 words, which are divided into seven categories. The data
dictionary is presented in Appendix A. The procedure for
calculating the weights of the words of a particular group is
detailed below:
Table 1: Calculation of weights
Group  Word  Frequency  Normalized frequency  Weight of  Weight of
                        of getting a word     word       group
C1     Sex   113        0.268                 0.102      0.062
C1     Nude  23         0.055                 0.021
C3     Free  694        1.648                 0.63       0.391
C3     Game  167        0.397                 0.151
As an example, consider an email containing the four spam
words "sex", "nude", "free" and "game". Of these, "sex" and
"nude" belong to category C1, and "free" and "game" belong to
category C3 (see Appendix A). Let the email have 1103 words,
of which 997 are occurrences of "sex", "nude", "free" and
"game". Such a large number of occurrences is chosen to make
sure that the mail is treated as spam, since the spam database
is very small (only 421 words). The words extracted from the
email are first checked against the categories of the spam
database. If a word in the email matches a word in the spam
data dictionary, the probability (normalized frequency) of
getting that word is obtained by dividing the frequency of the
spam word by the total number of words in the data
dictionary. In our case "sex" occurs 113 times, hence the
probability of getting the word "sex" is 113/421 = 0.268;
likewise "nude" occurs 23 times, giving 23/421 = 0.055. The
weight of the word ($W_w$) is calculated by

$$W_w = \frac{F_w / T_{WD}}{\sum p_W} \times \frac{S_{WM}}{T_{WM}},$$

where
$F_w$: Frequency of spam word
International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 8August 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page 2959
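$T_{WD}$: Total words in data dictionary
$S_{WM}$: Total spam words in e-mail
$T_{WM}$: Total words in e-mail
$p_W$: Normalized frequency of getting a spam word, summed over
the spam words found in the e-mail

$$W_w = \frac{113/421}{0.268 + 0.055 + 1.648 + 0.397} \times \frac{997}{1103}$$

$$W_w = 0.102$$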
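The weight calculation can be checked numerically. The sketch below assumes the formula implied by the worked example, W_w = (F_w / T_WD) / (sum of p_W) x (S_WM / T_WM), and assumes that S_WM is the sum of the frequencies of the matched words (113 + 23 + 694 + 167 = 997). The function name `word_weights` and its parameter names are illustrative, not taken from the paper.

```python
def word_weights(freq, total_dict_words, total_mail_words):
    """Weight of each matched spam word in one e-mail.

    freq maps a spam word to its frequency F_w in the e-mail.
    total_dict_words is T_WD, total_mail_words is T_WM.
    """
    spam_words_in_mail = sum(freq.values())                   # S_WM (assumed)
    p = {w: f / total_dict_words for w, f in freq.items()}   # normalized frequencies
    p_sum = sum(p.values())                                   # denominator of the formula
    ratio = spam_words_in_mail / total_mail_words             # S_WM / T_WM
    return {w: p[w] / p_sum * ratio for w in freq}

# Values from Table 1: 421 dictionary words, 1103 words in the e-mail.
freq = {"sex": 113, "nude": 23, "free": 694, "game": 167}
weights = word_weights(freq, total_dict_words=421, total_mail_words=1103)
for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
```

With these inputs the weights come out at roughly 0.102, 0.021, 0.629 and 0.151, which agrees with Table 1 up to rounding. Under the assumption on S_WM stated above, the expression reduces algebraically to F_w / T_WM.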
The weight of a category is the average of the weights of its
words; for example, the weight of category C1 is
(0.102 + 0.021)/2 = 0.062. After normalization the weights lie
in the range 0.000 to 1.000. Using the hex representation, we
obtain the SPAM chromosome prototype (Fig. 6).
The weight of a gene can be encoded as
Binary 0000000000 represents weight 0.000
Binary 0000000001 represents weight 0.001
Binary 0000000010 represents weight 0.002
...
Binary 1111100111 represents weight 0.999
Binary 1111101000 represents weight 1.000
Fig. 6 SPAM chromosomes prototype
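The category weight and the gene encoding can be illustrated as follows. This is a sketch under the assumption that a gene stores the weight multiplied by 1000 as a 10-bit binary number, which is what the listed examples suggest (999 is 1111100111 in binary). The helper names `category_weight`, `encode_weight` and `decode_weight` are hypothetical.

```python
def category_weight(word_weights):
    """Average of the word weights belonging to one category."""
    return sum(word_weights) / len(word_weights)

def encode_weight(weight, bits=10):
    """Encode a weight in [0.000, 1.000] as a fixed-width binary gene."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in the range 0.000 to 1.000")
    return format(round(weight * 1000), f"0{bits}b")

def decode_weight(gene):
    """Inverse of encode_weight: binary gene back to a weight."""
    return int(gene, 2) / 1000

c1 = category_weight([0.102, 0.021])   # word weights of C1 from Table 1
print(encode_weight(0.999))            # 1111100111
print(encode_weight(round(c1, 3)))     # gene for the C1 category weight
```

Ten bits can represent values up to 1023, so every weight from 0.000 to 1.000 in steps of 0.001 fits into one gene.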
Once chromosomes have been constructed for the incoming mails,
the genetic algorithm starts and crossover takes place. As
discussed above, there are various ways in which crossover can
be performed. Here, crossover is only allowed between the bits
of genes belonging to the same category. In our algorithm both
multi-point and single-point crossover are used, and the bit
positions are selected randomly. In each generation only 12%
of the chromosomes are crossed. The next step is mutation,
which is done to recover some of the lost genes (in our case,
lost data); only 3% of the genes are mutated.
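One generation of crossover and mutation might look like the sketch below. It is not the authors' implementation: it assumes a chromosome is a list of 10-bit gene strings with one gene per category, so that crossing genes at the same index keeps the exchange within a category, and it assumes that mutation flips a single random bit. The function names and the `rng` parameter are illustrative; only the 12% and 3% rates come from the text.

```python
import random

def single_point_crossover(gene_a, gene_b, rng):
    """Swap the tails of two genes after one random bit position."""
    point = rng.randrange(1, len(gene_a))
    return gene_a[:point] + gene_b[point:], gene_b[:point] + gene_a[point:]

def multi_point_crossover(gene_a, gene_b, rng, points=2):
    """Swap every second segment between randomly chosen cut points."""
    cuts = sorted(rng.sample(range(1, len(gene_a)), points))
    a, b = list(gene_a), list(gene_b)
    swap, prev = False, 0
    for cut in cuts + [len(gene_a)]:
        if swap:
            a[prev:cut], b[prev:cut] = b[prev:cut], a[prev:cut]
        swap, prev = not swap, cut
    return "".join(a), "".join(b)

def mutate(gene, rng):
    """Flip one randomly chosen bit of a gene."""
    i = rng.randrange(len(gene))
    return gene[:i] + ("1" if gene[i] == "0" else "0") + gene[i + 1:]

def evolve(population, rng, crossover_rate=0.12, mutation_rate=0.03):
    """Return the next generation; the input population is not modified."""
    new_pop = [list(chromosome) for chromosome in population]
    n_cross = int(len(new_pop) * crossover_rate)
    n_cross -= n_cross % 2                      # chromosomes are crossed in pairs
    chosen = rng.sample(range(len(new_pop)), n_cross)
    for i, j in zip(chosen[::2], chosen[1::2]):
        k = rng.randrange(len(new_pop[i]))      # same index means same category
        operator = rng.choice([single_point_crossover, multi_point_crossover])
        new_pop[i][k], new_pop[j][k] = operator(new_pop[i][k], new_pop[j][k], rng)
    for chromosome in new_pop:
        for k, gene in enumerate(chromosome):
            if rng.random() < mutation_rate:    # about 3% of genes are mutated
                chromosome[k] = mutate(gene, rng)
    return new_pop
```

Passing an explicit `random.Random` instance as `rng` makes a run reproducible, which helps when tuning the two rates.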
The weights of the genes of the mail under test are compared
with the weights of the genes of each spam mail prototype to
find the matched genes. If the number of matched genes is
greater than or equal to three, the spam mail prototype
receives one score point. If the score points reach the
threshold (a minimum score point of 3 in our fitness
function), the mail is considered spam. The threshold can be
adjusted manually to get appropriate results; we fixed it by
experimenting on 500 emails.
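The matching and scoring step can be sketched as below. Several details are assumptions, since the text does not specify them: two genes are taken to match when their encoded weights are equal (or within an optional `tolerance`), all-zero genes in the test mail are skipped so that absent categories do not count as matches, and the final decision uses the fitness function F = 1 when the score point is at least 3. All function names are hypothetical.

```python
def matched_genes(test_chromosome, prototype, tolerance=0):
    """Count genes whose encoded weights agree between mail and prototype."""
    count = 0
    for test_gene, proto_gene in zip(test_chromosome, prototype):
        if int(test_gene, 2) == 0:
            continue  # assumption: category absent from the test mail
        if abs(int(test_gene, 2) - int(proto_gene, 2)) <= tolerance:
            count += 1
    return count

def score_points(test_chromosome, prototypes, min_matches=3):
    """One score point for every prototype with at least min_matches matched genes."""
    return sum(
        1 for prototype in prototypes
        if matched_genes(test_chromosome, prototype) >= min_matches
    )

def fitness(score, threshold=3):
    """Fitness function: 1 (SPAM) if the score point reaches the threshold, else 0."""
    return 1 if score >= threshold else 0

test_mail = ["0000000001", "0000000010", "0000000011", "0000000000"]
close = ["0000000001", "0000000010", "0000000011", "0000000000"]
far = ["0000000001", "0000000010", "1111111111", "0000000000"]
score = score_points(test_mail, [close, far, close, close])
print(score, fitness(score))
```

Raising `threshold` trades false positives for false negatives, which is why the text describes it as manually adjustable.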
4. RESULTS
In this paper preliminary results are reported for three mail
prototypes. Since the body text is central to the
classification of mail in this method, we selected three
different classes of e-mails.
Mail Prototype 1: The mail below is an example of a SPAM
mail.
Dear,Sir/Madam
It's with every sense of humility, sincerity and fairness do I
implore this medium to reach you at this time.
In the first place, my names are Vijay Patel. 28 Years of
age. My ground parents migrated from India to the UK in
1932 and my parents and his siblings were all born here in the
UK.
My father Mr. Dinesh Patel Died as a result of heart attack he
heard after losing his Gold shop here in
Birghmirgham during the UK Riot by the Angry street guys
and claimed a lot of our belonging including lives and
property.
This occurrence led me to talk with my father's Lawyer over
the Will of my father and he gave me a blue print which
stated that I am the apparent heir of his Account with the
HSBC BANK UK and at present, I don't feel safe or secure
anymore here in the UK. I deem it necessary for me to come
to India which is the Country of my Fathers and
settle down and also get married and settle instead of staying
in the UK and peradventure lose the remaining inheritance
willed to me by my late father.
I need an honest and truthful citizen of India who shall help
me in area of Investment of my fortune which is the sum of
Three Million, Seven hundred and Ninety Pounds My
proposal is a profit oriented venture. Therefore, I do need
your corporation and do update me with the norms that has to
do with an investment like in Real Estate or founding and
Academic Institution or any other venture that will
be profit incline. Our sharing formula is negotiable though I
have drafted it to be 70/30% in the profit sharing!
Conversely, your utmost corporation is required since I am
ready to dispatch this fund as peace and order has been
restored here in the UK. It's necessary for us to work as a
formidable team and build a business that will yield and
better our future. Kindly reply me and do not fail to write your
bio data shortly. Here is my contact number for easy and
fast communication .I can also speak in hindi
Faithfully
Vijay Patel
EMAIL:vijaypatel945@yahoo.co.uk
MOBILE:+447014239568
As we all know, such mails are SPAM and can be found in almost
anybody's mailbox. The above email was tested with our system
and the score point was 114.
Mail Prototype 2: The mail below is an example of a HAM
mail.
Dear Dr. Srivastava,
Associate Editor Gilberto Brambilla invites you to review this
new submission to IEEE Photonics Technology Letters. If
you are unable to review this manuscript, it would be greatly
appreciated if you could please suggest alternative reviewers.
This is the abstract of the manuscript we would like you to
review:
Agreed: http://mc.manuscriptcentral.com/ptl-ieee?URL_MASK=MRXPcGTstT9c5jDmYrmG
Declined: http://mc.manuscriptcentral.com/ptl-ieee?URL_MASK=bTGYddXDn8kfnb93J7Ph
Unavailable: http://mc.manuscriptcentral.com/ptl-ieee?URL_MASK=F9sqcXDdBD68Q9Tm2fT3
The site is located at:
http://mc.manuscriptcentral.com/ptl-ieee
Please reply to Sylvia Flores at s.j.flores@ieee.org with your
answer as to whether or not you agree to review this paper
(please do reply; we would rather have a "no" response than
no response at all). If you agree to reviw it, you will receive
an e-mail notice within a day instructing you to access the
Manuscript Central website and download the paper.
Thank you very much for your valuable service to the
community.
Sincerely,
Gilberto Brambilla
Associate Editor
IEEE Photonics Technology Letters
The above email was tested with our system and the score
point was zero. Our proposed algorithm treats this mail as a
HAM mail, and it is indeed a HAM mail.
Mail Prototype 3: The mail below is an example of a false
negative mail.
Congratulation!! dear winner, we are using this medium to
officially notify you: open the attachment in your mail box fill
the form and send it back to
US.nokiaclaimdept2013@live.co.uk
Regards
Dr. Darwin Payton
Event Manager
TEL: (+44) 7017048564
The above email was tested with our system and the score
point was zero. Our proposed algorithm treats this mail as a
HAM e-mail; however, it is a SPAM mail. Hence, this is an
example of a false negative. This happens because words like
"congratulation", "winner" and "claim" are not present in our
data dictionary.
As stated above, genetic algorithms do not work well when the
population size is small and the rate of change is too high.
Since our dictionary contains only 421 words, the population
size is very small, and the rate of change is very high
because the types of e-mail are countless.
We repeated this experiment after adding the words
"congratulation", "winner" and "claim" to the data dictionary,
and found that our system now works well: the score point is
4 and the mail is treated as a SPAM mail.
In our early results we found that the larger the number of
words in the mail, the more likely a correct classification
is. We checked our algorithm on a large corpus of 2248 mails,
of which 1346 were SPAM mails and the rest were HAM mails.
Results on such a large corpus give a more accurate view of
the classification of mails and of the effectiveness of the
GA. We ran these experiments on a high-end machine to get a
clearer and more accurate picture of the GA. In our
experiments we found that nearly 82% of the mails are
correctly classified by our method. The score point varies
from 4 to 137; however, it can go beyond 137 depending on the
number of words in the e-mail. An in-depth analysis of the
effect of the GA parameters and of the size of the spam
database on SPAM filtering is left for future work.
5. CONCLUSION
In this paper, the genetic algorithm is presented in detail
and it has been discussed how GA can be beneficial in SPAM
email classification. In the genetic algorithm, a database is
first created which is used to classify spam and ham emails;
the database can be divided into as many categories as
desired. It must be remembered that as the size of the
database increases, the number of words in the data
dictionary also increases. The selection of categories depends
on the classification of the emails. If fewer categories are
defined, an email can still be identified as spam, but the
chances of false positives and false negatives increase. Many
experiments have been performed to fix some of the important
parameters of the GA. The fitness function was selected
carefully on the basis of a set of experiments. The proposed
idea has been tested on 2248 mails and the overall efficiency
is nearly 82%.
APPENDIX-A
Database [5][6]
Group  Content              Example of keywords in each group
C1     Adult                adult, aphrodisiac, big, cam, climax,
                            company, cum, desire, erotic, fantasy,
                            fuck, gay, girl, greate, guy, hard,
                            hardcore, heaven, hot, huge, long, man,
                            max, maxlength, nude, etc.
C2     Financial            account, accountant, alert, analyst,
                            attorney, bank, bankruptcy, benefit,
                            bill, billing, broker, budget, building,
                            cash, cheque, commission, consolidate,
                            court, credit, creditor, currency,
                            customer, deposit, etc.
C3     Commercial           college, commerce, computer, cost,
                            deliver, discount, especial, expensive,
                            express, fantastic, free, furnishing,
                            furniture, game, gif, gift, great,
                            guarantee, inexpensive, etc.
C4     Beauty and Diet      after, age, amaze, anti-aging, appetite,
                            beauty, become, before, believe, blood,
                            body, botanic, breast, build, burn,
                            calorie, capsule, card, cell, change,
                            chemical, cholesterol, confirm, course,
                            diet, difference, dose, drug, effect,
                            effective, eliminate, energy, enhance,
                            exercise, eye, face, fast, etc.
C5     Traveling            book, deluxe, excite, guide, holiday,
                            honest, hotel, luxury, meal, package,
                            plan, problem, relax, relief, reserve,
                            resort, summer, temple, ticket, tour,
                            train, travel, traveler, trip, vacation
C6     Home-Based Business  address, astonishment, base, broadcast,
                            bulk, business, comfort, connect, demo,
                            domain, downline, download, earn, email,
                            emailing, ethernet, facemail, fresh,
                            home, homebased, homeworker, host,
                            income, interest, international, etc.
C7     Gambling             action, award, bet, bonus, casino,
                            challenge, extra, gambling, gold, hunt,
                            las, lucky, millionaire, player, poker,
                            prize, reward, rich, vegas, win,
                            lottery, etc.
REFERENCES
[1] http://www.kaspersky.com/about/news/spam/2013/Spam_in_201
Continued_Decline_Sees_Spam_Levels_Hit_5_year_Low
[2] Blanzieri E. and Bryl A. 2008. A Survey of Learning-Based
Techniques of Email Spam Filtering, Conference on Email
and Anti-Spam.
[3] Koprowski G. J. 2006. Spam accounts for most e-mail traffic,
Tech News World. Available:
http://www.technewsworld.com/story/51055.html
[4] Tang K.S. et al. 1996. Genetic Algorithms and Their Applications,
IEEE Signal Processing Magazine, pp. 22-37.
[5] Sanpakdee U. et al. 2006. Adaptive Spam Mail Filtering Using
Genetic Algorithm.
[6] SpamAssassin, http://spamassassin.org.
AUTHORS
Jitendra N. Shrivastava received his Master of Technology
(M.Tech) degree in Information Technology from the Indian
Institute of Information Technology (IIITA), Allahabad, India
in 2007. Presently he is doing his research work at Singhania
University in the area of spam prevention techniques. His
research interests are Data Mining and Artificial
Intelligence. He has published two books and research papers.
He is a board of studies member for various autonomous
institutions and universities. He can be contacted by email at
jitendranathshrivastava@yahoo.com.
Maringanti Hima Bindu received her doctorate (Ph.D.) in
Artificial Intelligence from the Indian Institute of
Information Technology, Allahabad, India in 2009. She has
worked with BHABHA Atomic Research Institute; ISM, Dhanbad;
and IIIT, Allahabad. Presently she is working as a Professor
at North Orissa University, India. Her research interests are
Artificial Intelligence, Image Processing and Pattern
Recognition, Natural Language Processing and Cognitive
Science. She has published many papers in national and
international conferences and journals. She is a review board
member of various reputed journals and a board of studies
member for various autonomous institutions and universities.
She can be contacted by email at mhimabindu@yahoo.com.