ETCW15
Abstract The most dangerous online threat nowadays is the threat to the mailing system. Spam, meaning any unwanted and
potentially harmful mail, is a security threat to the mailing system and should be separated from the mails that are
useful to users. This paper surveys different spam filtering techniques, the training problems of the Support Vector
Machine (SVM), and the need to introduce MapReduce (Hadoop) to train the SVM. Techniques to separate spam mails are
word based, content based, machine learning based, and hybrid. Machine learning techniques are the most popular because
of their high accuracy and mathematical support. SVM is the most widely used machine learning based technique because
of its ability to handle data with a large number of attributes. There are, however, hurdles in the training process of
SVM, including input that cannot be given to the training algorithm. These problems can be addressed by implementing the
training algorithm on the MapReduce (Hadoop) framework, which gives up to a 6 times speedup over the sequential algorithm.
Keywords- Spam filtering techniques; Word based; Content based; Machine learning based; Support Vector Machine
(SVM); Map Reduce; Web Services
I. INTRODUCTION
Among modern communication tools, the e-mail system is the most widely used. E-mail has become a boon for business
because of its wide availability. It is also one of the fastest ways of communication, as there is no need to wait for a
response. The main danger for e-mail is spam mail, also known as unwanted mail. Spam mails are typically sent in bulk.
They are used to deliver links to phishing websites and malicious attachments, and may also include malicious scripts and
executable attachments [7]. The threat of receiving such unwanted and malicious programs motivated us to build a system
that not only separates spam but also blocks the accounts sending it.
With increasing security measures in network services, remote exploitation is getting harder.
Therefore efficient filtering methods for spam messages are needed.
In this project, we introduce a more proactive approach that allows us to detect valid and spam mail.
The main aim is to design and develop a spam detection system for e-mails using a classification algorithm, i.e.
SVM.
3.1. Whitelist/Blacklist.
This approach relies on lists. The whitelist contains e-mail addresses or entire domains that are considered safe. The
blacklist is exactly the opposite: it contains addresses that are harmful to users. An automatic list management tool is
used in this method [5].
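The list lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the
tool referenced in [5]; the function name classify_sender, the example addresses, and the "unknown" fallback for senders
on neither list are assumptions made for the example.

```python
def classify_sender(sender, whitelist, blacklist):
    """Return 'ham', 'spam', or 'unknown' for a sender address."""
    address = sender.strip().lower()
    # Everything after the last '@' is treated as the domain.
    domain = address.rpartition("@")[2]
    # List entries may be full addresses or entire domains.
    if address in whitelist or domain in whitelist:
        return "ham"
    if address in blacklist or domain in blacklist:
        return "spam"
    # Senders on neither list need another filtering technique.
    return "unknown"


# Hypothetical lists used only for illustration.
WHITELIST = {"alice@example.org", "trusted.example"}
BLACKLIST = {"bulk@spam.example", "bad.example"}
```

A sender on neither list falls through to "unknown", which is where the other techniques in this survey would take over.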
3.3. Signatures.
In this approach a signature is generated for each spam message in the form of a unique hash value. The filter then
compares the hash of an incoming mail with the stored values of previously seen spam mails. A valid e-mail is very
unlikely to produce the same value.
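A minimal sketch of the signature idea follows, assuming SHA-256 over a whitespace- and case-normalized message body.
The hash function, the normalization step, and the class name SignatureFilter are choices made for this example; the
text does not specify them.

```python
import hashlib


def signature(body):
    """Hash a message body after normalizing case and whitespace (assumed normalization)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SignatureFilter:
    def __init__(self):
        self.known_spam = set()  # signatures of mails already reported as spam

    def report_spam(self, body):
        self.known_spam.add(signature(body))

    def is_spam(self, body):
        return signature(body) in self.known_spam
```

A mail is flagged only if its signature was reported earlier, and changing a single character of the body yields a
different hash, so a slightly altered copy of a known spam mail is not caught.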
In the whitelist and blacklist spam filtering approach, the filter can be easily penetrated by spammers.
In the signature based approach, new spam e-mails cannot be identified, since a message first has to be reported as
spam and its hash distributed.
In the mail header checking approach, there is a high false positive rate, and rejecting a connection requires
additional information.
The Bayesian classifier approach relies on Naïve Bayes filtering, which assumes that events occur independently of
each other.
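The independence assumption mentioned for the Bayesian classifier can be written out explicitly. As a sketch in
standard notation (the symbols $w_1,\dots,w_n$ for the words of a message and $S$ for the spam class are introduced here
and are not from the original text), Naïve Bayes scores a message as:

```latex
P(S \mid w_1,\dots,w_n) \;\propto\; P(S)\,\prod_{i=1}^{n} P(w_i \mid S)
```

The product form holds only if the words are treated as conditionally independent given the class, which real e-mail
text generally does not satisfy.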
V. SPAM FILTERING TECHNIQUES USED IN THE STANDARD SYSTEMS AS YAHOO AND GMAIL
Gmail supports several authentication systems, such as SPF (Sender Policy Framework), DomainKeys, and DKIM
(DomainKeys Identified Mail).
@IJAERD-2017, All rights Reserved 59
International Journal of Advance Engineering and Research Development (IJAERD)
E.T.C.W, January -2017, e-ISSN: 2348 - 4470, print-ISSN: 2348-6406.
In Figure 1, plane H1 is a good classifier and plane H2 is not so good, or does not classify at all. The distance between
the two planes is called the interval or the margin. When the points are not linearly separable and we cannot find a
hyperplane, the space can be extended as needed. If all the samples in the training corpus can be correctly separated by
a hyperplane, and the distance from plane H to the nearest vectors (which we call different-vectors) reaches its largest
value, this plane is deemed the optimal categorization hyperplane. Its equation is $w^{T}x + b = 0$, where $w$ is the
normal vector of the categorization plane. We call these different-vectors support vectors; they are shown as the points
with a double circle. One of the most interesting features of the SVM technique is that, to find the appropriate plane,
the SVM method explores only the points nearest to it.
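The notion of the largest margin can be stated as an optimization problem. The following is the standard hard-margin
formulation for linearly separable data, given here as a sketch; the labels $y_i \in \{-1,+1\}$, the sample index $i$,
and the sample count $N$ are notation assumed for illustration.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{T}x_i + b) \ge 1,\qquad i = 1,\dots,N
```

Under this scaling the margin equals $2/\lVert w\rVert$, and the support vectors are the samples for which the
constraint holds with equality.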
Each Map task of the MapReduce paradigm can run serial SMO (Sequential Minimal Optimization) independently on its own
training subset. Each Map task gives its output in the form of {key, value} pairs. The Reduce task takes the {key, value}
pairs generated by each Map task as input and combines the results of all Map tasks to get the final output. All map and
reduce tasks run independently.
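The data flow described above can be sketched in plain Python without a Hadoop cluster. Two substitutions are made
here and should be read as assumptions: a small stochastic gradient descent trainer on the hinge loss stands in for
serial SMO, and the reduce step combines the local models by averaging their parameters, a rule the text does not
specify. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_local(samples, epochs=10, lr=0.1, lam=0.01):
    """Stand-in for serial SMO: SGD on the regularized hinge loss of a linear SVM."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            if y * (dot(w, x) + b) < 1:
                # Margin violated: step toward classifying x correctly.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the regularizer shrinks w.
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b


def map_task(samples):
    """Train on one partition and emit the local model as {key, value} pairs."""
    w, b = train_local(samples)
    return [("w", w), ("b", b)]


def reduce_task(pairs):
    """Group values by key and average them (assumed combine rule)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    n = len(grouped["b"])
    w = [sum(column) / n for column in zip(*grouped["w"])]
    b = sum(grouped["b"]) / n
    return w, b


def predict(model, x):
    w, b = model
    return 1 if dot(w, x) + b >= 0 else -1


# Toy data: two partitions of labeled 2-D points (+1 and -1 stand for the two classes).
partitions = [
    [((3.0, 3.0), 1), ((-3.0, -3.0), -1), ((2.0, 3.0), 1), ((-2.0, -3.0), -1)],
    [((4.0, 2.0), 1), ((-4.0, -2.0), -1), ((3.0, 1.0), 1), ((-3.0, -1.0), -1)],
]
# On a cluster the map tasks would run in parallel; here they run one after another.
pairs = [pair for part in partitions for pair in map_task(part)]
model = reduce_task(pairs)
```

Because each map task touches only its own partition, the training cost per task shrinks as partitions are added.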
By introducing the kernel, SVMs gain flexibility in the choice of the form of the threshold separating different
companies.
Since the kernel introduced by the SVM is a non-linear transformation, nothing needs to be assumed about the form of
the transformation that makes the data linearly separable.
SVMs provide good out-of-sample generalization.
SVMs deliver a unique solution, since the optimality problem is convex. This is an advantage compared to
Neural Networks, which have a number of solutions associated with local minima.
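The flexibility gained from the kernel can be illustrated with a small example. The sketch below assumes the Gaussian
(RBF) kernel, which is one common choice and is not named in the text, and uses hand-picked coefficients rather than the
result of training. It shows a kernel decision function separating the XOR pattern, which no straight line in the
original two-dimensional space can separate.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision(x, support_vectors, bias=0.0, gamma=0.5):
    """f(x) = sum_i alpha_i * y_i * K(sv_i, x) + bias."""
    return sum(
        alpha * y * rbf_kernel(sv, x, gamma) for sv, y, alpha in support_vectors
    ) + bias


# XOR pattern: (point, label, alpha). The alphas are hand-picked, not trained.
XOR_SUPPORT_VECTORS = [
    ((0.0, 0.0), -1, 1.0),
    ((1.0, 1.0), -1, 1.0),
    ((0.0, 1.0), +1, 1.0),
    ((1.0, 0.0), +1, 1.0),
]
```

The sign of decision(x, XOR_SUPPORT_VECTORS) matches the label at each of the four points, even though the threshold is
no longer a plane in the input space.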
A common disadvantage of techniques such as SVM is the lack of transparency of the results.
Because the dimensionality may be very high, the score of all companies cannot be represented by a simple parametric
function of the financial ratios. The financial ratios do not have constant weights, so the marginal contribution of each
financial ratio to the score is variable.
IX. CONCLUSION
The aim is to develop a mailing system capable of isolating spam mails and containing them separately in a spam folder.
It also includes blocking a particular account for sending spam e-mails in bulk.
The disadvantages of the spam filtering in existing e-mail systems would be overcome in this project.
The e-mail system takes as input the e-mail received by its users; the received e-mail is then validated
as spam or ham by comparing it against the keywords used in spam mail using the algorithms.
Since spam e-mails can be sent in bulk in a normal e-mail system, in our system the account sending
spam mail in bulk will be blocked.
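The flow summarized in these points (validate each received mail, file spam separately, block bulk senders) can be
sketched as follows. This is an illustration under stated assumptions rather than the system itself: a plain keyword
count stands in for the classification algorithm, and the keyword list, both thresholds, and the class name MailSystem
are hypothetical.

```python
from collections import Counter

SPAM_KEYWORDS = {"lottery", "winner", "prize"}  # hypothetical keyword list
KEYWORD_THRESHOLD = 2  # assumed: keyword hits needed to call a mail spam
BULK_THRESHOLD = 3     # assumed: spam mails from one sender before blocking


def is_spam(body):
    """Keyword check standing in for the classifier."""
    words = set(body.lower().split())
    return len(words & SPAM_KEYWORDS) >= KEYWORD_THRESHOLD


class MailSystem:
    def __init__(self):
        self.spam_counts = Counter()
        self.blocked = set()
        self.inbox = []
        self.spam_folder = []

    def receive(self, sender, body):
        if sender in self.blocked:
            return "rejected"
        if is_spam(body):
            self.spam_folder.append((sender, body))
            self.spam_counts[sender] += 1
            # Block the account once it has sent spam in bulk.
            if self.spam_counts[sender] >= BULK_THRESHOLD:
                self.blocked.add(sender)
            return "spam"
        self.inbox.append((sender, body))
        return "ham"
```

With these thresholds, the third spam mail from one account triggers the block, and later mail from that account is
rejected before any content check.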
X. FUTURE SCOPE
The e-mail system takes as input the e-mail received by its users; the received e-mail is then validated
as spam or ham by comparing it against the keywords used in spam mail using the algorithms. Since spam e-mails can be
sent in bulk in a normal e-mail system, in our system the account sending spam mail in bulk will be blocked.
REFERENCES
[1] Rekha, Sandeep Negi, A Review on Different Spam Detection Approaches, International Journal of Engineering
Trends and Technology (IJETT), Volume 11, Number 6, May 2014.
[2] Saadat Nazirova, Survey on Spam Filtering Techniques, Communications and Network, 2011, 3, 153-160
doi:10.4236/cn.2011.33019 Published Online August 2011 (http://www.SciRP.org/journal/cn)
[3] Amol G. Kakade, Prashant K. Kharat, Spam filtering techniques and Map Reduce with SVM: A study, 2014 Asia-
Pacific Conference on Computer Aided System Engineering (APCASE).
[4] J. Vijaya Chandra, Dr. Narasimham Challa, Dr. Sai Kiran Pasupuleti, A Practical Approach to E-mail Spam Filters
to Protect Data from Advanced Persistent Threat, 2016 International Conference on Circuit, Power and Computing
Technologies [ICCPCT].
[5] Tarjani Vyas, Payal Prajapati, Somil Gadhwal, A Survey and Evaluation of Supervised Machine Learning
Techniques for Spam E-Mail Filtering.
[6] Wanqing You, Kai Qian, Dan Lo, Prabir Bhattacharya, Minzhe Guo, Ying Qian, Web Service-enabled Spam
Filtering with Naïve Bayes Classification, 2015 IEEE First International Conference on Big Data Computing
Service and Applications.
[7] Anirudh Harisinghaney, Aman Dixit, Saurabh Gupta, Anuja Arora, Text and Image Based Spam Email
Classification using KNN, Naïve Bayes and Reverse DBSCAN Algorithm, 2014 International Conference on
Reliability, Optimization and Information Technology ICROIT 2014, India, Feb 6-8 2014.
[8] Godwin Caruana, Maozhen Li, Man Qi, A MapReduce based Parallel SVM for Large Scale Spam
Filtering, 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD).
[9] Amol G. Kakade, Prashant K. Kharat, Anil Kumar Gupta, Survey of Spam Filtering Techniques and Tools, and
MapReduce with SVM, IJCSMC, Vol. 2, Issue 11, November 2013, pp. 91-98.
[10] Salwa Adriana Saab, Nicholas Mitri, Mariette Awad, Ham or Spam? A comparative study for some Content-based
Classification Algorithms for Email Filtering, 17th IEEE Mediterranean Electrotechnical Conference, Beirut,
Lebanon, 13-16 April 2014