Twitter Spam Detection Based On Deep Learning: Tingmin Wu, Shigang Liu, Jun Zhang and Yang Xiang
Figure 2: New Twitter classification workflow based on deep learning (pipeline: Feature Extraction (Vectors) with WordVector -> Learning Algorithm -> High-Dimension Vector Feature -> Classification)

Figure 3: The procedure of learning document vector, where N represents the number of the words in a document.
and connectivity between followees and followers. Yang, et al. [33] constructed a social graph according to the local clustering coefficient, betweenness centrality and bidirectional links ratio. Based on the social graphs, spamming accounts are detected by analysing mathematical features of the graph. The features used in this method were proved to be more robust than those of existing algorithms. However, considering the time cost of data collection, this method is too complex to be used in the real world.

2.3 Blacklist Techniques
Blacklist techniques are commonly deployed in web filtering services such as Twitter spam detection; they block malicious websites according to information analysis such as user feedback and website crawling. Ma, et al. [21] presented a lightweight blacklisting approach with lower cost than existing classifiers. Oliver, et al. [23] detected baleful URLs using a blacklisting technique integrated in a so-called Web Reputation Technology. However, this method has to rely on manual labelling, which is too time-consuming.
In a nutshell, current spam detection methods on Twitter are still not sufficient to detect spamming activities quickly and accurately in terms of Recall, Precision and F1-measure. Achieving lower time consumption and better performance is the motivation of our work.

3. DEEP LEARNING BASED CLASSIFIER
This section describes a new Twitter spam detection technique, including vector-based characteristics training by WordVector techniques and binary classifier building using multiple machine learning algorithms. Figure 2 shows the workflow of distinguishing Twitter spam through our new method.

3.1 Deep Learning Primer
Given the limited ability of conventional machine learning algorithms to process raw strings in Natural Language Processing (NLP), deep learning was developed to understand and analyse text using a deep neural network with multiple layers [15]. Through the network, each output of the previous layer becomes the input of the next level. In particular, deep learning neural language techniques have strong language-analysis ability, with distributed vectors trained under the WordVector method [15]. Text-based vector representations for words are applied widely in systems of linguistic analysis [8, 28].
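To make the layer-to-layer data flow concrete, the following is a minimal NumPy sketch of a feed-forward pass, not the implementation used in this paper; the layer sizes and the ReLU activation are illustrative assumptions.

```python
import numpy as np

def relu(x):
    # Element-wise non-linearity applied after each affine layer.
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """Feed-forward pass: the output of each layer becomes the
    input of the next one, as described in Section 3.1."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(h @ W + b)
    return h

# Illustrative 3-layer network: 200-dim input -> 64 -> 32 -> 2.
rng = np.random.default_rng(0)
sizes = [200, 64, 32, 2]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(forward(rng.normal(size=200), weights, biases).shape)  # (2,)
```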
3.2 Detection Framework
Unlike conventional detection, we extract attributes from the content of tweets using Word2Vec, instead of manual feature collection and generation.
First of all, we apply Word2Vec to map each word in the whole dataset into a corresponding multidimensional vector. It employs a two-level neural network, where the Huffman technique is used as hierarchical softmax to allocate codes to frequent words [22]. This improves the efficiency of training the model, since high-frequency words can be processed fast [14]. Applying this technique, the word vector-based representation is trained through stochastic gradient descent, and the gradient is obtained by backpropagation. What's more, optimal vectors are obtained for each word by CBOW or Skip-gram [22].
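As a concrete illustration of this step, the sketch below trains word vectors with gensim's Word2Vec, enabling hierarchical softmax (hs=1) and selecting CBOW or Skip-gram through the sg flag; the toy corpus and every hyperparameter value are illustrative assumptions, not the settings used in our experiments.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenised tweets (illustrative only).
tweets = [
    ["free", "iphone", "click", "this", "link"],
    ["great", "talk", "at", "the", "conference", "today"],
    ["win", "cash", "now", "click", "here"],
]

model = Word2Vec(
    sentences=tweets,
    vector_size=200,  # dimensionality M of the word vectors
    window=5,
    min_count=1,
    sg=0,             # 0 = CBOW, 1 = Skip-gram
    hs=1,             # hierarchical softmax over a Huffman-coded tree
    epochs=50,
)

print(model.wv["click"].shape)  # (200,)
```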
Furthermore, the Doc2Vec training model is used to assign one vector representing every tweet using Paragraph Vector modelling [14]. Based on Word2Vec, a tweet-length document vector is trained from the combination of the word vectors and a unique document vector per record. By repeating this procedure, the optimal document-based vector of each tweet can be learned (as shown in Figure 3).
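In the same spirit, the following is a minimal gensim Doc2Vec (Paragraph Vector) sketch that assigns one vector per tweet; the PV-DM mode, the toy corpus and the hyperparameters are illustrative assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tweets = [
    ["free", "iphone", "click", "this", "link"],
    ["great", "talk", "at", "the", "conference", "today"],
    ["win", "cash", "now", "click", "here"],
]

# Each tweet becomes one tagged document, so it receives its own vector.
docs = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tweets)]

model = Doc2Vec(
    documents=docs,
    vector_size=200,  # dimensionality M of the document vector
    min_count=1,
    dm=1,             # PV-DM: combines word vectors with the document vector
    hs=1,
    epochs=100,
)

doc_vec = model.dv[0]                                 # vector of the first tweet
new_vec = model.infer_vector(["click", "to", "win"])  # vector for unseen text
print(doc_vec.shape, new_vec.shape)  # (200,) (200,)
```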
After the high-dimensional document vectors are learned, they are treated as the input features of several machine learning techniques, such as the Random Forest or a Neural Network, along with the spam/non-spam label. The document representation \vec{D} can be defined as

\vec{D} = \{d_1, d_2, \ldots, d_M\},

where M is the dimension of the document vector and d_i is the value at each dimension. By adding the binary label variable, a tweet can be indicated as

\vec{t} = (\vec{D}, label),

where \vec{t} represents the concatenated vector and label is the tweet flag of spam or non-spam. Thus, the training dataset T is expressed as

T = (\vec{t}_1, \vec{t}_2, \ldots, \vec{t}_N),

where N is the number of tweets in the training set.
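To show how the learned document vectors and labels form the training set T and feed a classifier, here is a self-contained scikit-learn sketch; the toy tweets, vector dimensions and classifier settings are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy labelled tweets: 1 = spam, 0 = non-spam (illustrative only).
tweets = [["free", "iphone", "click", "here"],
          ["nice", "weather", "in", "melbourne", "today"]] * 50
labels = [1, 0] * 50

# Learn one document vector D per tweet (Section 3.2).
docs = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(tweets)]
d2v = Doc2Vec(documents=docs, vector_size=50, min_count=1, epochs=40)

# Each training record is t = (D, label); stacking them gives T.
X = np.array([d2v.dv[i] for i in range(len(tweets))])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

for clf in (RandomForestClassifier(n_estimators=100, random_state=0),
            MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                          random_state=0)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, "accuracy:", clf.score(X_test, y_test))
```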
Table 1: The List of Methods for Comparison

Text-based using Deep Learning (Internal):
  Random Forest: classification applying the Random Forest algorithm [5] to process the word representation trained by the WordVector technique.
  Neural Network (MLP): detection method using the neural network MLP [10] with input extracted by WordVector.
  Decision Tree: employing a greedy splitting method to build a tree [9], along with WordVector pre-processing.

Traditional Text-based (Vertical Comparison):
  Palladian: the text classifier working with n-grams, i.e. series of tokens of a given length [30].
  Complementary Naive Bayes: Multinomial Naive Bayes model which can detect word distributions in documents [24].
  Complementary Naive Bayes (Frequencies): the complementary model using term frequencies [30].

Feature-based Supported by Machine Learning (Horizontal Comparison):
  Naive Bayes: a two-layer classification method, with one level representing the label of spam/non-spam, and another including a set of features [2].
  Random Forest: an anti-sensitive method, with an extra layer added [18].
  Decision Tree (C4.5): a traditional machine learning technique with multiple retrieving and ordering [19].
Figure 4: Performance values of our detection method based on deep learning on 4 sampled datasets. (A) Recall; (B) Precision; (C) F-measure; (D) Accuracy

Figure 5: Vertical comparison of performance values between our technique and traditional text-based detection approaches on 4 sampled datasets. (A) Recall; (B) Precision; (C) F-measure; (D) Accuracy

Figure 6: Horizontal comparison of performance values between our technique and feature-based methods on 4 sampled datasets. (A) Recall; (B) Precision; (C) F-measure; (D) Accuracy
Accuracy is defined as the proportion of tweets classified correctly among all tweets. It is expressed as

Accuracy = \frac{TP + TN}{TP + FP + FN + TN}

Recall (Sensitivity) is defined as the ratio of correctly classified spam to the total actual spam:

Recall = \frac{TP}{TP + FN}

Precision is defined as the ratio of true spam to all tweets classified as spam. It can be obtained by

Precision = \frac{TP}{TP + FP}

F-measure is the harmonic mean of Precision and Recall, and it can be calculated as follows:

F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}
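As a quick sanity check of these formulas, the plain-Python function below computes all four metrics from raw confusion-matrix counts; the example counts are made up for illustration.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute Accuracy, Recall, Precision and F-measure
    from confusion-matrix counts, as defined in Section 4."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    # Harmonic mean of Precision and Recall; equals 2TP / (2TP + FP + FN).
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f_measure": f_measure}

# Illustrative counts only.
print(metrics(tp=930, fp=50, fn=70, tn=950))
```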
4.2 Comparison of Classifiers
In this subsection, we evaluate the performance of our work through three different classification algorithms, with input vectors trained by the WordVector technique in a deep learning style, on four sampled datasets. The comparison results suggest the optimal classifier to be used in our method (i.e. internal comparison). The list of classifiers is shown in Table 1.
As shown in Figure 4, all three algorithms perform well. Almost all performance values are higher than 80%, and most of them are above 90%. The Random Forest technique outperforms the other two methods at all times in terms of Precision and Accuracy, while MLP achieves the highest performance on Recall over all four datasets. For the F-measure, MLP achieves the highest performance on Datasets 2 and 4, and the second best on Datasets 1 and 3. Since the spam ratio of Datasets 2 and 4 is similar to the real world, it is reasonable to achieve the highest F-measure on them. Besides, there is no significant difference among the four performance metrics (all ~95% on average). In the following, we select MLP as our classification method, along with the WordVector technique, to compare against the other approaches listed in Table 1.

4.3 Comparison (vs. Syntax-based Methods)
In this section, we compare our method to 3 existing text-based techniques vertically. Figure 5 describes the differences among the text-based methods. It indicates that our proposed method using MLP performs better than all current work in terms of Recall, F-measure and Accuracy. For Precision, the performance of our method is about 25% higher than the second place on Dataset 2, but 5% less than the best on the other datasets. It even achieves double the F-measure of Naive Bayes (Frequencies) on Datasets 2 and 4. Overall, it outperforms all the others.

4.4 Comparison (vs. Feature-based Methods)
We further compare our method to other feature-based detection methods. The performance on all four metrics is better than the others across all four datasets. As shown in Figure 6, the F-measure is much higher than the others: on average 30% higher than Random Forest, and almost nine times that of Naive Bayes on Datasets 2 and 4. Although the Decision Tree method achieves almost the same performance as our method on Dataset 1, it only reaches half of ours when testing on Dataset 4.

5. DISCUSSION
According to our performance evaluation, there are two dataset-related factors that affect the classifier: 1) the proportion of spam and non-spam, and 2) the sample dataset discretisation.

5.1 Impact of Spam Ratio
We show the impact of the spam ratio in Table 4. It can be found that, with the change of spam ratio, the performance of our proposed method remains stable; the biggest difference is only 2.45%, on Recall. In contrast, the spam ratio affects other text-based and non-text-based methods significantly. For example, in Figure 5, the F-measure of Naive Bayes (Frequencies) on the 1:19 (spam:non-spam) dataset is only half of that on the 1:1 dataset. In addition, the F-measure of Naive Bayes averages 60% on the 1:1 dataset, but it drops to one fifth on the 1:19 dataset.

Table 4: Impact of the Spam Ratio by Datasets 1 and 2 using MLP
Unit: %     Recall  Precision  F-measure  Accuracy
Dataset 1   93.48   95.04      94.25      94.30
Dataset 2   91.03   95.84      93.37      99.35

5.2 Impact of Sample Dataset Discretisation
We further study the impact of sample dataset discretisation. The results are shown in Table 5. It is found that, with the change of the sampling strategy, the performance of our proposed method remains stable; the biggest difference is only 2% on Recall. Accordingly, from Figures 4, 5 and 6, the performance on the continuous dataset is slightly better than on the randomly sampled dataset for all detection methods.

Table 5: Impact of Sample Dataset Discretisation by Datasets 1 and 3 using MLP
Unit: %     Recall  Precision  F-measure  Accuracy
Dataset 1   93.48   95.04      94.25      94.30
Dataset 3   91.48   94.23      92.83      92.94

6. CONCLUSIONS AND FUTURE WORK
In this paper, we explored the issues in current Twitter spam detection techniques and proposed a new classification method based on deep learning algorithms to address them. To evaluate its performance, we first collected labeled data (376,206 spam and 73,836 non-spam tweets) from a 10-day ground-truth dataset with more than 600 million real-world tweets. Then we utilized the WordVector technique to pre-process the tweets and convert them into high-dimension vectors.
Future work may include several aspects: 1) The evaluation in this paper is mainly empirical. We will carry out theoretical studies on the outperformance of our methods in order to better understand the deep-learning based spam detection framework; this will in addition help us improve the performance. 2) We will compare more classifiers and other methods in the future in order to demonstrate the pros and cons of our proposed method. 3) We will collect more real data from social media, particularly datasets from other platforms such as Facebook and microblogs, and study the migration of our spam detection framework. This part of the work is very important to both industry and academia, because social spam is also critical on other social media platforms.
7. REFERENCES

[1] R. Aires, A. Manfrin, S. M. Aluísio, and D. Santos. Which classification algorithm works best with stylistic features of Portuguese in order to classify web texts according to users' needs? ICMC-USP, 2004.
[2] N. B. Amor, S. Benferhat, and Z. Elouedi. Naive bayes vs decision trees in intrusion detection systems. In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 420-424. ACM, 2004.
[3] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting spammers on twitter. In Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), volume 6, page 12, 2010.
[4] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, K. Thiel, and B. Wiswedel. KNIME - the Konstanz Information Miner: version 2.0 and beyond. ACM SIGKDD Explorations Newsletter, 11(1):26-31, 2009.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[6] C. Chen, J. Zhang, Y. Xiang, and W. Zhou. Asymmetric self-learning for tackling twitter spam drift. In 2015 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 208-213. IEEE, 2015.
[7] C. Chen, J. Zhang, Y. Xie, Y. Xiang, W. Zhou, M. M. Hassan, A. AlElaiwi, and M. Alrubaian. A performance evaluation of machine learning-based streaming spam tweets detection. IEEE Transactions on Computational Social Systems, 2(3):65-76, 2015.
[8] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537, 2011.
[9] T. G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1-15. Springer, 2000.
[10] V. N. Ghate and S. V. Dudul. Optimal MLP neural network classifier for fault detection of three phase induction motor. Expert Systems with Applications, 37(4):3468-3481, 2010.
[11] C. Grier, K. Thomas, V. Paxson, and M. Zhang. @spam: the underground on 140 characters or less. In Proceedings of the 17th ACM Conference on Computer and Communications Security, pages 27-37. ACM, 2010.
[12] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, pages 56-65. ACM, 2007.
[13] X. Jin, C. Lin, J. Luo, and J. Han. A data mining-based spam detection system for social media networks. Proceedings of the VLDB Endowment, 4(12):1458-1461, 2011.
[14] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188-1196, 2014.
[15] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[16] K. Lee, J. Caverlee, and S. Webb. Uncovering social spammers: social honeypots + machine learning. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 435-442. ACM, 2010.
[17] S. Lee and J. Kim. WarningBird: detecting suspicious URLs in twitter stream. In NDSS, volume 12, pages 1-13, 2012.
[18] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18-22, 2002.
[19] S. Liu, J. Zhang, Y. Wang, and Y. Xiang. Fuzzy-based feature and instance recovery. In Asian Conference on Intelligent Information and Database Systems, pages 605-615. Springer, 2016.
[20] S. Liu, J. Zhang, and Y. Xiang. Statistical detection of online drifting twitter spam: invited paper. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 1-10. ACM, 2016.
[21] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Learning to detect malicious URLs. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):30, 2011.
[22] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[23] J. Oliver, P. Pajares, C. Ke, C. Chen, and Y. Xiang. An in-depth analysis of abuse on twitter. Trend Micro, 225, 2014.
[24] J. D. Rennie, L. Shih, J. Teevan, D. R. Karger, et al. Tackling the poor assumptions of naive bayes text classifiers. In ICML, volume 3, pages 616-623, Washington DC, 2003.
[25] K. Rybina. Sentiment analysis of contexts around query terms in documents. Master's thesis, 2012.
[26] J. Song, S. Lee, and J. Kim. Spam filtering in twitter using sender-receiver relationship. In International Workshop on Recent Advances in Intrusion Detection, pages 301-317. Springer, 2011.
[27] G. Stringhini, C. Kruegel, and G. Vigna. Detecting spammers on social networks. In Proceedings of the 26th Annual Computer Security Applications Conference, pages 1-9. ACM, 2010.
[28] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.
[29] D. Tang, F. Wei, B. Qin, T. Liu, and M. Zhou. Coooolll: a deep learning system for twitter sentiment classification. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 208-212, 2014.
[30] D. Urbansky, K. Muthmann, P. Katz, and S. Reichert. TUD Palladian overview. TU Dresden, Department of Systems Engineering, Chair Computer Networks, IIR Group, 5, 2011.
[31] A. H. Wang. Don't follow me: spam detection in twitter. In Security and Cryptography (SECRYPT), Proceedings of the 2010 International Conference on, pages 1-10. IEEE, 2010.
[32] D. Wang, S. B. Navathe, L. Liu, D. Irani, A. Tamersoy, and C. Pu. Click traffic analysis of short URL spam on twitter. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2013 9th International Conference on, pages 250-259. IEEE, 2013.
[33] C. Yang, R. Harkreader, and G. Gu. Empirical evaluation and new design for fighting evolving twitter spammers. IEEE Transactions on Information Forensics and Security, 8(8):1280-1293, 2013.