Detecting Malicious URLs Using a Deep Learning Approach Based on Stacked Denoising Autoencoder

Yan, Huaizhi; Zhang, Xin; Xie, Jiangwei; Hu, Changzhen

doi:10.1007/978-981-13-5913-2_23

Part of the book series: Communications in Computer and Information Science (CCIS, volume 960)

Included in the following conference series:

Chinese Conference on Trusted Computing and Information Security

Abstract

As the source of spam, phishing, malware, and many other attacks, malicious URLs are a chronic and complicated problem on the Internet. Machine learning approaches have proven effective and achieve high accuracy in detecting malicious URLs, but the tedious process of extracting features from URLs and the high dimensionality of the resulting feature vectors make implementation time-consuming. This paper presents a deep learning method that uses a stacked denoising autoencoder (SdA) model to learn and detect intrinsic malicious features. We employ an SdA network to analyze URLs and extract features automatically. A logistic regression classifier then separates malicious from benign URLs, so detection models can be generated without manual feature engineering. We implemented our network model using Keras, a high-level neural networks API, with TensorFlow, an open-source deep learning library, as the backend. Five datasets were used, and four other methods were compared with our model. Our architecture achieves an accuracy of 98.25% and a micro-averaged F1 score of 0.98 when tested on a mixed dataset containing around 2 million samples.
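
The pipeline described in the abstract (denoising autoencoder layers trained one on top of another, followed by logistic regression on the learned codes) can be illustrated with a minimal pure-Python sketch. This is not the authors' Keras implementation. Everything specific in it is an assumption made for illustration: the toy binary vectors standing in for encoded URLs, masking noise as the corruption, tied encoder/decoder weights, the cross-entropy reconstruction loss, the layer sizes and learning rates, and all function and class names. It also skips any fine-tuning of the stack and trains the classifier on frozen codes.

```python
import math
import random


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def corrupt(x, rate, rng):
    """Masking noise: zero each component independently with probability `rate`."""
    return [0.0 if rng.random() < rate else v for v in x]


class DenoisingAutoencoder:
    """One denoising autoencoder layer with tied weights (decoder reuses W transposed)."""

    def __init__(self, n_in, n_hidden, rng):
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b_hid = [0.0] * n_hidden
        self.b_out = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.W, self.b_hid)]

    def decode(self, h):
        return [sigmoid(sum(row[i] * hj for row, hj in zip(self.W, h)) + b)
                for i, b in enumerate(self.b_out)]

    def loss(self, data):
        """Mean cross-entropy between each clean input and its reconstruction."""
        eps, total = 1e-9, 0.0
        for x in data:
            r = self.decode(self.encode(x))
            total -= sum(xi * math.log(ri + eps) + (1 - xi) * math.log(1 - ri + eps)
                         for xi, ri in zip(x, r))
        return total / len(data)

    def train(self, data, rate, lr, epochs, rng):
        for _ in range(epochs):
            for x in data:
                x_noisy = corrupt(x, rate, rng)
                h = self.encode(x_noisy)
                r = self.decode(h)
                # The reconstruction target is the CLEAN input, not the corrupted one.
                d_out = [ri - xi for ri, xi in zip(r, x)]
                d_hid = [hj * (1 - hj) * sum(w * d for w, d in zip(row, d_out))
                         for hj, row in zip(h, self.W)]
                for j, row in enumerate(self.W):
                    for i in range(len(row)):
                        # Tied weights: encoder and decoder gradients add up.
                        row[i] -= lr * (d_hid[j] * x_noisy[i] + d_out[i] * h[j])
                    self.b_hid[j] -= lr * d_hid[j]
                for i, d in enumerate(d_out):
                    self.b_out[i] -= lr * d


def pretrain_stack(data, layer_sizes, rate, lr, epochs, rng):
    """Greedy layer-wise pretraining: each layer denoises the codes of the one below."""
    layers, current = [], data
    for n_hidden in layer_sizes:
        dae = DenoisingAutoencoder(len(current[0]), n_hidden, rng)
        dae.train(current, rate, lr, epochs, rng)
        layers.append(dae)
        current = [dae.encode(x) for x in current]
    return layers, current


def predict(w, b, f):
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)


def train_logistic(features, labels, lr, epochs):
    """Plain SGD logistic regression on the codes produced by the stack."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            g = predict(w, b, f) - y
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


# Toy binary vectors standing in for encoded URLs (hypothetical data, not from the paper).
urls = [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = malicious, 0 = benign (made-up labels)

layers, codes = pretrain_stack(urls, [4, 2], rate=0.25, lr=0.1, epochs=300,
                               rng=random.Random(0))
w, b = train_logistic(codes, labels, lr=0.5, epochs=300)
scores = [predict(w, b, c) for c in codes]
```

Each entry of `scores` is the estimated probability that the corresponding toy input is malicious. A real system would replace the toy vectors with an actual URL encoding and would evaluate on held-out data rather than on the training samples.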

Acknowledgments

This work was supported by the National Key R&D Program of China (Grant No. 2016YFB0800700 & No. 2016YFC1000300).

Author information

Authors and Affiliations

School of Computer, Beijing Institute of Technology, Beijing, 100081, China
Huaizhi Yan, Xin Zhang, Jiangwei Xie & Changzhen Hu
Beijing Key Laboratory of Software Security Engineering Technology (Beijing Institute of Technology), Beijing, 100081, China
Huaizhi Yan, Xin Zhang, Jiangwei Xie & Changzhen Hu

Corresponding author

Correspondence to Huaizhi Yan.

Editor information

Editors and Affiliations

School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Huanguo Zhang
School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Bo Zhao
School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Fei Yan

About this paper

Cite this paper

Yan, H., Zhang, X., Xie, J., Hu, C. (2019). Detecting Malicious URLs Using a Deep Learning Approach Based on Stacked Denoising Autoencoder. In: Zhang, H., Zhao, B., Yan, F. (eds) Trusted Computing and Information Security. CTCIS 2018. Communications in Computer and Information Science, vol 960. Springer, Singapore. https://doi.org/10.1007/978-981-13-5913-2_23

DOI: https://doi.org/10.1007/978-981-13-5913-2_23
Published: 09 January 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-5912-5
Online ISBN: 978-981-13-5913-2
eBook Packages: Computer Science, Computer Science (R0)
