DOI: 10.1145/3488932.3517393
research-article
Open access

EVOLIoT: A Self-Supervised Contrastive Learning Framework for Detecting and Characterizing Evolving IoT Malware Variants

Published: 30 May 2022

Abstract

Recent years have witnessed the emergence of new and more sophisticated malware targeting the Internet of Things. Moreover, the public release of the source code of popular malware families such as Mirai has spawned diverse variants, making it harder to disambiguate their ownership, lineage, and correct label. Such a rapidly evolving landscape also makes it harder to deploy and generalize effective learning models against retired, updated, and/or new threat campaigns. In this paper, we present EVOLIoT, a novel approach that aims to combat "concept drift" and the limitations of inter-family IoT malware classification by detecting drifting IoT malware families and understanding their diverse evolutionary trajectories. We introduce a robust and effective contrastive method that learns and compares semantically meaningful representations of IoT malware binaries and code without the need for expensive target labels. We find that the evolution of IoT binaries can be used as an augmentation strategy to learn effective representations to contrast (dis)similar variant pairs. We discuss the impact and findings of our analysis and present several evaluation studies to highlight the tangled relationships of IoT malware, as well as the efficiency of our contrastively learned feature vectors in preserving semantics and reducing out-of-vocabulary size in cross-architecture IoT malware binaries.

Supplementary Material

MP4 File (ASIA-CCS22-322.mp4)
Presentation Video

    Published In

    ASIA CCS '22: Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security
    May 2022
    1291 pages
    ISBN:9781450391405
    DOI:10.1145/3488932
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. concept drift
    2. contrastive learning
    3. iot malware classification

    Qualifiers

    • Research-article

    Conference

    ASIA CCS '22
    Acceptance Rates

    Overall Acceptance Rate 418 of 2,322 submissions, 18%

    Cited By

    • (2024) Mateen: Adaptive Ensemble Learning for Network Anomaly Detection. Proceedings of the 27th International Symposium on Research in Attacks, Intrusions and Defenses, 215-234. DOI: 10.1145/3678890.3678901. Online publication date: 30-Sep-2024
    • (2024) ReCDA: Concept Drift Adaptation with Representation Enhancement for Network Intrusion Detection. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3818-3828. DOI: 10.1145/3637528.3672007. Online publication date: 25-Aug-2024
    • (2024) Efficient Malware Analysis Using Metric Embeddings. Digital Threats: Research and Practice 5:1, 1-20. DOI: 10.1145/3615669. Online publication date: 21-Mar-2024
    • (2024) Tracing the Ransomware Bloodline: Investigation and Detection of Drifting Virlock Variants. 2024 6th International Conference on Computer Communication and the Internet (ICCCI), 1-6. DOI: 10.1109/ICCCI62159.2024.10674108. Online publication date: 14-Jun-2024
    • (2024) Cimalir: Cross-Platform IoT Malware Clustering using Intermediate Representation. 2024 IEEE 14th Annual Computing and Communication Workshop and Conference (CCWC), 0460-0466. DOI: 10.1109/CCWC60891.2024.10427663. Online publication date: 8-Jan-2024
    • (2024) MalSSL: Self-Supervised Learning for Accurate and Label-Efficient Malware Classification. IEEE Access 12, 58823-58835. DOI: 10.1109/ACCESS.2024.3392251. Online publication date: 2024
    • (2024) Strengthening LLM ecosystem security. Information Sciences: an International Journal 681:C. DOI: 10.1016/j.ins.2024.120923. Online publication date: 1-Oct-2024
    • (2023) BSGAT: A Graph Attention Network for Binary Code Similarity Detection. 2023 IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC), 161-167. DOI: 10.1109/PRDC59308.2023.00028. Online publication date: 24-Oct-2023
    • (2023) Contrastive Learning Over Random Fourier Features for IoT Network Intrusion Detection. IEEE Internet of Things Journal 10:10, 8505-8513. DOI: 10.1109/JIOT.2022.3214758. Online publication date: 15-May-2023
    • (2023) Self-Supervised Latent Representations of Network Flows and Application to Darknet Traffic Classification. IEEE Access 11, 90749-90765. DOI: 10.1109/ACCESS.2023.3263206. Online publication date: 2023
