Empirical assessment of machine learning-based malware detectors for Android

Allix, Kevin; Bissyandé, Tegawendé F.; Jérome, Quentin; Klein, Jacques; State, Radu; Le Traon, Yves

doi:10.1007/s10664-014-9352-6

Measuring the gap between in-the-lab and in-the-wild validation scenarios

Published: 24 December 2014

Empirical Software Engineering, volume 21, pages 183–211 (2016)

Abstract

To address the issue of malware detection in large sets of applications, researchers have recently started to investigate the capabilities of machine-learning techniques for proposing effective approaches. So far, several promising results have been recorded in the literature, with many approaches being assessed in what we call in-the-lab validation scenarios. This paper revisits the purpose of malware detection to discuss whether such in-the-lab validation scenarios provide reliable indications of the performance of malware detectors in real-world settings, i.e., in the wild. To this end, we have devised several machine learning classifiers that rely on a set of features built from applications’ control-flow graphs (CFGs). We use a sizeable dataset of over 50 000 Android applications collected from the sources where state-of-the-art approaches have selected their data. We show that, in the lab, our approach outperforms existing machine learning-based approaches. However, this high performance does not translate into high performance in the wild. The performance gap we observed (F-measures dropping from over 0.9 in the lab to below 0.1 in the wild) raises one important question: how do state-of-the-art approaches perform in the wild?
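For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The following is a minimal Python sketch of that arithmetic only; the function name `f_measure` and all counts are hypothetical and are not taken from the paper's experiments, and the sketch makes no claim about why the reported gap occurs.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from confusion-matrix counts; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, invented purely to illustrate the formula.
# Same recall in both cases; only the number of false positives differs.
few_false_alarms = f_measure(tp=90, fp=10, fn=10)
many_false_alarms = f_measure(tp=90, fp=2000, fn=10)

print(f"few false positives:  F1 = {few_false_alarms:.3f}")
print(f"many false positives: F1 = {many_false_alarms:.3f}")
```

With recall held fixed, the score is driven down by precision alone, which is why the F-measure is sensitive to the number of false alarms a detector raises.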

Notes

Google Play was formerly known as Google Market
https://www.virustotal.com
Dalvik is a virtual machine that is included in the Android OS
https://github.com/malwaredetector/malware-detect
http://www.cs.waikato.ac.nz/ml/weka/
The value of k used by Sahs & Khan was not disclosed.
While 10-Fold is equivalent to testing 10 times on 10 % while being trained on 90 % of the dataset, 5-Fold is equivalent to testing 5 times on 20 % while being trained on 80 % of the dataset.
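The train/test proportions stated in the last note can be checked with a small sketch. The Python code below is an illustrative assumption, not the authors' tooling (the notes point to Weka for that); the helper names `k_fold_indices` and `split_fractions` are hypothetical, and the folds are contiguous and unshuffled for simplicity, whereas practical evaluations usually shuffle or stratify first.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


def split_fractions(n_samples: int, k: int):
    """Return the (train, test) share of the dataset for every fold."""
    return [
        (len(train) / n_samples, len(test) / n_samples)
        for train, test in k_fold_indices(n_samples, k)
    ]


ten_fold = split_fractions(1000, 10)
five_fold = split_fractions(1000, 5)
print(ten_fold[0], len(ten_fold))
print(five_fold[0], len(five_fold))
```

Each sample appears in exactly one test fold, so every fold is evaluated on data the corresponding model was not trained on.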

References

Allix K, Bissyandé TF, Jérome Q, Klein J, State R, Le Traon Y (2014a) Large-scale machine learning-based malware detection: confronting the “10-fold cross validation” scheme with reality. In: Proceedings of the 4th ACM conference on data and application security and privacy. ACM, New York, CODASPY ’14, pp 163–166. doi:10.1145/2557547.2557587
Allix K, Jérome Q, Bissyandé TF, Klein J, State R, Le Traon Y (2014b) A forensic analysis of android malware: how is malware written and how it could be detected? In: Computer software and applications conference (COMPSAC)
Amos B, Turner H, White J (2013) Applying machine learning classifiers to dynamic android malware detection at scale. In: 2013 9th international wireless communications and mobile computing conference (IWCMC), pp 1666–1671. doi:10.1109/IWCMC.2013.6583806
AndroGuard (2013) Apktool for reverse engineering android applications. https://code.google.com/p/androguard/. Accessed 09 Sep 2013
AppBrain (2013a) Comparison of free and paid android apps. http://www.appbrain.com/stats/free-and-paid-android-applications. Accessed 09 Sep 2013
AppBrain (2013b) Number of available android applications. http://www.appbrain.com/stats/number-of-android-apps. Accessed 09 Sep 2013
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Canfora G, Mercaldo F, Visaggio CA (2013) A classifier of malicious android applications. In: 2013 eighth international conference on availability, reliability and security (ARES)
Cesare S, Xiang Y (2010) Classification of malware using structured control flow. In: Proceedings of the eighth Australasian symposium on parallel and distributed computing, vol 107. Australian Computer Society, Inc., Darlinghurst, Australia, AusPDC ’10, pp 61–70
Cohen WW (1995) Fast effective rule induction. In: Machine learning-international workshop then conference. Morgan Kaufmann Publishers, Inc., pp 115–123
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297. doi:10.1007/BF00994018
Demme J, Maycock M, Schmitz J, Tang A, Waksman A, Sethumadhavan S, Stolfo S (2013) On the feasibility of online malware detection with performance counters. In: Proceedings of the 40th annual international symposium on computer architecture. ACM, New York, ISCA ’13, pp 559–570. doi:10.1145/2485922.2485970
Enck W, Octeau D, McDaniel P, Chaudhuri S (2011) A study of android application security. In: Proceedings of the 20th USENIX conference on security. USENIX Association, Berkeley, SEC’11, pp 21–21. http://dl.acm.org/citation.cfm?id=2028067.2028088
Felt AP, Finifter M, Chin E, Hanna S, Wagner D (2011) A survey of mobile malware in the wild. In: Proceedings of the 1st ACM workshop on security and privacy in smartphones and mobile devices. ACM, New York, SPSM ’11, pp 3–14. doi:10.1145/2046614.2046618
Google (2012) Android and security (bouncer announcement). http://googlemobile.blogspot.fr/2012/02/android-and-security.html. Accessed 14 June 2014
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. SIGKDD Explor Newsl 11(1):10–18. doi:10.1145/1656274.1656278
He H, Garcia E (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284. doi:10.1109/TKDE.2008.239
Henchiri O, Japkowicz N (2006) A feature selection and evaluation scheme for computer virus detection. In: Proceedings of the sixth international conference on data mining. IEEE Computer Society, Washington, DC, ICDM ’06, pp 891–895. doi:10.1109/ICDM.2006.4
Jacob A, Gokhale M (2007) Language classification using n-grams accelerated by fpga-based bloom filters. In: Proceedings of the 1st international workshop on high-performance reconfigurable computing technology and applications: held in conjunction with SC07. Reno, Nevada, HPRCTA ’07, pp 31–37
Kephart JO (1994) A biologically inspired immune system for computers. In: Artificial life IV: proceedings of the fourth international workshop on the synthesis and simulation of living systems. MIT Press, pp 130–139
Kolter JZ, Maloof MA (2006) Learning to detect and classify malicious executables in the wild. J Mach Learn Res 7:2721–2744. http://dl.acm.org/citation.cfm?id=1248547.1248646
McLachlan G, Do KA, Ambroise C (2005) Analyzing microarray gene expression data, vol 422. Wiley.com
Perdisci R, Lanzi A, Lee W (2008a) Classification of packed executables for accurate computer virus detection. Pattern Recogn Lett 29(14):1941–1946. http://www.sciencedirect.com/science/article/pii/S0167865508002110
Perdisci R, Lanzi A, Lee W (2008b) Mcboost: boosting scalability in malware collection and analysis using statistical classification of executables. In: Computer security applications conference, 2008. ACSAC 2008. Annual, pp 301–310. doi:10.1109/ACSAC.2008.22
Pieterse H, Olivier M (2012) Android botnets on the rise: trends and characteristics. In: Information security for South Africa (ISSA), 2012, pp 1–5. doi:10.1109/ISSA.2012.6320432
Pouik G (2012) Similarities for fun & profit. Phrack 14(68). http://www.phrack.org/issues.html?id=15&issue=68
Quinlan JR (1993) C4.5: programs for machine learning, vol 1. Morgan Kaufmann
Rossow C, Dietrich C, Grier C, Kreibich C, Paxson V, Pohlmann N, Bos H, van Steen M (2012) Prudent practices for designing malware experiments: status quo and outlook. In: 2012 IEEE symposium on security and privacy (SP), pp 65–79. doi:10.1109/SP.2012.14
Sahs J, Khan L (2012) A machine learning approach to android malware detection. In: 2012 European intelligence and security informatics conference (EISIC). IEEE, pp 141–147. doi:10.1109/EISIC.2012.34
Santos I, Penya YK, Devesa J, Bringas PG (2009) N-grams-based file signatures for malware detection. In: ICEIS, pp 317–320
Schultz M, Eskin E, Zadok E, Stolfo S (2001) Data mining methods for detection of new malicious executables. In: Proceedings 2001 IEEE symposium on security and privacy, 2001. S P 2001, pp 38–49. doi:10.1109/SECPRI.2001.924286
Tahan G, Rokach L, Shahar Y (2012) Mal-id: automatic malware detection using common segment analysis and meta-features. J Mach Learn Res 13:949–979
Van Hulse J, Khoshgoftaar TM, Napolitano A (2007) Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th international conference on machine learning. ACM, New York, ICML ’07, pp 935–942. doi:10.1145/1273496.1273614
Wu DJ, Mao CH, Wei TE, Lee HM, Wu KP (2012) Droidmat: Android malware detection through manifest and api calls tracing. In: 2012 seventh Asia joint conference on information security (Asia JCIS), pp 62–69. doi:10.1109/AsiaJCIS.2012.18
Yerima S, Sezer S, McWilliams G, Muttik I (2013) A new android malware detection approach using bayesian classification. In: 2013 IEEE 27th international conference on advanced information networking and applications (AINA), pp 121–128. doi:10.1109/AINA.2013.88
Zhang B, Yin J, Hao J, Zhang D, Wang S (2007) Malicious codes detection based on ensemble learning. In: Proceedings of the 4th international conference on autonomic and trusted computing. Springer, Berlin, Heidelberg, ATC’07, pp 468–477
Zhou Y, Jiang X (2012) Dissecting android malware: characterization and evolution. In: Proceedings of the 2012 IEEE symposium on security and privacy. IEEE Computer Society, Washington, DC, SP ’12, pp 95–109. doi:10.1109/SP.2012.16

Acknowledgments

We would like to thank VirusTotal for providing us the ability to leverage their infrastructure and detection report databases to build a reference classification as described in Section 3.2.

Author information

Authors and Affiliations

Interdisciplinary Center for Security, Reliability and Trust, University of Luxembourg, 4 rue Alphonse Weicker, 2721, Luxembourg, Luxembourg
Kevin Allix, Tegawendé F. Bissyandé, Quentin Jérome, Jacques Klein, Radu State & Yves Le Traon

Corresponding author

Correspondence to Kevin Allix.

Additional information

Communicated by: Jeffrey C. Carver

Appendix

Table 1 Recent research in Machine Learning-based Android Malware Detection

About this article

Cite this article

Allix, K., Bissyandé, T.F., Jérome, Q. et al. Empirical assessment of machine learning-based malware detectors for Android. Empir Software Eng 21, 183–211 (2016). https://doi.org/10.1007/s10664-014-9352-6

Published: 24 December 2014
Issue Date: February 2016
DOI: https://doi.org/10.1007/s10664-014-9352-6
