WuKong: a scalable and accurate two-phase approach to Android app clone detection

research-article
DOI: 10.1145/2771783.2771795
Published: 13 July 2015

Abstract

Repackaged Android applications (app clones) have been found in many third-party markets, which not only compromise the copyright of original authors, but also pose threats to security and privacy of mobile users. Both fine-grained and coarse-grained approaches have been proposed to detect app clones. However, fine-grained techniques employing complicated clone detection algorithms are difficult to scale to hundreds of thousands of apps, while coarse-grained techniques based on simple features are scalable but less accurate. This paper proposes WuKong, a two-phase detection approach that includes a coarse-grained detection phase to identify suspicious apps by comparing light-weight static semantic features, and a fine-grained phase to compare more detailed features for only those apps found in the first phase. To further improve the detection speed and accuracy, we also introduce an automated clustering-based preprocessing step to filter third-party libraries before conducting app clone detection. Experiments on more than 100,000 Android apps collected from five Android markets demonstrate the effectiveness and scalability of our approach.
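The two-phase idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not specify WuKong's features, thresholds, or clustering method: apps are modeled as hypothetical `{package: {api_name: call_count}}` dictionaries, the library filter is a crude frequency cutoff standing in for the clustering-based step, phase 1 uses a normalized Manhattan distance over API call counts, and phase 2 uses Jaccard similarity over per-package call signatures.

```python
from collections import Counter
from itertools import combinations


def filter_libraries(apps, min_apps=3):
    """Drop packages seen in at least `min_apps` apps (assumed stand-in
    for the paper's clustering-based library filtering)."""
    seen = Counter(pkg for packages in apps.values() for pkg in packages)
    return {
        name: {pkg: calls for pkg, calls in packages.items() if seen[pkg] < min_apps}
        for name, packages in apps.items()
    }


def coarse_vector(packages):
    """Phase-1 feature (hypothetical): API call counts summed over packages."""
    total = Counter()
    for calls in packages.values():
        total.update(calls)
    return total


def coarse_distance(a, b):
    """Normalized Manhattan distance between two Counters, in [0, 1]."""
    diff = sum(abs(a[k] - b[k]) for k in a.keys() | b.keys())
    size = sum(a.values()) + sum(b.values())
    return diff / size if size else 0.0


def fine_similarity(pkgs_a, pkgs_b):
    """Phase-2 feature (hypothetical): Jaccard similarity over per-package
    call signatures, ignoring package names so renaming does not matter."""
    sig_a = {frozenset(calls.items()) for calls in pkgs_a.values()}
    sig_b = {frozenset(calls.items()) for calls in pkgs_b.values()}
    union = sig_a | sig_b
    return len(sig_a & sig_b) / len(union) if union else 0.0


def detect_clones(apps, coarse_max=0.2, fine_min=0.8, min_apps=3):
    """Return app-name pairs that pass both phases (thresholds are assumed)."""
    apps = filter_libraries(apps, min_apps)
    vectors = {name: coarse_vector(packages) for name, packages in apps.items()}
    clones = []
    for x, y in combinations(sorted(apps), 2):
        if coarse_distance(vectors[x], vectors[y]) > coarse_max:
            continue  # phase 1: cheap rejection of clearly different apps
        if fine_similarity(apps[x], apps[y]) >= fine_min:
            clones.append((x, y))  # phase 2: detailed check on suspects only
    return clones
```

The design point this sketch tries to capture is that the expensive comparison in `fine_similarity` runs only on pairs that survive the cheap `coarse_distance` check, and that shared library code is removed first so it cannot make unrelated apps look alike.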

Published In

ISSTA 2015: Proceedings of the 2015 International Symposium on Software Testing and Analysis
July 2015
447 pages
ISBN: 9781450336208
DOI: 10.1145/2771783
General Chair: Michal Young
Program Chair: Tao Xie

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Android
  2. Clone detection
  3. mobile applications
  4. repackaging
  5. third-party library

Qualifiers

  • Research-article

Conference

ISSTA '15

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%
