Automated Identification of Sensitive Data via Flexible User Requirements

Yang, Ziqi; Liang, Zhenkai

doi:10.1007/978-3-030-01701-9_9


Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering ((LNICST,volume 254))

Included in the following conference series:

International Conference on Security and Privacy in Communication Systems

Abstract

Protecting sensitive data in web and mobile applications requires first identifying that data, which typically demands intensive manual effort. In addition, what counts as sensitive data depends on users’ requirements and the application context. Existing research on identifying sensitive data from its descriptive text focuses on keyword and phrase searching. These approaches can produce many false positives and false negatives because they do not consider the semantics of the descriptions. In this paper, we propose S3, an automated approach to identifying sensitive data based on user requirements. It considers semantic, syntactic, and lexical information comprehensively, aiming to identify sensitive data by the semantics of its descriptive text. We introduce the notion of a concept space to represent the user’s notion of privacy, through which our approach supports flexible user requirements in defining sensitive data. Our approach learns users’ preferences from readable concepts initially provided by users and automatically identifies related sensitive data. We evaluate our approach on over 18,000 top popular applications from the Google Play Store. S3 achieves an average precision of 89.2% and an average recall of 95.8% in identifying sensitive data.
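
To make the idea of matching descriptive text against user-provided concepts more concrete, the following is a minimal sketch and not the S3 implementation. It assumes a tiny hand-made table of word vectors (TOY_VECTORS), a cosine-similarity threshold of 0.9, and the helper names is_sensitive and precision_recall, all of which are hypothetical and chosen only for illustration. It also shows how precision and recall, the metrics reported above, are computed from a set of flagged items and a ground-truth set.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real system would use learned embeddings; these values are assumptions.
TOY_VECTORS = {
    "password": (0.9, 0.1, 0.0),
    "passcode": (0.8, 0.2, 0.1),
    "salary": (0.1, 0.9, 0.1),
    "income": (0.2, 0.8, 0.0),
    "weather": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_sensitive(label, concepts, threshold=0.9):
    """Flag a descriptive label if any known word in it is close to a user concept."""
    words = [w for w in label.lower().split() if w in TOY_VECTORS]
    return any(
        cosine(TOY_VECTORS[w], TOY_VECTORS[c]) >= threshold
        for w in words
        for c in concepts
        if c in TOY_VECTORS
    )


def precision_recall(predicted, actual):
    """Precision = TP / |predicted|, recall = TP / |actual|, over two sets."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical user concepts and UI labels.
concepts = {"password", "salary"}
labels = ["Enter passcode", "Monthly income", "Local weather"]
flagged = {label for label in labels if is_sensitive(label, concepts)}
ground_truth = {"Enter passcode", "Monthly income"}
precision, recall = precision_recall(flagged, ground_truth)
```

Here the labels containing "passcode" and "income" are flagged even though neither word was given as a concept, which is the kind of case a plain keyword search would miss. The threshold and vectors are arbitrary; in practice both would need to be tuned and learned from data.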

Notes

1. S3 stands for semantics, syntax, and sentiment.

Acknowledgment

This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore, under its National Cybersecurity R&D Programme (Grant No. NRF2015NCR-NCR002-001).

Author information

Authors and Affiliations

National University of Singapore, Singapore, Singapore
Ziqi Yang & Zhenkai Liang

Corresponding author

Correspondence to Ziqi Yang.

Editor information

Editors and Affiliations

Klaus Advanced Computing Building, Georgia Institute of Technology, Atlanta, GA, USA
Raheem Beyah
Singapore Management University, Singapore, Singapore
Bing Chang
School of Information Systems, Singapore Management University, Singapore, Singapore
Yingjiu Li
Pennsylvania State University, University Park, PA, USA
Sencun Zhu

About this paper

Cite this paper

Yang, Z., Liang, Z. (2018). Automated Identification of Sensitive Data via Flexible User Requirements. In: Beyah, R., Chang, B., Li, Y., Zhu, S. (eds) Security and Privacy in Communication Networks. SecureComm 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 254. Springer, Cham. https://doi.org/10.1007/978-3-030-01701-9_9

DOI: https://doi.org/10.1007/978-3-030-01701-9_9
Published: 29 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01700-2
Online ISBN: 978-3-030-01701-9
eBook Packages: Computer Science, Computer Science (R0)
