DOI: 10.1145/3366423.3380213

Modeling Heterogeneous Statistical Patterns in High-dimensional Data by Adversarial Distributions: An Unsupervised Generative Framework

Published: 20 April 2020

Abstract

Because collecting labels is prohibitively expensive and time-consuming, unsupervised methods are preferred in applications such as fraud detection. Such applications also require modeling the intrinsic clusters in high-dimensional data, which often exhibit heterogeneous statistical patterns: the patterns of different clusters may appear in different dimensions. Existing methods model the data clusters on selected dimensions, yet globally omitting any dimension may damage the pattern of certain clusters. To address these issues, we propose FIRD, a novel unsupervised generative framework that uses adversarial distributions to fit and disentangle the heterogeneous statistical patterns. When applied to discrete spaces, FIRD effectively distinguishes synchronized fraudsters from normal users. FIRD also outperforms state-of-the-art anomaly detection methods on anomaly detection datasets (over 5% average AUC improvement). Experiments on various datasets verify that the proposed method better models the heterogeneous statistical patterns in high-dimensional data and benefits downstream applications.
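The abstract stays at a high level and does not give FIRD's formulation. As a rough, purely illustrative sketch of the setting it describes — clusters whose informative patterns live in different subsets of dimensions, so that globally dropping any dimension would damage some cluster — the following toy EM fits a plain Bernoulli mixture (not FIRD itself, and without its adversarial component) to such data without labels. All names, sizes, and parameters here are hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data with heterogeneous patterns: cluster 0 is structured in
# dimensions 0-4, cluster 1 in dimensions 5-9; remaining dimensions are noise.
n, d, K = 400, 12, 2
z_true = rng.integers(0, K, n)
X = (rng.random((n, d)) < 0.5).astype(float)
X[z_true == 0, :5] = (rng.random(((z_true == 0).sum(), 5)) < 0.9)
X[z_true == 1, 5:10] = (rng.random(((z_true == 1).sum(), 5)) < 0.9)

pi = np.full(K, 1.0 / K)               # mixing weights
theta = rng.uniform(0.3, 0.7, (K, d))  # per-cluster Bernoulli parameters

for _ in range(50):
    # E-step: responsibilities r[i, k] proportional to pi_k * prod_d Bern(x_id; theta_kd)
    log_r = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log1p(-theta).T
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights and Bernoulli parameters
    nk = r.sum(axis=0) + 1e-9
    pi = nk / nk.sum()
    theta = np.clip((r.T @ X) / nk[:, None], 1e-3, 1 - 1e-3)

z_hat = r.argmax(axis=1)
# Accuracy up to label permutation (cluster IDs are arbitrary)
acc = max((z_hat == z_true).mean(), (z_hat != z_true).mean())
print(round(acc, 2))
```

A vanilla mixture like this fits every dimension for every cluster; FIRD's contribution, per the abstract, is to additionally disentangle which dimensions carry each cluster's pattern via adversarial distributions, which this sketch does not attempt.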


Cited By

  • (2024) Collaborative Fraud Detection on Large Scale Graph Using Secure Multi-Party Computation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 1473–1482. https://doi.org/10.1145/3627673.3679863. Online publication date: 21-Oct-2024.

        Published In

        WWW '20: Proceedings of The Web Conference 2020
        April 2020
        3143 pages
        ISBN:9781450370233
        DOI:10.1145/3366423

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. adversarial distributions
        2. heterogeneous statistical patterns
        3. high-dimensional data
        4. prior knowledge
        5. unsupervised learning

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        WWW '20: The Web Conference 2020
        April 20 - 24, 2020
        Taipei, Taiwan

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
