Distributed Outlier Detection using Compressive Sensing

Authors:

Thomas Moscibroda

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 3 - 16

https://doi.org/10.1145/2723372.2747641

Published: 27 May 2015

Abstract

Computing outliers and related statistical aggregation functions from large-scale data sources is a critical operation in many cloud computing scenarios, e.g., service quality assurance, fraud detection, or novelty discovery. Such problems commonly have to be solved in a distributed environment where each node holds only a local slice of the data. To process a query on the global data, each node must transmit its local slice, or an aggregated subset of it, to a global aggregator node, which then computes the desired statistical aggregation function. In this context, reducing the total communication cost is often critical to overall efficiency.

In this paper, we show both theoretically and empirically that these communication costs can be significantly reduced for common distributed computing problems if we take advantage of the fact that production-level big data usually exhibits a form of sparse structure. Specifically, we devise a new aggregation paradigm for outlier detection and related queries. The paradigm leverages compressive sensing for data sketching in combination with outlier detection techniques. We further propose an algorithm that works even for non-sparse data that concentrates around an unknown value. In both cases, we show that the communication cost is reduced to the logarithm of the global data size. We incorporate our approach into Hadoop and evaluate it on real web-scale production data (distributed click-data logs). Our approach reduces data shuffling IO by up to 99%, and end-to-end job duration by up to 40% on many actual production queries.
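
The aggregation idea described above can be illustrated with a minimal sketch. It assumes that each node holds a length-N vector of per-key values, that all nodes rebuild the same random M x N measurement matrix from a shared seed, and that the global sum is sparse. Because the sketch is linear, the aggregator can add the M-value sketches from the nodes and recover the outliers from the sum. Recovery here uses plain orthogonal matching pursuit as a stand-in; the function names, sizes, seed, and thresholds are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def measurement_matrix(m, n, seed):
    """Random Gaussian M x N matrix; every node rebuilds it from the shared seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


def sketch(phi, x):
    """Compress a length-N vector into M linear measurements y = phi @ x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def least_squares(cols, y):
    """Solve min ||A c - y|| via the normal equations (A has the given columns)."""
    k = len(cols)
    # Augmented system [A^T A | A^T y], reduced by Gaussian elimination.
    g = [[dot(cols[i], cols[j]) for j in range(k)] + [dot(cols[i], y)]
         for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(g[r][i]))
        g[i], g[pivot] = g[pivot], g[i]
        for r in range(i + 1, k):
            f = g[r][i] / g[i][i]
            for c in range(i, k + 1):
                g[r][c] -= f * g[i][c]
    coef = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(g[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (g[i][k] - tail) / g[i][i]
    return coef


def omp(phi, y, max_iter, tol=1e-6):
    """Orthogonal matching pursuit: greedily pick columns that explain y."""
    m, n = len(phi), len(phi[0])
    cols = [[phi[i][j] for i in range(m)] for j in range(n)]
    norms = [math.sqrt(dot(c, c)) for c in cols]
    support, coef, residual = [], [], list(y)
    for _ in range(max_iter):
        if math.sqrt(dot(residual, residual)) < tol:
            break  # the chosen columns already explain the sketch
        candidates = [j for j in range(n) if j not in support]
        best = max(candidates,
                   key=lambda j: abs(dot(cols[j], residual)) / norms[j])
        support.append(best)
        chosen = [cols[j] for j in support]
        coef = least_squares(chosen, y)
        residual = [y[i] - sum(c * col[i] for c, col in zip(coef, chosen))
                    for i in range(m)]
    x = [0.0] * n
    for j, c in zip(support, coef):
        x[j] = c
    return x


N, M = 200, 40  # assumed sizes: N keys, M measurements per sketch
phi = measurement_matrix(M, N, seed=7)

# Three nodes; their slices sum to a global vector with two outlier keys (5, 22).
node_a, node_b, node_c = [0.0] * N, [0.0] * N, [0.0] * N
node_a[5], node_b[5] = 30.0, 20.0
node_b[22], node_c[22] = -15.0, -25.0

# Each node ships M numbers instead of N; the aggregator adds the sketches.
sketches = [sketch(phi, x) for x in (node_a, node_b, node_c)]
global_sketch = [sum(vals) for vals in zip(*sketches)]

recovered = omp(phi, global_sketch, max_iter=10)
outliers = {j: round(v, 6) for j, v in enumerate(recovered) if abs(v) > 1e-3}
print(outliers)
```

With these assumed sizes each node sends 40 numbers rather than 200. In standard compressive sensing results, the number of measurements needed to recover a K-sparse vector grows roughly with K log N rather than N, which is the intuition behind the logarithmic communication cost claimed in the abstract.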

Index Terms

Distributed Outlier Detection using Compressive Sensing
1. Information systems
  1. Information systems applications
  2. World Wide Web
    1. Web applications
      1. Internet communications tools

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN:9781450327589

DOI:10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)

Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Article Metrics

Total Citations: 8
Total Downloads: 1,514
Downloads (Last 12 months): 15
Downloads (Last 6 weeks): 3

Reflects downloads up to 11 Feb 2025

Cited By

Moon A, Zhuo X, Zhang J, Son S, Jeong Song Y (2020). Anomaly Detection in Edge Nodes using Sparsity Profile. 2020 IEEE International Conference on Big Data (Big Data), pages 1236-1245. Online publication date: 10-Dec-2020.
https://doi.org/10.1109/BigData50022.2020.9377757

Toliopoulos T, Gounaris A, Tsichlas K, Papadopoulos A, Sampaio S (2020). Continuous outlier mining of streaming data in flink. Information Systems, 93 (101569). Online publication date: Nov-2020.
https://doi.org/10.1016/j.is.2020.101569

Paganelli M, Sottovia P, Maccioni A, Interlandi M, Guerra F (2020). Explaining data with descriptions. Information Systems, 92 (101549). Online publication date: Sep-2020.
https://doi.org/10.1016/j.is.2020.101549

Toliopoulos T, Gounaris A, Tsichlas K, Papadopoulos A, Sampaio S (2018). Parallel Continuous Outlier Mining in Streaming Data. 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 227-236. Online publication date: Oct-2018.
https://doi.org/10.1109/DSAA.2018.00033

Chen J, Zhang Q (2017). Bias-aware sketches. Proceedings of the VLDB Endowment, 10:9, pages 961-972. Online publication date: 1-May-2017.
https://dl.acm.org/doi/10.14778/3099622.3099627

Kulkarni A, Shea C, Abtahi T, Homayoun H, Mohsenin T (2017). Low Overhead CS-Based Heterogeneous Framework for Big Data Acceleration. ACM Transactions on Embedded Computing Systems, 17:1, pages 1-25. Online publication date: 6-Dec-2017.
https://dl.acm.org/doi/10.1145/3092944

Kulkarni A, Mohsenin T (2017). Low Overhead Architectures for OMP Compressive Sensing Reconstruction Algorithm. IEEE Transactions on Circuits and Systems I: Regular Papers, 64:6, pages 1468-1480. Online publication date: Jun-2017.
https://doi.org/10.1109/TCSI.2017.2648854

Wang X, Shen D, Bai M, Nie T, Kou Y, Yu G (2015). An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets. Journal of Computer Science and Technology, 30:6, pages 1233-1248. Online publication date: 20-Nov-2015.
https://doi.org/10.1007/s11390-015-1596-0
