DOI: 10.1145/2723372.2747641

Distributed Outlier Detection using Compressive Sensing

Published: 27 May 2015

Abstract

Computing outliers and related statistical aggregation functions from large-scale data sources is a critical operation in many cloud computing scenarios, e.g., service quality assurance, fraud detection, or novelty discovery. Such problems commonly have to be solved in a distributed environment where each node holds only a local slice of the data. To process a query on the global data, each node must transmit its local slice, or an aggregated subset of it, to a global aggregator node, which can then compute the desired statistical aggregation function. In this context, reducing the total communication cost is often critical to overall efficiency.
In this paper, we show both theoretically and empirically that these communication costs can be significantly reduced for common distributed computing problems if we take advantage of the fact that production-level big data usually exhibits a form of sparse structure. Specifically, we devise a new aggregation paradigm for outlier detection and related queries. The paradigm leverages compressive sensing for data sketching in combination with outlier detection techniques. We further propose an algorithm that works even for non-sparse data that concentrates around an unknown value. In both cases, we show that the communication cost is reduced to the logarithm of the global data size. We incorporate our approach into Hadoop and evaluate it on real web-scale production data (distributed click-data logs). Our approach reduces data shuffling IO by up to 99%, and end-to-end job duration by up to 40% on many actual production queries.
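
The sparse case described above can be illustrated with a small sketch. This is not the paper's implementation. It assumes a shared Gaussian measurement matrix and uses orthogonal matching pursuit (OMP) as the recovery step, neither of which the abstract specifies. The names `omp`, `Phi`, `local_slices`, and all sizes are hypothetical, and the data is synthetic. The point it shows is linearity: each node sends an m-value sketch instead of its n-value slice, and the sum of the sketches equals the sketch of the global aggregate.

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal matching pursuit: greedily pick k columns of Phi to explain y."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(Phi.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick a column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the chosen columns, then update the residual.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat

# Illustrative sizes (assumptions): n keys, m measurements, k outliers.
rng = np.random.default_rng(0)
n, m, k, num_nodes = 2000, 200, 5, 4

# Synthetic global aggregate: zero everywhere except k outlier keys.
support = rng.choice(n, size=k, replace=False)
x_global = np.zeros(n)
x_global[support] = rng.choice([-1.0, 1.0], size=k) * rng.uniform(50, 100, size=k)

# Dense local slices whose sum is x_global: the per-node noise cancels
# across nodes, so only the global aggregate is sparse.
noise = rng.normal(size=(num_nodes, n))
noise -= noise.mean(axis=0)
local_slices = x_global / num_nodes + noise

# Every node uses the same measurement matrix (e.g. derived from a shared seed).
Phi = rng.normal(size=(m, n)) / np.sqrt(m)

# Each node transmits m numbers instead of n.
sketches = [Phi @ x_local for x_local in local_slices]

# The aggregator adds the sketches; by linearity this is Phi @ x_global.
y = np.sum(sketches, axis=0)

# Recover the sparse aggregate and read off the outlier keys.
x_hat = omp(Phi, y, k)
outliers = np.flatnonzero(x_hat)
```

In this toy setup each node ships 200 values rather than 2000. The sketch covers only the sparse case, not the algorithm for data concentrated around an unknown value.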

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big sparse data
  2. compressive sensing
  3. distributed aggregation
  4. outlier detection

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 15
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 11 Feb 2025

Cited By

  • (2020) Anomaly Detection in Edge Nodes using Sparsity Profile. 2020 IEEE International Conference on Big Data (Big Data), pages 1236-1245. DOI: 10.1109/BigData50022.2020.9377757. Online publication date: 10-Dec-2020
  • (2020) Continuous outlier mining of streaming data in flink. Information Systems 93, article 101569. DOI: 10.1016/j.is.2020.101569. Online publication date: Nov-2020
  • (2020) Explaining data with descriptions. Information Systems 92, article 101549. DOI: 10.1016/j.is.2020.101549. Online publication date: Sep-2020
  • (2018) Parallel Continuous Outlier Mining in Streaming Data. 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 227-236. DOI: 10.1109/DSAA.2018.00033. Online publication date: Oct-2018
  • (2017) Bias-aware sketches. Proceedings of the VLDB Endowment 10:9, pages 961-972. DOI: 10.14778/3099622.3099627. Online publication date: 1-May-2017
  • (2017) Low Overhead CS-Based Heterogeneous Framework for Big Data Acceleration. ACM Transactions on Embedded Computing Systems 17:1, pages 1-25. DOI: 10.1145/3092944. Online publication date: 6-Dec-2017
  • (2017) Low Overhead Architectures for OMP Compressive Sensing Reconstruction Algorithm. IEEE Transactions on Circuits and Systems I: Regular Papers 64:6, pages 1468-1480. DOI: 10.1109/TCSI.2017.2648854. Online publication date: Jun-2017
  • (2015) An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets. Journal of Computer Science and Technology 30:6, pages 1233-1248. DOI: 10.1007/s11390-015-1596-0. Online publication date: 20-Nov-2015
