DOI: 10.1109/ICDM.2005.22
Article

An Empirical Bayes Approach to Detect Anomalies in Dynamic Multidimensional Arrays

Published: 27 November 2005
    Abstract

    We consider the problem of detecting anomalies in data that arise as multidimensional arrays, with each dimension corresponding to the levels of a categorical variable. In typical data mining applications, the number of cells in such arrays is usually large. Our primary focus is detecting anomalies by comparing information at the current time to historical data. Naive approaches advocated in the process control literature do not work well in this scenario because of the multiple testing problem: performing many statistical tests on the same data produces an excessive number of false positives. We use an Empirical Bayes method that works by fitting a two-component Gaussian mixture to the deviations at the current time. The approach scales to problems that involve monitoring a massive number of cells and is fast enough to be potentially useful in many streaming scenarios. We show through simulation that the method is superior to a naive "per component error rate" procedure. A novel feature of our technique is its ability to suppress deviations that are merely the consequence of sharp changes in the marginal distributions. This research was motivated by the need to extract critical application information and business intelligence from the daily logs that accompany large-scale spoken dialog systems deployed by AT&T. We illustrate our method on one such system.
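
The contrast the abstract draws between per-cell testing and a mixture fitted to all deviations at once can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's model: the mixture form p * N(0, 1) + (1 - p) * N(0, s^2), the simulated data, the 1.96 cutoff, the 0.5 posterior threshold, and every function name are hypothetical choices made only for this example.

```python
import math
import random


def normal_pdf(x, sd):
    """Density of a zero-mean Gaussian with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_anomaly(d, p, s):
    """Posterior probability that deviation d came from the wide component."""
    wide = (1.0 - p) * normal_pdf(d, s)
    null = p * normal_pdf(d, 1.0)
    return wide / (wide + null)


def fit_two_component(devs, n_iter=100):
    """EM for the assumed mixture p * N(0, 1) + (1 - p) * N(0, s^2)."""
    p, s = 0.9, 3.0  # starting guesses: mostly null cells, wide anomaly spread
    for _ in range(n_iter):
        # E-step: responsibility of the wide component for each deviation.
        post = [posterior_anomaly(d, p, s) for d in devs]
        total = sum(post)
        # M-step: update the null weight and the anomaly spread (kept >= 1).
        p = 1.0 - total / len(devs)
        second_moment = sum(w * d * d for w, d in zip(post, devs))
        s = max(1.0, math.sqrt(second_moment / max(total, 1e-12)))
    return p, s


# Simulated standardized deviations (assumption): 1% of cells are anomalous.
rng = random.Random(0)
n_cells = 5000
is_anomaly = [rng.random() < 0.01 for _ in range(n_cells)]
devs = [rng.gauss(0.0, 4.0 if a else 1.0) for a in is_anomaly]

# Naive per-cell rule: flag every cell whose deviation exceeds a 5% cutoff.
naive_flags = [abs(d) > 1.96 for d in devs]

# Empirical Bayes rule: flag cells whose posterior anomaly probability > 0.5.
p_hat, s_hat = fit_two_component(devs)
eb_flags = [posterior_anomaly(d, p_hat, s_hat) > 0.5 for d in devs]

# Count false positives: flagged cells that were simulated as normal.
naive_fp = sum(f and not a for f, a in zip(naive_flags, is_anomaly))
eb_fp = sum(f and not a for f, a in zip(eb_flags, is_anomaly))
print("estimated null weight:", round(p_hat, 3), "anomaly sd:", round(s_hat, 2))
print("false positives, naive:", naive_fp, "empirical Bayes:", eb_fp)
```

Because the mixture weight is estimated from all cells together, the posterior threshold adapts to how rare anomalies are, whereas the fixed per-cell cutoff flags roughly the same fraction of normal cells no matter how many cells are monitored.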

    Published In

    ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining
    November 2005
    837 pages
    ISBN: 0769522785

    Publisher

    IEEE Computer Society

    United States

    Cited By

    • (2022) STAD-GAN: Unsupervised Anomaly Detection on Multivariate Time Series with Self-training Generative Adversarial Networks. ACM Transactions on Knowledge Discovery from Data, 17:5 (1-18). DOI: 10.1145/3572780. Online publication date: 23-Nov-2022
    • (2019) Method of Fraudster Fingerprint Formation During Mobile Application Installations. 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS) (1099-1103). DOI: 10.1109/IDAACS.2019.8924369. Online publication date: 18-Sep-2019
    • (2018) Data streams anomaly detection algorithm based on self-set threshold. Proceedings of the 4th International Conference on Communication and Information Processing (18-26). DOI: 10.1145/3290420.3290451. Online publication date: 2-Nov-2018
    • (2016) Change detection in multivariate datastreams. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (1368-1374). DOI: 10.5555/3060621.3060811. Online publication date: 9-Jul-2016
    • (2015) Temporal Multi-View Inconsistency Detection for Network Traffic Analysis. Proceedings of the 24th International Conference on World Wide Web (455-465). DOI: 10.1145/2740908.2745399. Online publication date: 18-May-2015
    • (2014) Research issues in outlier detection for data streams. ACM SIGKDD Explorations Newsletter, 15:1 (33-40). DOI: 10.1145/2594473.2594479. Online publication date: 17-Mar-2014
    • (2013) SVDD-based outlier detection on uncertain data. Knowledge and Information Systems, 34:3 (597-618). DOI: 10.1007/s10115-012-0484-y. Online publication date: 1-Mar-2013
    • (2011) Active learning and subspace clustering for anomaly detection. Intelligent Data Analysis, 15:2 (151-171). DOI: 10.5555/1971751.1971755. Online publication date: 1-Apr-2011
    • (2011) Online outlier detection for data streams. Proceedings of the 15th Symposium on International Database Engineering & Applications (88-96). DOI: 10.1145/2076623.2076635. Online publication date: 21-Sep-2011
    • (2009) Anomaly detection. ACM Computing Surveys, 41:3 (1-58). DOI: 10.1145/1541880.1541882. Online publication date: 30-Jul-2009