Scorpion: explaining away outliers in aggregate queries

Published: 01 June 2013

Abstract

    Database users commonly explore large data sets by running aggregate queries that project the data down to a smaller number of points and dimensions, and visualizing the results. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. Unfortunately, databases and visualization systems do not provide a way to work backwards from an outlier point to the common properties of the (possibly many) unaggregated input tuples that correspond to that outlier. We propose Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results. Specifically, this explanation identifies predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, we develop a notion of influence of a predicate on a given output, and design several algorithms that efficiently search for maximum influence predicates over the input data. We show that these algorithms can quickly find outliers in two real data sets (from a sensor deployment and a campaign finance data set), and run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.
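
To make the idea of a predicate's influence concrete, here is a minimal Python sketch. It is not the paper's definition or any of its search algorithms: it assumes influence is simply the drop in an average when the tuples matching a predicate are removed, and it naively enumerates single-attribute equality predicates. The sample rows, the `influence` and `best_equality_predicate` helpers, and the column names are all hypothetical.

```python
from statistics import mean

# Hypothetical sensor readings; the hour-12 group contains one faulty value.
ROWS = [
    {"sensor": "s1", "hour": 11, "temp": 20.0},
    {"sensor": "s2", "hour": 11, "temp": 21.0},
    {"sensor": "s3", "hour": 11, "temp": 19.0},
    {"sensor": "s1", "hour": 12, "temp": 20.0},
    {"sensor": "s2", "hour": 12, "temp": 22.0},
    {"sensor": "s3", "hour": 12, "temp": 100.0},
]


def influence(rows, predicate, agg=mean, value="temp"):
    """Assumed score: how far the aggregate drops once matching rows are removed."""
    kept = [r for r in rows if not predicate(r)]
    # A predicate that removes nothing, or everything, explains nothing.
    if not kept or len(kept) == len(rows):
        return 0.0
    before = agg(r[value] for r in rows)
    after = agg(r[value] for r in kept)
    return before - after


def best_equality_predicate(rows, attributes, value="temp"):
    """Naive search: score every `attribute == constant` predicate, keep the best."""
    candidates = sorted({(a, r[a]) for r in rows for a in attributes})
    scored = [
        (influence(rows, lambda r, a=a, v=v: r[a] == v, value=value), a, v)
        for a, v in candidates
    ]
    return max(scored)


# The user flags the hour-12 average as an outlier (too high).
outlier_group = [r for r in ROWS if r["hour"] == 12]
score, attr, const = best_equality_predicate(outlier_group, ["sensor"])
print(f"{attr} == {const!r} lowers the average by {score:.2f}")
```

In this toy data the predicate on sensor `s3` scores highest, because removing its reading brings the hour-12 average back in line with hour 11. Exhaustive enumeration like this grows quickly with the number of attributes and values, which is the cost the abstract says the proposed algorithms avoid.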

      Published In

      Proceedings of the VLDB Endowment, Volume 6, Issue 8
      June 2013
      60 pages

      Publisher

      VLDB Endowment
