DIFF: a relational interface for large-scale data explanation

Abuzaid, Firas; Kraft, Peter; Suri, Sahaana; Gan, Edward; Xu, Eric; Shenoy, Atul; Ananthanarayan, Asvin; Sheu, John; Meijer, Erik; Wu, Xi; Naughton, Jeff; Bailis, Peter; Zaharia, Matei

doi:10.1007/s00778-020-00633-6

Special Issue Paper
Published: 30 September 2020

The VLDB Journal, volume 30, pages 45–70 (2021)

Firas Abuzaid ORCID: orcid.org/0000-0002-1424-4554¹,
Peter Kraft¹,
Sahaana Suri¹,
Edward Gan¹,
Eric Xu¹,
Atul Shenoy²,
Asvin Ananthanarayan²,
John Sheu²,
Erik Meijer³,
Xi Wu⁴,
Jeff Naughton⁴,
Peter Bailis¹ &
Matei Zaharia¹

Abstract

A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root-cause analysis, today’s explanation engines are designed as stand-alone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.
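To make the operator's semantics concrete, here is a minimal Python sketch of what a DIFF-style comparison computes: given a test relation and a control relation, it enumerates attribute-value combinations and keeps those that are both frequent in the test relation and over-represented relative to the control. The function name `diff`, the column names, and the exact metric definitions (support as the fraction of test rows matching an explanation, risk ratio as the rate of test membership among matching rows divided by the rate among non-matching rows) are assumptions made for illustration; this is not the MB SQL implementation and it applies none of the optimizations the paper describes.

```python
from collections import Counter
from itertools import combinations


def diff(test, control, dims, min_risk_ratio, min_support, max_order=2):
    """Return (explanation, support, risk_ratio) tuples, highest ratio first."""

    def count(rows):
        # Count every combination of up to max_order attribute=value pairs.
        counts = Counter()
        for row in rows:
            items = sorted((d, row[d]) for d in dims)
            for k in range(1, max_order + 1):
                for combo in combinations(items, k):
                    counts[combo] += 1
        return counts

    t_counts, c_counts = count(test), count(control)
    n_t, n_c = len(test), len(control)
    results = []
    for combo, a_t in t_counts.items():
        a_c = c_counts.get(combo, 0)       # matching rows in control
        b_t, b_c = n_t - a_t, n_c - a_c    # non-matching rows
        support = a_t / n_t
        exposed = a_t / (a_t + a_c)
        unexposed = b_t / (b_t + b_c) if (b_t + b_c) else 0.0
        risk_ratio = exposed / unexposed if unexposed > 0 else float("inf")
        if support >= min_support and risk_ratio >= min_risk_ratio:
            results.append((dict(combo), support, risk_ratio))
    return sorted(results, key=lambda r: (-r[2], -r[1]))


# Hypothetical toy data: crash logs (test) versus successful sessions (control).
crashes = [{"app_version": "v2", "device": "A"}] * 3 + [
    {"app_version": "v1", "device": "B"}
]
successes = [{"app_version": "v1", "device": "A"}] * 3 + [
    {"app_version": "v2", "device": "B"}
]

explanations = diff(crashes, successes, ["app_version", "device"],
                    min_risk_ratio=2.0, min_support=0.5)
for attrs, support, risk_ratio in explanations:
    print(attrs, round(support, 2), round(risk_ratio, 2))
```

On this toy input, the combination of `app_version = v2` and `device = A` ranks first, followed by `app_version = v2` alone; `device = A` by itself is dropped because it is equally common in both relations.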

Notes

To keep ANTI DIFF consistent with -, we also prune all explanations with no support in R.
Our implementation is open source and available at https://github.com/stanford-futuredata/macrobase.
https://support.censys.io/hc/en-us/articles/360038761891-Research-Access-to-Censys-Data.
https://www.cms.gov/OpenPayments/Explore-the-Data/Data-Overview.html.
https://bitbucket.org/xlwang/dataxray-source-code.

Acknowledgements

We thank Kexin Rong, Hector Garcia-Molina, our colleagues in the Stanford DAWN Project, and the anonymous VLDB reviewers for their detailed feedback on earlier drafts of this work. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Intel, Microsoft, NEC, SAP, Teradata, and VMware—as well as Toyota Research Institute, Keysight Technologies, Hitachi, Northrop Grumman, Amazon Web Services, Juniper Networks, NetApp, and the NSF under CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author information

Authors and Affiliations

Stanford DAWN Project, Stanford University, Stanford, CA, USA
Firas Abuzaid, Peter Kraft, Sahaana Suri, Edward Gan, Eric Xu, Peter Bailis & Matei Zaharia
Microsoft Inc, Redmond, WA, USA
Atul Shenoy, Asvin Ananthanarayan & John Sheu
Facebook Inc, Menlo Park, CA, USA
Erik Meijer
Google Inc, Mountain View, CA, USA
Xi Wu & Jeff Naughton

Corresponding author

Correspondence to Firas Abuzaid.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Translating DIFF to standard SQL

We present a sample DIFF query, borrowed from the Example Workflow in Sect. 2.1, and its translation into standard SQL.

This query is equivalent to the following Postgres-compatible SQL query:
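The query listings themselves are not included in this text. As a stand-in, the following is a minimal sketch, not the authors' translation, of how a DIFF-style comparison could be expressed with ordinary GROUP BY aggregation. It is run here through Python's built-in `sqlite3` module so that it is self-contained. The `logs` table, its columns, the thresholds, and the metric definitions (support as the fraction of test rows matching an explanation, risk ratio as the rate of test membership among matching rows divided by the rate among non-matching rows) are all assumptions for illustration, and the sketch only covers single-attribute explanations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs (app_version TEXT, device TEXT, crash INTEGER);
INSERT INTO logs VALUES
  ('v2', 'A', 1), ('v2', 'A', 1), ('v2', 'A', 1), ('v1', 'B', 1),
  ('v1', 'A', 0), ('v1', 'A', 0), ('v1', 'A', 0), ('v2', 'B', 0);
""")

# crash = 1 rows play the role of the test relation, crash = 0 the control.
SQL = """
WITH totals AS (
  SELECT SUM(crash) AS n_t, SUM(1 - crash) AS n_c FROM logs
),
candidates AS (
  -- One GROUP BY per dimension; a_t / a_c are matching test / control rows.
  SELECT 'app_version' AS attr, app_version AS val,
         SUM(crash) AS a_t, SUM(1 - crash) AS a_c
  FROM logs GROUP BY app_version
  UNION ALL
  SELECT 'device', device, SUM(crash), SUM(1 - crash)
  FROM logs GROUP BY device
),
scored AS (
  SELECT attr, val,
         1.0 * a_t / n_t AS support,
         (1.0 * a_t / (a_t + a_c)) /
         (1.0 * (n_t - a_t) / ((n_t - a_t) + (n_c - a_c))) AS risk_ratio
  FROM candidates CROSS JOIN totals
  -- Skip values absent from the test rows, and values covering every
  -- test row (their risk ratio would be unbounded in this formulation).
  WHERE a_t > 0 AND a_t < n_t
)
SELECT attr, val, support, risk_ratio
FROM scored
WHERE support >= :min_support AND risk_ratio >= :min_risk_ratio
ORDER BY risk_ratio DESC
"""

rows = conn.execute(SQL, {"min_support": 0.5, "min_risk_ratio": 2.0}).fetchall()
for attr, val, support, risk_ratio in rows:
    print(attr, val, support, risk_ratio)
```

Covering explanations over pairs or triples of attributes in this style would need one additional grouped subquery per attribute combination, so the plain-SQL form grows quickly with the number of dimensions and the maximum order.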

About this article

Cite this article

Abuzaid, F., Kraft, P., Suri, S. et al. DIFF: a relational interface for large-scale data explanation. The VLDB Journal 30, 45–70 (2021). https://doi.org/10.1007/s00778-020-00633-6

Received: 02 February 2020
Revised: 16 August 2020
Accepted: 26 August 2020
Published: 30 September 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s00778-020-00633-6
