DOI: 10.1145/3448016.3458456

Correlation Sketches for Approximate Join-Correlation Queries

Published: 18 June 2021

Abstract

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.
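To make the query definition concrete, the following is a minimal sketch of the naïve baseline the abstract describes: join each candidate table with the query table on the key, then compute a correlation for every candidate column. It is not the paper's sketching method. The names `pearson` and `naive_join_correlation`, the in-memory table layout, the assumption that join keys are unique within each table, and the choice of Pearson's coefficient as the correlation measure are all illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def naive_join_correlation(query_keys, query_values, candidates):
    """Rank candidate columns by |correlation| with the query column.

    candidates maps a table name to (keys, {column_name: values}).
    Assumption: keys are unique within each table (hypothetical layout).
    """
    q = dict(zip(query_keys, query_values))
    results = []
    for table, (keys, columns) in candidates.items():
        for column, values in columns.items():
            c = dict(zip(keys, values))
            shared = [k for k in q if k in c]  # explicit join on the key
            r = pearson([q[k] for k in shared], [c[k] for k in shared])
            if r is not None:
                results.append((table, column, r))
    # Strongest correlations first, regardless of sign.
    results.sort(key=lambda t: abs(t[2]), reverse=True)
    return results


if __name__ == "__main__":
    candidates = {
        "T1": (["b", "c", "d", "e"], {"x": [4, 6, 8, 10], "y": [9, 5, 1, 0]}),
        "T2": (["z"], {"w": [7]}),  # not joinable with the query keys
    }
    for row in naive_join_correlation(["a", "b", "c", "d"], [1, 2, 3, 4], candidates):
        print(row)
```

Every candidate column requires a full join and a pass over the joined rows, which is the cost that grows prohibitively over large collections.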

Supplementary Material

MP4 File (3448016.3458456.mp4)
The growing number of available structured datasets, from Web tables and open-data portals to enterprise data, opens up new opportunities to enrich analytics and improve machine learning models through data augmentation. In this paper, we introduce a new class of augmentation queries, join-correlation queries, which, given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection (or data lake) $\mathcal{D}$ such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A straightforward approach to evaluate these queries is to first find joinable tables, and then to explicitly compute correlations between $Q$ and all columns of the discovered tables. However, for queries over large collections or queries that return large tables, doing so for many candidate results is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a new sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketching method attains high accuracy and the scoring strategies lead to high-quality rankings.

      Published In

      SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
      June 2021
      2969 pages
      ISBN:9781450383431
      DOI:10.1145/3448016
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. approximate query processing
      2. confidence intervals
      3. dataset search
      4. join-correlation estimation
      5. sketching algorithms

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '21

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Bibliometrics & Citations

      Article Metrics

      • Downloads (Last 12 months): 408
      • Downloads (Last 6 weeks): 41
      Reflects downloads up to 02 Feb 2025

      Cited By

      • (2024) Navigating Data Repositories: Utilizing Line Charts to Discover Relevant Datasets. Proceedings of the VLDB Endowment 17:12 (4289-4292). DOI: 10.14778/3685800.3685857. Online publication date: 8-Nov-2024
      • (2024) Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment 17:11 (3109-3123). DOI: 10.14778/3681954.3681987. Online publication date: 30-Aug-2024
      • (2024) Sampling Methods for Inner Product Sketching. Proceedings of the VLDB Endowment 17:9 (2185-2197). DOI: 10.14778/3665844.3665850. Online publication date: 1-May-2024
      • (2024) Digging Up Threats to Validity: A Data Marshalling Approach to Sensitivity Analysis. Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI (1-5). DOI: 10.1145/3665601.3669850. Online publication date: 9-Jun-2024
      • (2024) Nexus: Correlation Discovery over Collections of Spatio-Temporal Tabular Data. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654957. Online publication date: 30-May-2024
      • (2024) Demonstrating Nexus for Correlation Discovery over Collections of Spatio-Temporal Tabular Data. Companion of the 2024 International Conference on Management of Data (524-527). DOI: 10.1145/3626246.3654747. Online publication date: 9-Jun-2024
      • (2024) Generalized Measure-Biased Sampling and Priority Sampling. IEEE Transactions on Knowledge and Data Engineering 36:11 (6251-6265). DOI: 10.1109/TKDE.2023.3340673. Online publication date: Nov-2024
      • (2024) Priority Sketch: A Priority-aware Measurement Framework. 2024 International Conference on Satellite Internet (SAT-NET) (18-23). DOI: 10.1109/SAT-NET62854.2024.00012. Online publication date: 25-Oct-2024
      • (2024) BitMatcher: Bit-level Counter Adjustment for Sketches. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (4815-4827). DOI: 10.1109/ICDE60146.2024.00366. Online publication date: 13-May-2024
      • (2024) Sketches-Based Join Size Estimation Under Local Differential Privacy. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (1726-1738). DOI: 10.1109/ICDE60146.2024.00140. Online publication date: 13-May-2024