Correlation Sketches for Approximate Join-Correlation Queries

Authors:

Juliana Freire

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 1531 - 1544

https://doi.org/10.1145/3448016.3458456

Published: 18 June 2021

Abstract

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.
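To make the naïve baseline concrete, the following is a minimal Python sketch of it, not the paper's code. It assumes that every table has unique join keys (a one-to-one join), that columns are stored as plain dictionaries from key to numeric value, and that correlation means Pearson's r. The names `pearson` and `naive_join_correlation` are hypothetical and chosen only for this illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r for two equal-length numeric lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def naive_join_correlation(query, candidates):
    """Explicitly join the query column with every candidate column.

    query:      dict mapping join key -> value of Q (assumed layout)
    candidates: dict mapping table name -> {column name -> {key -> value}}
    Returns (table, column, r, join_size) tuples, strongest |r| first.
    """
    results = []
    for table, columns in candidates.items():
        for column, values in columns.items():
            # Full join on the key: this is the step that becomes
            # expensive when there are many large candidate tables.
            shared = [k for k in query if k in values]
            r = pearson([query[k] for k in shared],
                        [values[k] for k in shared])
            if r is not None:
                results.append((table, column, r, len(shared)))
    results.sort(key=lambda item: abs(item[2]), reverse=True)
    return results
```

Every candidate column is scanned in full for every query, so the cost grows with the total size of the collection; this is the expense that the sketch-based index is meant to avoid.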

Supplementary Material

MP4 File (3448016.3458456.mp4)

The growing number of available structured datasets, from Web tables and open-data portals to enterprise data, opens up new opportunities to enrich analytics and improve machine learning models through data augmentation. In this paper, we introduce a new class of augmentation queries, join-correlation queries, which, given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection (or data lake) $\mathcal{D}$ such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A straightforward approach to evaluate these queries is to first find joinable tables, and then to explicitly compute correlations between $Q$ and all columns of the discovered tables. However, for queries over large collections, or queries that return large tables, doing so for many candidate results is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a new sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketching method attains high accuracy and the scoring strategies lead to high-quality rankings.
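The abstract does not say how the sketch is constructed. As an illustration only, the Python sketch below assumes one common way to obtain a joinable fixed-size summary: coordinated sampling, in which every table hashes its join keys with the same hash function and keeps the entries with the smallest hash values. Sketches of two columns can then be joined on the hashed keys, and the correlation is estimated from the paired values. The function names, the choice of SHA-1, and the use of Pearson's r are assumptions made for this example and should not be read as the paper's actual construction.

```python
import hashlib
from math import sqrt


def key_hash(key):
    """Deterministic 64-bit hash, shared by all tables (assumed choice)."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_sketch(column, size):
    """Keep the `size` entries whose keys have the smallest hash values.

    column: dict mapping join key -> numeric value (keys assumed unique).
    Returns a dict mapping hashed key -> value.
    """
    hashed = sorted((key_hash(k), v) for k, v in column.items())
    return dict(hashed[:size])


def estimate_correlation(sketch_q, sketch_c):
    """Pearson's r over the hashed keys found in both sketches."""
    shared = [h for h in sketch_q if h in sketch_c]
    xs = [sketch_q[h] for h in shared]
    ys = [sketch_c[h] for h in shared]
    n = len(shared)
    if n < 2:
        return None  # too few paired values to estimate anything
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)
```

Because all tables use the same hash function, keys that appear in two tables tend to be selected by both sketches, so the sketch join is much larger than it would be under independent random samples. Sketches can be built once per column and stored in an index, and a query then touches only the fixed-size summaries rather than the full tables.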
