DOI: 10.1109/ICDE.2006.9
Article

A Primitive Operator for Similarity Joins in Data Cleaning

Published: 03 April 2006
  • Abstract

    Data cleaning based on similarities involves identification of "close" tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity. We then propose efficient implementations for this operator. In an experimental evaluation using real datasets, we show that the implementation of similarity joins using our operator is comparable to, and often substantially better than, previous customized implementations for particular similarity functions.
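
To make the notion of a similarity join concrete, here is a minimal sketch that joins two lists of strings whenever their token sets overlap strongly enough. It is an illustration only and not the operator proposed in the paper, whose definition the abstract does not give. The names `qgrams`, `jaccard`, and `similarity_join`, the choice of 3-grams and Jaccard similarity, and the 0.5 threshold are all assumptions made for this example, and the nested-loop comparison is deliberately naive rather than efficient.

```python
from itertools import product


def qgrams(s, q=3):
    """Return the set of lowercase q-grams of s (hypothetical tokenizer)."""
    s = s.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(left, right, threshold=0.5, q=3):
    """Naive similarity join: compare every pair and keep the 'close' ones.

    Assumption: closeness is Jaccard similarity over q-gram sets. Other
    similarity functions could be substituted for `jaccard`.
    """
    left_sets = [(r, qgrams(r, q)) for r in left]
    right_sets = [(s, qgrams(s, q)) for s in right]
    return [
        (r, s)
        for (r, r_set), (s, s_set) in product(left_sets, right_sets)
        if jaccard(r_set, s_set) >= threshold
    ]


if __name__ == "__main__":
    pairs = similarity_join(["Microsoft Corp"], ["Microsoft Corp.", "Oracle Inc"])
    print(pairs)
```

The quadratic pairwise comparison is the cost that efficient similarity-join implementations aim to avoid; this sketch only fixes the input and output of the operation.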

    Published In

    ICDE '06: Proceedings of the 22nd International Conference on Data Engineering
    April 2006
    ISBN:0769525709

    Publisher

    IEEE Computer Society

    United States

    Qualifiers

    • Article

    Cited By

    • (2024) Nexus: Correlation Discovery over Collections of Spatio-Temporal Tabular Data. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654957. Online publication date: 30-May-2024
    • (2024) Worst-Case-Optimal Similarity Joins on Graph Databases. Proceedings of the ACM on Management of Data 2:1 (1-26). DOI: 10.1145/3639294. Online publication date: 26-Mar-2024
    • (2024) Similarity Joins of Sparse Features. Companion of the 2024 International Conference on Management of Data (80-92). DOI: 10.1145/3626246.3653370. Online publication date: 9-Jun-2024
    • (2023) Declarative Sub-Operators for Universal Data Processing. Proceedings of the VLDB Endowment 16:11 (3461-3474). DOI: 10.14778/3611479.3611539. Online publication date: 24-Aug-2023
    • (2023) A Two-Level Signature Scheme for Stable Set Similarity Joins. Proceedings of the VLDB Endowment 16:11 (2686-2698). DOI: 10.14778/3611479.3611480. Online publication date: 24-Aug-2023
    • (2023) Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph. Proceedings of the VLDB Endowment 16:10 (2578-2590). DOI: 10.14778/3603581.3603596. Online publication date: 1-Jun-2023
    • (2023) DeepJoin: Joinable Table Discovery with Pre-Trained Language Models. Proceedings of the VLDB Endowment 16:10 (2458-2470). DOI: 10.14778/3603581.3603587. Online publication date: 8-Aug-2023
    • (2023) Grouping Time Series for Efficient Columnar Storage. Proceedings of the ACM on Management of Data 1:1 (1-26). DOI: 10.1145/3588703. Online publication date: 30-May-2023
    • (2022) MATE. Proceedings of the VLDB Endowment 15:8 (1684-1696). DOI: 10.14778/3529337.3529353. Online publication date: 22-Jun-2022
    • (2022) VSIM. Journal of Parallel and Distributed Computing 158:C (29-46). DOI: 10.1016/j.jpdc.2021.07.009. Online publication date: 22-Apr-2022