DOI: 10.1145/3318464.3389717
research-article

Efficient Join Synopsis Maintenance for Data Warehouse

Published: 31 May 2020

Abstract

    Sources such as daily business operations and sensors in IoT applications constantly generate large volumes of data, which are often loaded into a data warehouse system for complex analysis. Such analysis can be extremely costly when a query involves joins, especially many-to-many joins over multiple large tables. A join synopsis, i.e., a small uniform random sample over the join result, often suffices as a representative alternative to the full join result for applications such as histogram construction and model training. To that end, we propose a novel algorithm, SJoin, that maintains a join synopsis over a pre-specified general θ-join query in a dynamic database with continuous inflows of updates. Central to SJoin is a weighted join graph index, which helps to efficiently replace join results in the synopsis upon each update. We conduct extensive experiments over several complex join queries using TPC-DS and simulated road sensor data; the results demonstrate a clear advantage of SJoin over the best available baseline.
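
    To make the notion of a join synopsis concrete, the sketch below maintains a fixed-size uniform random sample of a two-table equi-join under insertions. It is not the SJoin algorithm and does not use a weighted join graph index: it is a naive illustration that enumerates every new join result and feeds it to a standard reservoir sampler. The class name `NaiveJoinSynopsis`, the table names `R` and `S`, and the restriction to insert-only updates on a single equality predicate are all assumptions made for this example.

```python
import random
from collections import defaultdict


class NaiveJoinSynopsis:
    """Uniform sample of R JOIN S ON R.key = S.key under insertions only.

    Hypothetical illustration: every new join result is enumerated and
    offered to a reservoir sampler, which is exactly the cost an
    index-based approach would try to avoid.
    """

    def __init__(self, sample_size, seed=None):
        self.sample_size = sample_size
        self.rng = random.Random(seed)
        # One hash index per table: join key -> list of payloads.
        self.index = {"R": defaultdict(list), "S": defaultdict(list)}
        self.synopsis = []   # current sample of join results
        self.join_size = 0   # number of join results produced so far

    def insert(self, table, key, payload):
        """Insert a tuple into table "R" or "S" and update the synopsis."""
        other = "S" if table == "R" else "R"
        self.index[table][key].append(payload)
        # Each matching tuple on the other side yields one new join result.
        for match in self.index[other][key]:
            if table == "R":
                self._offer((key, payload, match))
            else:
                self._offer((key, match, payload))

    def _offer(self, result):
        """Reservoir sampling step for one newly produced join result."""
        self.join_size += 1
        if len(self.synopsis) < self.sample_size:
            self.synopsis.append(result)
            return
        # Keep the new result with probability sample_size / join_size.
        slot = self.rng.randrange(self.join_size)
        if slot < self.sample_size:
            self.synopsis[slot] = result


synopsis = NaiveJoinSynopsis(sample_size=3, seed=0)
synopsis.insert("R", 1, "r1")
synopsis.insert("S", 1, "s1")
synopsis.insert("S", 1, "s2")
synopsis.insert("R", 1, "r2")
synopsis.insert("R", 2, "r3")  # no matching S tuple, so no new join result
print(synopsis.join_size, synopsis.synopsis)
```

    The work per insertion here is proportional to the number of new join results, which can be very large for many-to-many joins; the sketch also ignores deletions and non-equality θ-join predicates.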

    Supplementary Material

    MP4 File (3318464.3389717.mp4)
    Presentation Video

    Published In

    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    June 2020
    2925 pages
    ISBN:9781450367356
    DOI:10.1145/3318464

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. join synopsis
    2. random sampling

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '20

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2024) Reservoir Sampling over Joins. Proceedings of the ACM on Management of Data 2:3 (1-26). https://doi.org/10.1145/3654921. Online publication date: 30-May-2024
    • (2024) Cluster based similarity extraction upon distributed datasets. Cluster Computing 27:3 (2917-2929). https://doi.org/10.1007/s10586-023-04116-5. Online publication date: 1-Jun-2024
    • (2023) Plexus. Proceedings of the 2023 ACM Symposium on Cloud Computing (1-16). https://doi.org/10.1145/3620678.3624643. Online publication date: 30-Oct-2023
    • (2023) On Join Sampling and the Hardness of Combinatorial Output-Sensitive Join Algorithms. Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (99-111). https://doi.org/10.1145/3584372.3588666. Online publication date: 18-Jun-2023
    • (2023) SynopsisDB: Distributed Synopsis-based Data Processing System. Companion of the 2023 International Conference on Management of Data (289-291). https://doi.org/10.1145/3555041.3589394. Online publication date: 4-Jun-2023
    • (2022) Cardinality estimation in DBMS. Proceedings of the VLDB Endowment 15:4 (752-765). https://doi.org/10.14778/3503585.3503586. Online publication date: 14-Apr-2022
    • (2020) Efficiently approximating selectivity functions using low overhead regression models. Proceedings of the VLDB Endowment 13:12 (2215-2228). https://doi.org/10.14778/3407790.3407820. Online publication date: 14-Sep-2020
