DOI: 10.1145/3397536.3422220

TrioStat: Online Workload Estimation in Distributed Spatial Data Streaming Systems

Published: 13 November 2020

    Abstract

    The widespread adoption of GPS-enabled devices and the Internet of Things (IoT) has increased the amount of spatial data generated every second. Centralized systems cannot handle spatial data at its current scale. This has led to the development of distributed spatial data streaming systems that scale to process large amounts of streamed spatial data in real time. The performance of a distributed streaming system depends on how evenly the workload is distributed among its machines. However, estimating the workload of each machine is challenging because spatial data and query streams are skewed and change rapidly with time and users' interests. Moreover, a distributed spatial streaming system often does not maintain a global workload state because collecting it from the machines in the system incurs high network and processing overhead.
    This paper introduces TrioStat, an online workload estimation technique that relies on a probabilistic model to estimate the workload of partitions and machines in a distributed spatial data streaming system. Collecting and exchanging statistics with a centralized unit is infeasible because of the high network overhead it requires. Instead, TrioStat uses a decentralized technique to collect and maintain the required statistics locally on each machine in real time. TrioStat enables distributed spatial data streaming systems to compare the workloads of machines as well as the workloads of data partitions. TrioStat requires minimal network and storage overhead, and the required storage is distributed across the system's machines.
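
    The idea of keeping statistics locally and comparing workloads can be illustrated with a minimal sketch. This is not TrioStat's probabilistic model, which the abstract does not specify. It assumes, purely for illustration, that each machine keeps decayed per-partition counts of data points and queries, and that the product of the two counts serves as a rough workload proxy. The names LocalStats, record_point, record_query, end_window, and busiest are hypothetical.

```python
from collections import defaultdict


class LocalStats:
    """Statistics kept locally on one machine (hypothetical structure)."""

    def __init__(self, decay=0.5):
        # decay in (0, 1]: weight kept by old counts after each window,
        # so the estimate follows changes in the streams (an assumption).
        self.decay = decay
        self.points = defaultdict(float)   # partition id -> decayed data count
        self.queries = defaultdict(float)  # partition id -> decayed query count

    def record_point(self, partition):
        self.points[partition] += 1.0

    def record_query(self, partition):
        self.queries[partition] += 1.0

    def end_window(self):
        # Age all counts; only values change, so iterating the keys is safe.
        for counts in (self.points, self.queries):
            for key in counts:
                counts[key] *= self.decay

    def partition_workload(self, partition):
        # Assumed cost proxy: every query is checked against every point.
        # .get() avoids creating entries for unseen partitions.
        return self.points.get(partition, 0.0) * self.queries.get(partition, 0.0)

    def machine_workload(self):
        partitions = set(self.points) | set(self.queries)
        return sum(self.partition_workload(p) for p in partitions)


def busiest(machines):
    """Return the name of the machine with the largest estimated workload."""
    return max(machines, key=lambda name: machines[name].machine_workload())
```

    In a sketch like this, a machine would only need to share the single number returned by machine_workload() to be compared with its peers, while the per-partition counts stay local.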

    Published In

    SIGSPATIAL '20: Proceedings of the 28th International Conference on Advances in Geographic Information Systems
    November 2020
    687 pages
    ISBN:9781450380195
    DOI:10.1145/3397536

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Workload estimation
    2. collecting statistics
    3. distributed streaming systems
    4. load balancing
    5. spatial stream processing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • National Science Foundation

    Conference

    SIGSPATIAL '20
    Acceptance Rates

    Overall Acceptance Rate 220 of 1,116 submissions, 20%
