LocationSpark: a distributed in-memory data management system for big spatial data

Published: 01 September 2016

Abstract

    We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operations, spatial join, and kNN join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immutable spatial indexes have low overhead with fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapping spatial data, we embed an efficient spatial Bloom filter into LocationSpark's indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order-of-magnitude performance gain over a baseline framework.
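
    The role of a spatial Bloom filter in avoiding network traffic can be pictured with the minimal Python sketch below. It is not LocationSpark's actual filter or API. It assumes a plain Bloom filter keyed on uniform grid cells, and the names `GridBloomFilter`, `might_intersect`, and `partitions_to_probe` are hypothetical. The idea it illustrates is only this: each partition summarizes which cells hold data, and a range query is sent only to partitions whose summary says "maybe".

```python
import hashlib


class GridBloomFilter:
    """Bloom filter over grid-cell ids (illustrative, not the paper's structure).

    It can report false positives but never false negatives, so skipping a
    partition that answers "no" never loses results.
    """

    def __init__(self, cell_size, num_bits=1024, num_hashes=3):
        self.cell_size = cell_size
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _cell(self, x, y):
        # Map a coordinate to the integer id of the grid cell containing it.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def _positions(self, cell):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            key = f"{seed}:{cell[0]}:{cell[1]}".encode()
            digest = hashlib.sha256(key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_point(self, x, y):
        for pos in self._positions(self._cell(x, y)):
            self.bits[pos] = 1

    def _might_contain_cell(self, cell):
        return all(self.bits[pos] for pos in self._positions(cell))

    def might_intersect(self, xmin, ymin, xmax, ymax):
        # True if any grid cell touched by the rectangle may hold data.
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        return any(
            self._might_contain_cell((cx, cy))
            for cx in range(cx0, cx1 + 1)
            for cy in range(cy0, cy1 + 1)
        )


def partitions_to_probe(filters, query_rect):
    """Return ids of partitions worth contacting for a range query."""
    return [pid for pid, f in filters.items() if f.might_intersect(*query_rect)]


if __name__ == "__main__":
    populated = GridBloomFilter(cell_size=1.0)
    populated.add_point(1.5, 2.5)
    empty = GridBloomFilter(cell_size=1.0)
    filters = {"p0": populated, "p1": empty}
    print(partitions_to_probe(filters, (1.0, 2.0, 2.0, 3.0)))
```

    In this sketch a partition with no data in the queried cells is usually skipped, at the cost of occasional false positives that only cause an extra, harmless probe. The grid resolution and bit-array size are tuning knobs chosen arbitrarily here.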

        Published In

        Proceedings of the VLDB Endowment, Volume 9, Issue 13
        September 2016
        378 pages
        ISSN: 2150-8097

        Publisher

        VLDB Endowment
