DOI: 10.1145/3139958.3140019
research-article

SparkGIS: Resource Aware Efficient In-Memory Spatial Query Processing

Published: 07 November 2017

Abstract

Much effort has been devoted to supporting high-performance spatial queries on large volumes of spatial data in distributed spatial computing systems, especially in the MapReduce paradigm. Recent work has focused on extending spatial MapReduce frameworks to leverage the high-performance in-memory distributed processing capabilities of systems such as Spark. However, the performance advantage requires sufficient memory and comprehensive configuration. When these requirements are not met, processing falls back to disk I/O, defeating the purpose of such systems, or in the worst case the job runs out of memory and fails. The problem is further aggravated for spatial processing, since the underlying in-memory systems are oblivious to spatial data features and characteristics. In this paper we present SparkGIS, an in-memory oriented spatial data querying system for high-throughput, low-latency spatial query handling that adapts Apache Spark's distributed processing capabilities. It supports basic spatial queries, including containment, spatial join, and k-nearest neighbor, and allows these to be extended into complex query pipelines. SparkGIS mitigates skew in distributed processing by supporting several dynamic partitioning algorithms suitable for a rich set of contemporary application scenarios. Multilevel in-memory indexes (global and local, pre-generated and on-demand) allow SparkGIS to prune input data and apply compute-intensive operations only to a subset of relevant spatial objects. Finally, SparkGIS employs dynamic query rewriting to gracefully manage large spatial query workflows that exceed available distributed resources. Our comparative evaluation shows that SparkGIS performs on par with contemporary Spark-based platforms for relatively small queries and outperforms them for larger, memory-intensive workflows through dynamic query rewriting and efficient spatial data management.
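
The pruning idea mentioned in the abstract (use an index to discard irrelevant objects, then run the expensive test only on the candidates) can be illustrated with a small single-machine sketch. This is not SparkGIS code and does not use Spark. It assumes a uniform grid as a stand-in for the index, axis-aligned bounding boxes `(xmin, ymin, xmax, ymax)` as a stand-in for real geometries, and hypothetical helper names (`cells_for`, `intersects`, `grid_spatial_join`) chosen only for this illustration.

```python
from collections import defaultdict
from itertools import product


def cells_for(box, cell_size):
    """Return the grid cells (ix, iy) overlapped by box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    xs = range(int(xmin // cell_size), int(xmax // cell_size) + 1)
    ys = range(int(ymin // cell_size), int(ymax // cell_size) + 1)
    return product(xs, ys)


def intersects(a, b):
    """Exact test for two axis-aligned boxes; boxes that merely touch count as intersecting."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def grid_spatial_join(left, right, cell_size):
    """Return sorted (i, j) pairs such that left[i] intersects right[j].

    Assumption for illustration: a uniform grid plays the role of the index.
    """
    # Build step: register every right-hand box in each cell it overlaps.
    grid = defaultdict(list)
    for j, box in enumerate(right):
        for cell in cells_for(box, cell_size):
            grid[cell].append(j)

    # Probe step: only boxes sharing a cell are candidates (pruning),
    # and only candidates get the exact intersection test (refinement).
    result = set()  # a set, because a pair may share several cells
    for i, box in enumerate(left):
        for cell in cells_for(box, cell_size):
            for j in grid.get(cell, ()):
                if intersects(box, right[j]):
                    result.add((i, j))
    return sorted(result)


left = [(0, 0, 2, 2), (5, 5, 6, 6)]
right = [(1, 1, 3, 3), (10, 10, 11, 11), (5.5, 5.5, 7, 7)]
pairs = grid_spatial_join(left, right, cell_size=2)
```

In a distributed setting each grid cell could become a partition, and heavily populated cells are one source of the skew that the abstract says dynamic partitioning is meant to mitigate. The fixed `cell_size` here is a simplification, not a partitioning algorithm from the paper.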

Published In

SIGSPATIAL '17: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
November 2017
677 pages
ISBN: 9781450354905
DOI: 10.1145/3139958
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. In-Memory processing
  2. MapReduce
  3. Spark
  4. Spatial processing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NCI
  • NLM

Conference

SIGSPATIAL'17
Acceptance Rates

SIGSPATIAL '17 Paper Acceptance Rate 39 of 193 submissions, 20%;
Overall Acceptance Rate 220 of 1,116 submissions, 20%
