research-article

ScalaGiST: scalable generalized search trees for mapreduce systems [innovative systems paper]

Published: 01 October 2014

Abstract

MapReduce has become the state of the art for data-parallel processing. Nevertheless, Hadoop, an open-source implementation of MapReduce, has been noted to have sub-optimal performance in the database context, since it was initially designed to operate on raw data without utilizing any type of index. To alleviate the problem, we present ScalaGiST, a scalable generalized search tree that can be seamlessly integrated with Hadoop, together with a cost-based data access optimizer for efficient query processing at run-time. ScalaGiST provides extensibility in terms of data and query types, and hence is able to support unconventional queries (e.g., multi-dimensional range and k-NN queries) in MapReduce systems. It can be dynamically deployed in large cluster environments to handle large numbers of users and large volumes of data.
We have built ScalaGiST and demonstrated that it can be easily instantiated as common B+-tree and R-tree indexes, adapted for dynamic distributed environments. Our extensive performance study shows that ScalaGiST provides efficient write and read performance, elastic scaling, and effective support for MapReduce execution of ad-hoc analytic queries. Performance comparisons with recent proposals of specialized distributed index structures, such as SpatialHadoop, Data Mapping, and RT-CAN, further confirm its efficiency.
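
To illustrate what "instantiated as B+-tree and R-tree indexes" can mean, the following is a minimal single-machine sketch of the generalized search tree idea, not ScalaGiST's actual API. The method names Consistent, Union, and Penalty follow the original GiST framework of Hellerstein, Naughton, and Pfeffer; everything else (`Box`, `Node`, `build`, `search`, `point`) is a hypothetical name chosen for this example. The sketch assumes keys are axis-aligned boxes, so a 1-D box behaves like a B+-tree key range and a 2-D box like an R-tree bounding rectangle. It ignores distribution, Hadoop integration, and the cost-based optimizer entirely.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A key is one (low, high) interval per dimension (assumption of this sketch).
Box = Tuple[Tuple[float, float], ...]


def consistent(key: Box, query: Box) -> bool:
    """GiST Consistent: could the subtree under `key` contain a match for `query`?"""
    return all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(key, query))


def union(keys: List[Box]) -> Box:
    """GiST Union: the smallest box that covers all given keys."""
    dims = range(len(keys[0]))
    return tuple((min(k[d][0] for k in keys), max(k[d][1] for k in keys))
                 for d in dims)


def volume(key: Box) -> float:
    v = 1.0
    for lo, hi in key:
        v *= hi - lo
    return v


def penalty(key: Box, new: Box) -> float:
    """GiST Penalty: how much `key` must grow in volume to absorb `new`."""
    return volume(union([key, new])) - volume(key)


@dataclass
class Node:
    key: Box
    children: List["Node"] = field(default_factory=list)  # empty for a leaf entry
    value: object = None


def point(*coords: float) -> Box:
    """A degenerate box representing a single point."""
    return tuple((c, c) for c in coords)


def build(entries: List[Tuple[Box, object]], fanout: int = 2) -> Node:
    """Toy bottom-up bulk load: sort leaf entries, then group `fanout` nodes per parent.
    Expects at least one entry."""
    level = [Node(key=k, value=v) for k, v in sorted(entries, key=lambda e: e[0])]
    while len(level) > 1:
        level = [Node(key=union([n.key for n in level[i:i + fanout]]),
                      children=level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


def search(node: Node, query: Box) -> List[object]:
    """Generic tree search: descend only into subtrees whose key is consistent."""
    if not consistent(node.key, query):
        return []
    if not node.children:
        return [node.value]
    hits: List[object] = []
    for child in node.children:
        hits.extend(search(child, query))
    return hits


# 1-D keys: the same code acts like a B+-tree range scan.
btree_like = build([(point(k), f"row{k}") for k in [5, 1, 9, 3, 7]])
range_hits = search(btree_like, ((3, 7),))

# 2-D keys: the same code acts like an R-tree window query.
rtree_like = build([(point(1, 1), "a"), (point(2, 8), "b"),
                    (point(6, 3), "c"), (point(9, 9), "d")])
window_hits = search(rtree_like, ((0, 5), (0, 5)))
```

The point of the design is that `search` and `build` never inspect the key type directly; only `consistent`, `union`, and `penalty` do, so supporting a new data or query type would mean swapping those functions rather than rewriting the tree.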

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment  Volume 7, Issue 14
October 2014
244 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 October 2014
Published in PVLDB Volume 7, Issue 14

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 3
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 13 Jan 2025

Citations

Cited By

  • (2022) Incremental partitioning for efficient spatial data analytics. Proceedings of the VLDB Endowment, 15:3 (713-726). DOI: 10.14778/3494124.3494150. Online publication date: 4-Feb-2022
  • (2022) Scalable computational geometry in MapReduce. The VLDB Journal, The International Journal on Very Large Data Bases, 28:4 (523-548). DOI: 10.1007/s00778-018-0534-5. Online publication date: 10-Mar-2022
  • (2020) Using Deep Learning for Big Spatial Data Partitioning. ACM Transactions on Spatial Algorithms and Systems, 7:1 (1-37). DOI: 10.1145/3402126. Online publication date: 12-Aug-2020
  • (2019) Comparing synopsis techniques for approximate spatial data analysis. Proceedings of the VLDB Endowment, 12:11 (1583-1596). DOI: 10.14778/3342263.3342635. Online publication date: 1-Jul-2019
  • (2018) Detecting skewness of big spatial data in SpatialHadoop. Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (432-435). DOI: 10.1145/3274895.3274923. Online publication date: 6-Nov-2018
  • (2018) ST-Hadoop. Geoinformatica, 22:4 (785-813). DOI: 10.1007/s10707-018-0325-6. Online publication date: 1-Oct-2018
  • (2017) The era of big spatial data. Proceedings of the VLDB Endowment, 10:12 (1992-1995). DOI: 10.14778/3137765.3137828. Online publication date: 1-Aug-2017
  • (2017) Experimental evaluation of selectivity estimation on big spatial data. Proceedings of the Fourth International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data (1-6). DOI: 10.1145/3080546.3080553. Online publication date: 14-May-2017
  • (2016) The Era of Big Spatial Data. Foundations and Trends in Databases, 6:3-4 (163-273). DOI: 10.1561/1900000054. Online publication date: 28-Dec-2016
  • (2016) Dynamic multidimensional index for large-scale cloud data. Journal of Cloud Computing: Advances, Systems and Applications, 5:1 (1-11). DOI: 10.1186/s13677-016-0060-1. Online publication date: 1-Dec-2016
