High-performance XML modeling of parallel queries based on MapReduce framework

Authors:

Hongwei Lu

Cluster Computing, Volume 19, Issue 4

Pages 1975 - 1986

https://doi.org/10.1007/s10586-016-0628-z

Published: 01 December 2016

Abstract

With data volumes growing at an extraordinary rate, the development of cloud computing technologies is critical to the advancement of research. MapReduce is a widely adopted computing framework for data-intensive applications running on clusters. Traditional parallel XML parsing and indexing approaches are inadequate for processing large-scale XML datasets on clusters; we therefore propose an approach that exploits data parallelism in XML processing using MapReduce in Hadoop. Our solution seamlessly integrates data storage, labeling, indexing, and parallel queries to process massive amounts of XML data. Specifically, we introduce an SDN labeling algorithm and a distributed hierarchical index using DHTs. More importantly, we design an advanced two-phase MapReduce solution that efficiently addresses labeling, indexing, and query processing on big XML data. The first MapReduce phase applies filtering, labeling, and index-building techniques: each DataNode labels elements using a map function, and a reduce function merges the results and builds the indexes. In the second phase, local XML queries over multiple partitions are performed in parallel using index-table-enabled B-SLCA. Our experimental results demonstrate the efficiency and effectiveness of the proposed parallel XML data approach using the MapReduce framework.
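The two-phase idea in the abstract can be illustrated with a small single-process sketch. It is not the paper's implementation: the abstract does not define the SDN labeling scheme, the DHT-based index, or B-SLCA, so this sketch assumes plain Dewey-style labels (prefixed with a partition id), an in-memory dictionary as the index, and a brute-force SLCA (smallest lowest common ancestor) search. The function names `map_label`, `reduce_index`, and `slca` are hypothetical, and only element text is indexed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_label(partition_id, xml_text):
    """Phase 1, map step (assumed form): label every element of one XML
    partition with a Dewey-style tuple and emit (keyword, label) pairs."""
    pairs = []

    def walk(node, label):
        # Index the words of the element's own text (tails are ignored here).
        for word in (node.text or "").split():
            pairs.append((word.lower(), label))
        for position, child in enumerate(node):
            walk(child, label + (position,))

    # The partition root gets the one-component label (partition_id,).
    walk(ET.fromstring(xml_text), (partition_id,))
    return pairs


def reduce_index(pairs):
    """Phase 1, reduce step (assumed form): merge the emitted pairs into
    an inverted index mapping keyword -> sorted list of labels."""
    index = defaultdict(set)
    for keyword, label in pairs:
        index[keyword].add(label)
    return {keyword: sorted(labels) for keyword, labels in index.items()}


def slca(index, keywords):
    """Phase 2 (brute force, not B-SLCA): return labels of nodes whose
    subtree contains every keyword and that have no descendant doing so."""
    lists = [index.get(k.lower(), []) for k in keywords]
    if not lists or any(not labels for labels in lists):
        return []

    def contains(node, labels):
        # A node contains a keyword if its label prefixes an occurrence.
        return any(label[:len(node)] == node for label in labels)

    # Every ancestor-or-self of an occurrence of the first keyword.
    candidates = {l[:n] for l in lists[0] for n in range(1, len(l) + 1)}
    common = {v for v in candidates if all(contains(v, ls) for ls in lists)}
    # Keep only the lowest nodes: drop any node with a qualifying descendant.
    return sorted(v for v in common
                  if not any(u != v and u[:len(v)] == v for u in common))
```

In an actual Hadoop job the map and reduce steps would run on separate nodes and the query would be issued against each partition's index in parallel; here the same three functions are simply called in sequence on one partition to show the data flow.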

Cited By

Jeong, H., Park, B., Park, M., Kim, K., Choi, K.: Big data and rule-based recommendation system in Internet of Things. Cluster Computing 22(1), 1837-1846 (2019). Online publication date: 1 January 2019.
https://dl.acm.org/doi/10.1007/s10586-017-1078-y

Published In

Cluster Computing Volume 19, Issue 4

December 2016

625 pages

ISSN:1386-7857

Copyright © 2016 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States
