A survey of large-scale analytical query processing in MapReduce

Authors:

Christos Doulkeridis,

Kjetil Nørvåg

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 23, Issue 3

Pages 355 - 380

https://doi.org/10.1007/s00778-013-0319-9

Published: 01 June 2014

Abstract

Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that need to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features that include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aims at improving its efficiency while maintaining its desirable properties. This survey aims to review the state of the art in improving the performance of parallel query processing using MapReduce. A set of the most significant weaknesses and limitations of MapReduce is discussed at a high level, along with techniques for addressing them. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target. Based on the proposed taxonomy, a classification of existing research is provided focusing on the optimization objective. In conclusion, we outline interesting directions for future parallel data processing systems.

A survey of large-scale analytical query processing in MapReduce
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Information

Published In

The VLDB Journal — The International Journal on Very Large Data Bases Volume 23, Issue 3

June 2014

162 pages

ISSN:1066-8888

Copyright © 2014 Springer-Verlag Berlin Heidelberg.

Publisher

Springer-Verlag

Berlin, Heidelberg

Bibliometrics

Total Citations: 53
Total Downloads: 1,192
Downloads (Last 12 months): 28
Downloads (Last 6 weeks): 9

Reflects downloads up to 14 Oct 2024

Cited By

Das P, Xhebraj A, Rompf T, Chiba S, Thüm T (2024) Specializing Data Access in a Distributed File System (Generative Pearl). Proceedings of the 23rd ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (44-52). Online publication date: 21-Oct-2024
https://dl.acm.org/doi/10.1145/3689484.3690736

Li Y, Jiang H, Shen Y, Fang Y, Yang X, Huang D, Zhang X, Zhang W, Zhang C, Chen P, Cui B (2023) Towards General and Efficient Online Tuning for Spark. Proceedings of the VLDB Endowment 16:12 (3570-3583). Online publication date: 12-Sep-2023
https://dl.acm.org/doi/10.14778/3611540.3611548

Yankovitch M, Kolchinsky I, Schuster A, Ives Z, Bonifati A, El Abbadi A (2022) HYPERSONIC: A Hybrid Parallelization Approach for Scalable Complex Event Processing. Proceedings of the 2022 International Conference on Management of Data (1093-1107). Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517829

Yin F, Shi F (2022) A Comparative Survey of Big Data Computing and HPC: From a Parallel Programming Model to a Cluster Architecture. International Journal of Parallel Programming 50:1 (27-64). Online publication date: 1-Feb-2022
https://dl.acm.org/doi/10.1007/s10766-021-00717-y

Doulkeridis C, Vlachou A, Pelekis N, Theodoridis Y (2021) A Survey on Big Data Processing Frameworks for Mobility Analytics. ACM SIGMOD Record 50:2 (18-29). Online publication date: 31-Aug-2021
https://dl.acm.org/doi/10.1145/3484622.3484626

Gévay G, Soto J, Markl V (2021) Handling Iterations in Distributed Dataflow Systems. ACM Computing Surveys 54:9 (1-38). Online publication date: 8-Oct-2021
https://dl.acm.org/doi/10.1145/3477602

Herodotou H, Chen Y, Lu J (2020) A Survey on Automatic Parameter Tuning for Big Data Processing Systems. ACM Computing Surveys 53:2 (1-37). Online publication date: 26-Apr-2020
https://dl.acm.org/doi/10.1145/3381027

Tampakis P, Doulkeridis C, Pelekis N, Theodoridis Y (2020) Distributed Subtrajectory Join on Massive Datasets. ACM Transactions on Spatial Algorithms and Systems 6:2 (1-29). Online publication date: 4-Feb-2020
https://dl.acm.org/doi/10.1145/3373642

Hashem I, Anuar N, Marjani M, Ahmed E, Chiroma H, Firdaus A, Abdullah M, Alotaibi F, Ali W, Yaqoob I, Gani A (2020) MapReduce scheduling algorithms: a review. The Journal of Supercomputing 76:7 (4915-4945). Online publication date: 1-Jul-2020
https://dl.acm.org/doi/10.1007/s11227-018-2719-5

Zeng Y, Zhou Y, Zhou X, Zheng F (2020) RETRACTED ARTICLE: Fuzzy clustering-based skyline query preprocessing algorithm for large-scale flow data analysis. The Journal of Supercomputing 76:2 (1321-1330). Online publication date: 1-Feb-2020
https://dl.acm.org/doi/10.1007/s11227-018-2523-2