Abstract
Outer joins are ubiquitous in many workloads but are sensitive to load-balancing problems. Current approaches mitigate the load imbalance caused by data skew by using (partial) replication. However, contemporary replication-based approaches (1) introduce overhead, since they usually result in redundant data movement, (2) are sensitive to parameter tuning and to the degree of data skew, and (3) typically require that one side of the join is small. In this paper, we propose a novel parallel algorithm, Redistribution and Efficient Query with Counters (REQC), designed to be robust to the sizes of the join sides, variation in skew, and parameter tuning. Experimental results demonstrate that our algorithm is faster, more robust, and less demanding in terms of network bandwidth than the state of the art.
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Cheng, L., Kotoulas, S., Ward, T.E., Theodoropoulos, G. (2014). Robust and Efficient Large-Large Table Outer Joins on Distributed Infrastructures. In: Silva, F., Dutra, I., Santos Costa, V. (eds) Euro-Par 2014 Parallel Processing. Euro-Par 2014. Lecture Notes in Computer Science, vol 8632. Springer, Cham. https://doi.org/10.1007/978-3-319-09873-9_22
DOI: https://doi.org/10.1007/978-3-319-09873-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09872-2
Online ISBN: 978-3-319-09873-9
eBook Packages: Computer Science, Computer Science (R0)