DOI: 10.1145/2675743.2771827
research-article

Efficient key grouping for near-optimal load balancing in stream processing systems

Published: 24 June 2015

Abstract

Key grouping is a technique used by stream processing frameworks to simplify the development of parallel stateful operators. Through key grouping, a stream of tuples is partitioned into several disjoint sub-streams depending on the values contained in the tuples themselves. Each operator instance that is the target of a sub-stream is guaranteed to receive all the tuples containing a specific key value. Key grouping is commonly implemented with hash functions, which, however, are known to cause load imbalances on the target operator instances when the input data stream has a skewed value distribution. In this paper we present DKG, a novel approach to key grouping that provides near-optimal load distribution for input streams with skewed value distributions. DKG starts from the simple observation that, with such inputs, the load balance is strongly driven by the most frequent values; it identifies these values and explicitly maps them to sub-streams, together with groups of less frequent items, to achieve a near-optimal load balance. We provide theoretical approximation bounds for the quality of the mapping derived by DKG and show, through both simulations and a running prototype, its impact on stream processing applications.
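
The idea in the abstract can be illustrated with a small sketch. It is not the DKG algorithm from the paper: the abstract does not say how frequent values are detected or how the mapping is computed, so this example assumes exact frequencies are known in advance and swaps in a simple greedy "largest item to the least-loaded instance" placement. The names `stable_hash`, `hash_grouping_loads`, `frequency_aware_mapping`, `heavy_share`, and `n_buckets` are all hypothetical and chosen only for illustration.

```python
import zlib
from collections import Counter


def stable_hash(key):
    """Deterministic hash (Python's built-in hash() is salted for strings)."""
    return zlib.crc32(str(key).encode("utf-8"))


def zipf_like_freqs(n):
    """Synthetic skewed input: key i appears about 10000 / (i + 1) times."""
    return {f"k{i}": 10_000 // (i + 1) for i in range(n)}


def hash_grouping_loads(freqs, k):
    """Per-instance load when every key is routed to stable_hash(key) % k."""
    loads = [0] * k
    for key, f in freqs.items():
        loads[stable_hash(key) % k] += f
    return loads


def frequency_aware_mapping(freqs, k, heavy_share=0.01, n_buckets=64):
    """Place heavy keys one by one and sparse keys in hashed buckets.

    Assumption: exact frequencies are available. Placement is a greedy
    heuristic used for illustration, not the method of the paper.
    """
    total = sum(freqs.values())
    heavy = {key: f for key, f in freqs.items() if f >= heavy_share * total}

    # Sparse keys are grouped into buckets; each bucket is placed as a unit,
    # so all tuples of a given key still reach the same instance.
    bucket_freq = Counter()
    for key, f in freqs.items():
        if key not in heavy:
            bucket_freq[stable_hash(key) % n_buckets] += f

    items = [(("key", key), f) for key, f in heavy.items()]
    items += [(("bucket", b), f) for b, f in bucket_freq.items()]
    items.sort(key=lambda item: item[1], reverse=True)

    loads = [0] * k
    placement = {}
    for name, f in items:
        target = loads.index(min(loads))  # least-loaded instance so far
        placement[name] = target
        loads[target] += f

    def route(key):
        if ("key", key) in placement:
            return placement[("key", key)]
        bucket = ("bucket", stable_hash(key) % n_buckets)
        # Keys from a bucket never seen before fall back to plain hashing.
        return placement.get(bucket, stable_hash(key) % k)

    return route, loads


if __name__ == "__main__":
    freqs = zipf_like_freqs(1000)
    print("hash grouping loads:  ", hash_grouping_loads(freqs, 4))
    print("frequency-aware loads:", frequency_aware_mapping(freqs, 4)[1])
```

Running the script prints the per-instance loads for both strategies on a synthetic Zipf-like input, which makes it easy to compare the heaviest instance under each. Both strategies preserve the key grouping guarantee, since `route` is a deterministic function of the key.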

Published In

DEBS '15: Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems
June 2015
385 pages
ISBN:9781450332866
DOI:10.1145/2675743
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data streaming
  2. key grouping
  3. load balancing
  4. stream processing

Qualifiers

  • Research-article

Funding Sources

  • Italian Ministry of Education, University and Research

Conference

DEBS '15

Acceptance Rates

Overall Acceptance Rate 145 of 583 submissions, 25%

Cited By

  • (2024) Storm-Based Scheduling Method for Streaming Computing Engine. 2024 Prognostics and System Health Management Conference (PHM) (20-28). DOI: 10.1109/PHM61473.2024.00012. Online publication date: 28-May-2024
  • (2024) Load Balancing and Generalized Split State Reconciliation in Event Driven Systems. 2024 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS) (127-132). DOI: 10.1109/ACSOS61780.2024.00031. Online publication date: 16-Sep-2024
  • (2024) PA-SPS: A Predictive Adaptive Approach for an Elastic Stream Processing System. Journal of Parallel and Distributed Computing (104940). DOI: 10.1016/j.jpdc.2024.104940. Online publication date: Jun-2024
  • (2024) Adaptive key partitioning in distributed stream processing. CCF Transactions on High Performance Computing 6:2 (164-178). DOI: 10.1007/s42514-023-00179-3. Online publication date: 12-Jan-2024
  • (2023) A Frequency-aware Grouping Strategy for Stateful Operators in Distributed Stream Processing Systems. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS) (126-131). DOI: 10.1109/ICPADS60453.2023.00027. Online publication date: 17-Dec-2023
  • (2023) A survey on the evolution of stream processing systems. The VLDB Journal 33:2 (507-541). DOI: 10.1007/s00778-023-00819-8. Online publication date: 22-Nov-2023
  • (2022) Dalton. Proceedings of the VLDB Endowment 16:3 (491-504). DOI: 10.14778/3570690.3570699. Online publication date: 1-Nov-2022
  • (2022) POTUS: Predictive Online Tuple Scheduling for Data Stream Processing Systems. IEEE Transactions on Cloud Computing 10:4 (2863-2875). DOI: 10.1109/TCC.2020.3032577. Online publication date: 1-Oct-2022
  • (2021) Autonomous resource management in distributed stream processing systems. Proceedings of the 22nd International Middleware Conference: Doctoral Symposium (19-22). DOI: 10.1145/3491087.3493680. Online publication date: 6-Dec-2021
  • (2021) Trisk. Proceedings of the ACM Symposium on Cloud Computing (214-228). DOI: 10.1145/3472883.3487010. Online publication date: 1-Nov-2021