DOI: 10.1145/3318464.3389713
research-article

Prompt: Dynamic Data-Partitioning for Distributed Micro-batch Stream Processing Systems

Published: 31 May 2020

Abstract

Advances in real-world applications require high-throughput processing over large data streams. Micro-batching has been proposed to support the needs of these applications. In micro-batching, the batching and the processing of the data are interleaved: incoming data tuples are first buffered as data blocks and are then processed collectively using parallel function constructs (e.g., Map-Reduce). The size of a micro-batch is set to guarantee a response-time latency that conforms to the application's service-level agreement. In contrast to tuple-at-a-time data stream processing, micro-batching has the potential to sustain higher data rates. However, existing micro-batch stream processing systems use basic data-partitioning techniques that do not account for data skew and variable data rates. Load-awareness is necessary to maintain performance and to enhance resource utilization. A new data-partitioning scheme termed Prompt is presented that leverages the characteristics of the micro-batch processing model. In the batching phase, a frequency-aware buffering mechanism is introduced that progressively maintains run-time statistics and provides online key-based sorting as data tuples arrive. Because achieving optimal data partitioning is NP-hard in this context, a workload-aware greedy algorithm is introduced that partitions the buffered data tuples efficiently for the Map stage. In the processing phase, a load-aware distribution mechanism is presented that balances the size of the input to the Reduce stage without incurring inter-task communication overhead. Moreover, Prompt elastically adapts resource consumption according to workload changes. Experimental results using real and synthetic data sets demonstrate that Prompt is robust against fluctuations in data distribution and arrival rates. Furthermore, Prompt achieves up to 200% improvement in system throughput over state-of-the-art techniques without degradation in latency.
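
To make the batching-phase idea more concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes a simple heuristic: per-key counts are maintained while tuples are buffered, keys are then visited from most to least frequent, and partitions are filled one after another up to an equal capacity, splitting a key across partitions when it does not fit (a next-fit heuristic for bin packing with fragmentable items). The names `FrequencyAwareBuffer` and `greedy_partition` are hypothetical and exist only for this example.

```python
from collections import defaultdict
from math import ceil


class FrequencyAwareBuffer:
    """Hypothetical buffer that groups tuples by key as they arrive.

    The per-key list lengths act as the run-time frequency statistics.
    """

    def __init__(self):
        self.by_key = defaultdict(list)

    def add(self, key, value):
        # Buffer the value under its key; len(self.by_key[key]) is the key's count.
        self.by_key[key].append(value)

    def keys_by_frequency(self):
        # Most frequent keys first. sorted() is stable, so ties keep arrival order.
        return sorted(self.by_key, key=lambda k: -len(self.by_key[k]))


def greedy_partition(buffer, num_partitions):
    """Split buffered tuples into num_partitions blocks of near-equal size.

    Assumed heuristic (not taken from the paper): fill partitions in order up
    to capacity = ceil(total / num_partitions), fragmenting a key whenever the
    current partition runs out of room.
    """
    total = sum(len(values) for values in buffer.by_key.values())
    capacity = ceil(total / num_partitions) if total else 0
    partitions = [[] for _ in range(num_partitions)]
    current = 0
    for key in buffer.keys_by_frequency():
        values = buffer.by_key[key]
        start = 0
        while start < len(values):
            room = capacity - len(partitions[current])
            if room == 0:
                # Current partition is full; continue with the next one.
                current += 1
                continue
            take = min(room, len(values) - start)
            partitions[current].extend((key, v) for v in values[start:start + take])
            start += take
    return partitions


if __name__ == "__main__":
    buf = FrequencyAwareBuffer()
    for key in ["a"] * 5 + ["b"] * 2 + ["c"]:
        buf.add(key, 1)
    parts = greedy_partition(buf, num_partitions=2)
    print([len(p) for p in parts])  # [4, 4]
```

In this toy run the skewed key `a` is split across both partitions so that each Map task receives four tuples. A hash partitioner that sends all tuples of `a` to one partition could not produce blocks smaller than five tuples for the heaviest one. Splitting a key means its partial results must later be merged in the Reduce stage, which the sketch does not model.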

Supplementary Material

MP4 File (3318464.3389713.mp4)
Presentation Video

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN:9781450367356
DOI:10.1145/3318464

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data partitioning
  2. distributed data processing
  3. elastic stream processing
  4. micro-batch stream processing

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '20
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 101
  • Downloads (last 6 weeks): 18

Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) FlexSP: (1 + β)-Choice based Flexible Stream Partitioning for Stateful Operators. Proceedings of the 53rd International Conference on Parallel Processing, pp. 732-741. DOI: 10.1145/3673038.3673157. Online publication date: 12-Aug-2024
  • (2024) Adaptive key partitioning in distributed stream processing. CCF Transactions on High Performance Computing 6:2, pp. 164-178. DOI: 10.1007/s42514-023-00179-3. Online publication date: 12-Jan-2024
  • (2023) TreeSensing: Linearly Compressing Sketches with Flexibility. Proceedings of the ACM on Management of Data 1:1, pp. 1-28. DOI: 10.1145/3588910. Online publication date: 30-May-2023
  • (2023) SASPAR: Shared Adaptive Stream Partitioning. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 922-935. DOI: 10.1109/ICDE55515.2023.00076. Online publication date: Apr-2023
  • (2023) Dynamic Data Partitioning in the WAFL File System. 2023 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. DOI: 10.1109/HPEC58863.2023.10363474. Online publication date: 25-Sep-2023
  • (2023) Micro-batch and data frequency for stream processing on multi-cores. The Journal of Supercomputing 79:8, pp. 9206-9244. DOI: 10.1007/s11227-022-05024-y. Online publication date: 9-Jan-2023
  • (2022) Dalton. Proceedings of the VLDB Endowment 16:3, pp. 491-504. DOI: 10.14778/3570690.3570699. Online publication date: 1-Nov-2022
  • (2022) Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores. 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp. 10-17. DOI: 10.1109/PDP55904.2022.00011. Online publication date: Mar-2022
  • (2022) An Adaptive Scheduling Framework for Distributed Key-Value Stores Using RDMA. 2022 8th Annual International Conference on Network and Information Systems for Computers (ICNISC), pp. 605-611. DOI: 10.1109/ICNISC57059.2022.00124. Online publication date: Sep-2022
  • (2022) Adaptivity in continuous massively parallel distance-based outlier detection. Computing 104:12, pp. 2659-2684. DOI: 10.1007/s00607-022-01101-5. Online publication date: 12-Jul-2022
