TR-Spark: Transient Computing for Big Data Analytics

Authors:

Thomas Moscibroda

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing

Pages 484 - 496

https://doi.org/10.1145/2987550.2987576

Published: 05 October 2016

Abstract

Large-scale public cloud providers invest billions of dollars into their cloud infrastructure and operate hundreds of thousands of servers across the globe. For various reasons, much of this provisioned server capacity runs at low average utilization, and there is tremendous competitive pressure to increase utilization. Conceptually, the way to increase utilization is clear: run time-insensitive batch-job workloads as secondary background tasks whenever server capacity is underutilized, and evict these workloads when the server's primary task requires more resources. Big data analytics tasks would seem to be an ideal fit to run opportunistically on such transient resources in the cloud. In reality, however, modern distributed data processing systems such as MapReduce or Spark are designed to run as the primary task on dedicated hardware, and they perform poorly on transiently available resources because of the excessive cost of cascading re-computations in case of evictions.

In this paper, we propose a new framework for big data analytics on transient resources. Specifically, we design and implement TR-Spark, a version of Spark that can run highly efficiently as a secondary background task on transient (evictable) resources. The design of TR-Spark is based on two principles: (1) scheduling that is aware of resource stability and data size reduction, and (2) lineage-aware checkpointing. The combination of these principles allows TR-Spark to naturally adapt to the stability characteristics of the underlying compute infrastructure. Evaluation results show that while regular Spark effectively fails to finish a job in clusters of even moderate instability, TR-Spark performs nearly as well as Spark running on stable resources.
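
The cascading re-computation cost mentioned in the abstract can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not TR-Spark's algorithm and not Spark's API: the function `recompute_cost`, the stage names, and the per-stage costs are all hypothetical. It walks a lineage graph backwards from a lost output and sums the cost of every ancestor that did not survive an eviction, which shows why a checkpoint placed late in the lineage shortens the cascade.

```python
from typing import Dict, List, Set


def recompute_cost(
    target: str,
    parents: Dict[str, List[str]],
    cost: Dict[str, float],
    available: Set[str],
) -> float:
    """Cost of rebuilding `target` when only outputs in `available` survive.

    Hypothetical model: an output that is not available must be recomputed,
    which in turn requires every unavailable ancestor to be recomputed too.
    """
    seen: Set[str] = set()  # ensures a shared ancestor is only paid for once

    def visit(node: str) -> float:
        if node in available or node in seen:
            return 0.0
        seen.add(node)
        return cost[node] + sum(visit(p) for p in parents.get(node, []))

    return visit(target)


# Hypothetical four-stage chain A -> B -> C -> D; costs are arbitrary units.
parents = {"B": ["A"], "C": ["B"], "D": ["C"]}
cost = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}

# After an eviction, only checkpointed outputs are assumed to survive.
print(recompute_cost("D", parents, cost, available=set()))   # 10.0
print(recompute_cost("D", parents, cost, available={"B"}))   # 3.0
print(recompute_cost("D", parents, cost, available={"C"}))   # 1.0
```

In this toy chain, losing everything forces the whole lineage to be rebuilt, while a surviving checkpoint of `C` limits the work to the final stage. The model ignores the cost of writing checkpoints, which any real policy would have to weigh against the saved re-computation.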

Index Terms

TR-Spark: Transient Computing for Big Data Analytics
1. Hardware
  1. Hardware test
  2. Robustness

Published In

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing

October 2016

534 pages

ISBN:9781450345255

DOI:10.1145/2987550

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SoCC '16

SoCC '16: ACM Symposium on Cloud Computing

October 5 - 7, 2016

Santa Clara, CA, USA

Acceptance Rates

SoCC '16 Paper Acceptance Rate 38 of 151 submissions, 25%;

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics

Article Metrics

Total Citations: 76
Total Downloads: 1,064
Downloads (last 12 months): 22
Downloads (last 6 weeks): 2

Reflects downloads up to 09 Nov 2024
