Lessons learned from a year's worth of benchmarks of large data clouds

Authors:

Robert L. Grossman

MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers

Article No.: 3, Pages 1 - 6

https://doi.org/10.1145/1646468.1646471

Published: 16 November 2009

Abstract

In this paper, we discuss some of the lessons that we have learned from working with the Hadoop and Sector/Sphere systems. Both are cloud-based systems designed to support data-intensive computing, and both include a distributed file system and a closely coupled system for processing data in parallel. Hadoop uses MapReduce, while Sphere can execute an arbitrary user-defined function over the data managed by Sector. We compare and contrast these systems and discuss some of the design trade-offs necessary in data-intensive computing. In our experimental studies over the past year, Sector/Sphere has consistently performed about 2 to 4 times faster than Hadoop. We discuss some of the reasons that might be responsible for this difference in performance.
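The two processing models contrasted in the abstract can be illustrated with a toy, single-process sketch. This is not the API of Hadoop or Sphere: every name below (`run_map_reduce`, `run_udf`, the sample segments) is hypothetical, and the in-memory lists merely stand in for data held by a distributed file system.

```python
from collections import defaultdict


def run_map_reduce(segments, mapper, reducer):
    """MapReduce style: the mapper emits (key, value) pairs, values are
    grouped by key, and the reducer runs once per key."""
    groups = defaultdict(list)
    for segment in segments:
        for record in segment:
            for key, value in mapper(record):
                groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def run_udf(segments, udf):
    """UDF style: one arbitrary function is applied to each data segment,
    and the per-segment results are returned unchanged."""
    return [udf(segment) for segment in segments]


def word_mapper(record):
    # Emit (word, 1) for every whitespace-separated word in the record.
    return [(word, 1) for word in record.split()]


def sum_reducer(key, values):
    # Add up all the counts collected for one key.
    return sum(values)


def count_words_in_segment(segment):
    # An arbitrary per-segment function: total number of words.
    return sum(len(record.split()) for record in segment)


# Illustrative data: two segments, each a list of text records.
SEGMENTS = [["a b a", "c"], ["b a"]]

WORD_COUNTS = run_map_reduce(SEGMENTS, word_mapper, sum_reducer)
SEGMENT_TOTALS = run_udf(SEGMENTS, count_words_in_segment)
```

In this sketch the MapReduce path always passes through a group-by-key step, whereas the UDF path imposes no structure beyond "apply a function to each segment". It illustrates only the programming models and says nothing about the performance difference reported in the abstract.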

Published In

MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers

November 2009, 131 pages

ISBN: 9781605587141

DOI: 10.1145/1646468

Conference Chairs: Ioan Raicu (Northwestern University), Ian Foster (University of Chicago & Argonne National Laboratory), Yong Zhao (Microsoft)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

Research-article

Conference

SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis

November 16, 2009, Portland, Oregon

Sponsor: SIGARCH