DOI: 10.1145/2132876.2132889
research-article

MATE-EC2: a middleware for processing data with AWS

Published: 14 November 2011
    Abstract

    Recently, there has been growing interest in using Cloud resources for a variety of high-performance and data-intensive applications. While there are currently a number of commercial Cloud service providers, Amazon Web Services (AWS) appears to be the most widely used. One of the main services that AWS offers is the Simple Storage Service (S3), which provides unbounded, reliable storage of data and is therefore particularly amenable to data-intensive processes. These applications need support for effective retrieval and processing of data stored in S3 environments.
    In this paper, we focus on parallel and scalable processing of data stored in S3 using compute instances in AWS. We describe a middleware that allows data processing to be specified through a high-level API, which is a variant of the Map-Reduce paradigm. We show various optimizations, including data organization, job assignment, and data retrieval strategies, that can be leveraged based on the performance characteristics of S3. Our middleware can also effectively use a heterogeneous collection of EC2 instances for data processing. Our detailed experimental study further evaluates which factors affect the efficiency of retrieving and processing S3 data. We compare our middleware with Amazon Elastic Map-Reduce and show how we determine the best configuration for data processing on AWS.
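
The abstract mentions chunked data organization, job assignment across heterogeneous instances, and a Map-Reduce-style API, but does not spell out any of them. The following is a minimal illustrative sketch only, not the paper's API or implementation. It assumes a pull-based scheme in which faster workers take more chunks per scheduling round, each worker folds its chunks into a local accumulator, and the accumulators are merged at the end. All names (`split_into_chunks`, `local_reduce`, `run_job`, `worker_speeds`) are hypothetical, and in-memory strings stand in for S3 objects.

```python
from collections import Counter, deque


def split_into_chunks(records, chunk_size):
    """Group records into fixed-size chunks (hypothetical stand-ins for S3 objects)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def local_reduce(chunk, accumulator):
    """Fold one chunk into a worker's local accumulator (here: word counts)."""
    for record in chunk:
        accumulator.update(record.split())
    return accumulator


def run_job(records, chunk_size, worker_speeds):
    """Simulate pull-based job assignment over heterogeneous workers.

    worker_speeds maps a worker name to the number of chunks it can take
    per scheduling round (an assumed stand-in for instance capability).
    Returns the merged result and the number of chunks each worker handled.
    """
    pending = deque(split_into_chunks(records, chunk_size))
    local = {name: Counter() for name in worker_speeds}
    chunks_done = {name: 0 for name in worker_speeds}

    while pending:
        for name, speed in worker_speeds.items():
            for _ in range(speed):
                if not pending:
                    break
                local_reduce(pending.popleft(), local[name])
                chunks_done[name] += 1

    # Global combine: merge every worker's local accumulator.
    total = Counter()
    for accumulator in local.values():
        total.update(accumulator)
    return total, chunks_done


if __name__ == "__main__":
    data = ["a b", "b c", "a a", "c", "b", "a c"]
    counts, load = run_job(data, chunk_size=1, worker_speeds={"large": 2, "small": 1})
    print(dict(counts), load)
```

Because workers pull chunks rather than receive a fixed share up front, the faster worker in this toy setup ends up processing more of the data, which is the general idea behind balancing load on instances of differing capability.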


    Cited By

    • (2019) QARPF: A QoS-Aware Active Resource Provisioning Framework Based on OpenStack. 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pages 1568-1576. DOI: 10.1109/SmartWorld-UIC-ATC-SCALCOM-IOP-SCI.2019.00281. Online publication date: Aug-2019.
    • (2012) Time and Cost Sensitive Data-Intensive Computing on Hybrid Clouds. Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012), pages 636-643. DOI: 10.1109/CCGrid.2012.95. Online publication date: 13-May-2012.

    Published In

    MTAGS '11: Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
    November 2011
    76 pages
    ISBN:9781450311458
    DOI:10.1145/2132876
    General Chairs: Ioan Raicu, Ian Foster, Yong Zhao
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. AWS
    2. EC2
    3. S3
    4. cloud computing
    5. mapreduce

    Conference

    SC '11