DOI: 10.1145/2465848.2465849
research-article

Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis

Published: 17 June 2013

Abstract

    Scientific facilities such as the Advanced Light Source (ALS) and the Joint Genome Institute, and projects such as the Materials Project, have an increasing need to capture, store, and analyze dynamic semi-structured data and metadata. A similar growth of semi-structured data within large Internet service providers has led to the creation of NoSQL data stores for scalable indexing and MapReduce for scalable parallel analysis. MapReduce and NoSQL stores have been applied to scientific data. Hadoop, the most popular open-source implementation of MapReduce, has been evaluated, utilized, and modified to address the needs of different scientific analysis problems. ALS and the Materials Project are using MongoDB, a document-oriented NoSQL store. However, there is limited understanding of the performance trade-offs of using these two technologies together. In this paper we evaluate the performance, scalability, and fault tolerance of using MongoDB with Hadoop, with the goal of identifying the right software environment for scientific data analysis.



    Published In

    Science Cloud '13: Proceedings of the 4th ACM workshop on Scientific cloud computing
    June 2013
    64 pages
    ISBN:9781450319799
    DOI:10.1145/2465848

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Hadoop
    2. MapReduce
    3. MongoDB
    4. NoSQL
    5. distributed computing
    6. scientific computing

    Qualifiers

    • Research-article

    Conference

    HPDC'13

    Acceptance Rates

    Science Cloud '13 Paper Acceptance Rate 7 of 14 submissions, 50%;
    Overall Acceptance Rate 44 of 151 submissions, 29%
