DOI: 10.1145/2016741.2016780
research-article

Benefits of NoSQL databases for portals & science gateways

Published: 18 July 2011

Abstract

Portals and gateways are increasingly offering users complex interfaces to interact with massive data sets. As dealing with big data becomes more commonplace, portal and gateway developers need to readdress how data is stored and rethink the supporting infrastructure that enables quick and simple access and analysis of data. It is becoming evident that traditional, relational databases are not always the most appropriate solution to allow users on-demand access to big data sets. In this study we show that using non-relational, "NoSQL" databases such as key-value stores and document stores can offer large benefits in performance, accessibility, and availability. We present a use case from the TeraGrid User Portal that demonstrates solutions for processing and auditing user job data efficiently in order to provide users rapid access to this data.
One of the goals of the TeraGrid User Portal is to offer users and PIs detailed job statistics, such as service unit (SU) usage and job history, via the user portal interface. While building a portal application to analyze batch job data records in the TeraGrid Central Database (TGCDB), we quickly ran into stumbling blocks. The TGCDB has over 17 million job records from December 2003 through March 2011. Between January 2011 and April 2011 alone, there are over 2.8 million job records. This data is growing at an ever-faster rate and will continue to grow as new computing resources become available. Even with properly indexed tables, queries took too long for a portal application to remain responsive. The existing solution was to cache the job query results and serve those cached results in the portal. This solved the query speed issue, but it did not address the problem of dealing with such a massive data set: we still needed the rich query interface that a database provides.
To solve these issues, we looked at two different options. First, we tested moving the TGCDB to a newer, faster machine than the one it currently runs on, to determine how much of the bottleneck was due to aging hardware. Second, we tested migrating the jobs data off the relational PostgreSQL TGCDB and into a key-value store, using Apache CouchDB instead of the flat-file cache we had been using. CouchDB is a document-oriented database that is queried using MapReduce. CouchDB also offers specific benefits for portals and gateways: it provides a RESTful JSON API that can be accessed using HTTP requests.
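The CouchDB query model described above can be illustrated with a minimal sketch. It is not the portal's actual code: the document fields (`user`, `resource`, `su_charged`), the sample records, the database name `jobs`, and the design document and view names (`usage`, `su_by_user`) are all hypothetical. The first part emulates a grouped map/reduce view in plain Python (real CouchDB views are usually written in JavaScript and stored in a design document). The second part builds a URL for CouchDB's HTTP view endpoint, `/{db}/_design/{ddoc}/_view/{view}`, whose query parameters must be JSON-encoded.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode

# Hypothetical job documents; field names and values are assumptions.
JOBS = [
    {"user": "alice", "resource": "ranger", "su_charged": 120.0},
    {"user": "alice", "resource": "kraken", "su_charged": 30.5},
    {"user": "bob", "resource": "ranger", "su_charged": 64.0},
]


def map_job(doc):
    """Map step: emit one (key, value) pair per job document."""
    yield doc["user"], doc["su_charged"]


def run_view(docs, map_fn, reduce_fn):
    """Emulate a grouped view: apply map, group by key, reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # CouchDB returns view rows sorted by key; sorted() mimics that here.
    return [{"key": k, "value": reduce_fn(groups[k])} for k in sorted(groups)]


def view_url(base, db, ddoc, view, **params):
    """Build a CouchDB view URL; each parameter value is JSON-encoded."""
    query = urlencode({name: json.dumps(val) for name, val in params.items()})
    return f"{base}/{db}/_design/{ddoc}/_view/{view}?{query}"


# Total SUs charged per user, computed locally.
rows = run_view(JOBS, map_job, sum)

# URL a portal could request over HTTP to get the same result for one user.
url = view_url("http://localhost:5984", "jobs", "usage", "su_by_user",
               group=True, key="alice")
```

Because the view result is addressed by a plain URL and returned as JSON, a portal page could fetch it with any HTTP client, which is the accessibility benefit the abstract points to.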
Our initial tests have shown that moving the TGCDB to new hardware can provide a query speedup of 3.7x on average for the job queries we tested. Querying the same data using MapReduce queries to CouchDB gave an additional 8.24x speedup, for a total speedup of 30.6x over the current TGCDB on average. The huge speedups offered by CouchDB come at the cost of additional disk usage. CouchDB maintains B-tree indices on the document store as well as on any defined queries, or "views". These indices use more disk space than a relational database, but they enable CouchDB to take full advantage of high-performance disks and file systems.
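As a quick check of how the reported figures relate, successive speedups compose by multiplication. The sketch below uses only the rounded factors quoted above; their product is about 30.5x, slightly under the reported 30.6x, a gap that is presumably due to rounding in the published factors (an assumption, since the abstract does not say). The 10-second baseline query time is hypothetical.

```python
# Speedup factors as reported in the abstract (rounded figures).
HARDWARE_SPEEDUP = 3.7   # TGCDB on new hardware vs. the current TGCDB
COUCHDB_SPEEDUP = 8.24   # CouchDB vs. the TGCDB on new hardware


def combined_speedup(*factors):
    """Successive speedups compose by multiplication."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


total = combined_speedup(HARDWARE_SPEEDUP, COUCHDB_SPEEDUP)

# Hypothetical baseline: a query taking 10 s on the current TGCDB.
example_seconds = 10.0 / total
```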
We show that the performance gained from using a data warehouse for certain large data sets can offer great benefits when building on-demand data analysis tools in portals and gateways. By identifying large data sets such as the TeraGrid jobs data and migrating them to high-performance data stores such as CouchDB, we can make much more information readily available to users.

Cited By

  • (2013) Document-Based Databases for Medical Information Systems and Crisis Management. International Journal of Information Systems for Crisis Response and Management 5:3 (63-80). DOI: 10.4018/ijiscram.2013070104. Online publication date: 1-Jul-2013

    Published In

    TG '11: Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery
    July 2011
    256 pages
    ISBN:9781450308885
    DOI:10.1145/2016741

    Sponsors

    • University of Illinois

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. CouchDB
    2. MapReduce
    3. NoSQL
    4. PostgreSQL
    5. big data
    6. data
    7. databases

    Qualifiers

    • Research-article

    Conference

    TG'11
    Sponsor:
    • University of Illinois
    TG'11: TeraGrid 2011
    July 18 - 21, 2011
    Salt Lake City, Utah
