DOI:10.1145/2484838.2484856

Accelerating gene context analysis using bitmaps

Published: 29 July 2013

Abstract

Gene context analysis determines the function of genes by examining the conservation of chromosomal gene clusters and of co-occurrence functional profiles across genomes. It is based on the observation that functionally related genes are often collocated on chromosomes as part of so-called "gene cassettes", and it relies on identifying such cassettes across a statistically significant and phylogenetically diverse collection of genomes. Gene context analysis is an important part of a genomic data management system such as the Integrated Microbial Genomes (IMG) system, which has one of the largest public genome collections. As of January 2013, IMG contains 3.3 million gene cassettes across 8,000 genomes. A gene context analysis in IMG performs many millions of comparisons among the cassettes and their functions. In a traditional relational database management system, these cassettes and their functional characteristics are represented by a correlation table of more than 2 billion rows along with a dozen auxiliary tables. This correlation table requires 16.5 hours to build, and a typical query requires 5 to 10 minutes to answer.
We developed an alternative approach that encodes the cassettes and their functions using bitmaps. Reading the input data now takes about 1.5 hours and constructing the bitmap representations takes only 8 minutes. Together, this amounts to less than one tenth of the time needed to build the correlation table. Furthermore, fairly complex queries can now be answered in seconds. In this work, we considered three basic forms of queries required to support gene context analysis and devised two different bitmap representations to answer them. These basic queries can be answered in less than a second. A more complex query, which we refer to as a "killer query", requires the examination of multi-way cross-products of all cassettes. We developed a progressive pruning strategy that effectively reduces the number of possible combinations examined. Tests have shown that we can now answer "killer queries" in seconds. Even for an extremely complex "killer query" involving 161 genomes (needing a 161-way cross-product), our algorithm took less than 10 seconds. A query involving this many genomes is expected to take so much time using a traditional DBMS that it has never been attempted before. Working with the IMG developers, we have verified our implementation and have integrated it into the production version of IMG.
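The abstract does not spell out the two bitmap representations, so the following is only a minimal sketch of the general idea, under stated assumptions: each function gets one bitmap, and bit i is set when cassette i carries that function. The names `build_function_bitmaps`, `cassettes_with_all`, and `cooccurrence_count`, as well as the function labels `F1` to `F3`, are hypothetical and do not come from the paper or from IMG. Python integers stand in for the compressed bitmaps a production system would likely use. The early exit on an empty intersection is only loosely analogous to pruning and is not the progressive pruning strategy of the paper.

```python
from collections import defaultdict


def build_function_bitmaps(cassettes):
    """Map each function label to an int whose bit i is set
    when cassette i contains that function."""
    bitmaps = defaultdict(int)
    for idx, functions in enumerate(cassettes):
        for func in functions:
            bitmaps[func] |= 1 << idx
    return dict(bitmaps)


def bits_to_indices(bits):
    """List the positions of the set bits, lowest first."""
    indices = []
    idx = 0
    while bits:
        if bits & 1:
            indices.append(idx)
        bits >>= 1
        idx += 1
    return indices


def cassettes_with_all(bitmaps, functions):
    """Return indices of cassettes that contain every listed function.

    The intersection is a chain of bitwise ANDs; it stops as soon as
    the running result is empty, since no later AND can revive it.
    """
    functions = list(functions)
    if not functions:
        return []
    result = bitmaps.get(functions[0], 0)
    for func in functions[1:]:
        if result == 0:
            break
        result &= bitmaps.get(func, 0)
    return bits_to_indices(result)


def cooccurrence_count(bitmaps, func_a, func_b):
    """Count cassettes in which both functions occur together."""
    both = bitmaps.get(func_a, 0) & bitmaps.get(func_b, 0)
    return bin(both).count("1")


if __name__ == "__main__":
    # Hypothetical toy data: three cassettes, each a set of function labels.
    cassettes = [{"F1", "F2"}, {"F2", "F3"}, {"F1", "F2", "F3"}]
    bitmaps = build_function_bitmaps(cassettes)
    print(cassettes_with_all(bitmaps, ["F1", "F2"]))  # [0, 2]
    print(cooccurrence_count(bitmaps, "F2", "F3"))    # 2
```

In this sketch a co-occurrence question turns into a single AND followed by a bit count, rather than a join over a multi-billion-row correlation table, which is one plausible reason bitmap encodings suit this workload.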

Published In

SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management
July 2013
401 pages
ISBN:9781450319218
DOI:10.1145/2484838
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SSDBM '13

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%

Cited By

  • (2020) High Performance Queries Using Compressed Bitmap Indexes. Euro-Par 2019: Parallel Processing Workshops, pages 493-505. DOI:10.1007/978-3-030-48340-1_38. Online publication date: 29-May-2020
  • (2020) Optimizing bitmap index encoding for high performance queries. Concurrency and Computation: Practice and Experience, 33:18. DOI:10.1002/cpe.5943. Online publication date: 7-Sep-2020
  • (2018) Fault-Tolerant Query Execution over Distributed Bitmap Indices. 2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT), pages 21-30. DOI:10.1109/BDCAT.2018.00012. Online publication date: Dec-2018
  • (2016) IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Research, 45:D1, pages D507-D516. DOI:10.1093/nar/gkw929. Online publication date: 13-Oct-2016
  • (2014) Maintaining a microbial genome & metagenome data analysis system in an academic setting. Proceedings of the 26th International Conference on Scientific and Statistical Database Management, pages 1-11. DOI:10.1145/2618243.2618244. Online publication date: 30-Jun-2014
  • (2014) Managing PMU data sets with bitmap indexes. 2014 IEEE Conference on Technologies for Sustainability (SusTech), pages 219-224. DOI:10.1109/SusTech.2014.7046247. Online publication date: Jul-2014
  • (2014) Exploratory Analysis of Raw Data Files through Dataflows. Proceedings of the 2014 International Symposium on Computer Architecture and High Performance Computing Workshop, pages 114-119. DOI:10.1109/SBAC-PADW.2014.32. Online publication date: 22-Oct-2014
