DOI: 10.1109/CCGrid.2016.18
research-article

Indexing blocks to reduce space and time requirements for searching large data files

Published: 16 May 2016

Abstract

Scientific discoveries increasingly rely on the analysis of massive amounts of data generated by scientific experiments, observations, and simulations. The ability to directly access the most relevant data records, without sifting through all of them, becomes essential. While many indexing techniques have been developed to quickly locate the selected data records, the time and space required to build and store these indexes are often too expensive to meet the demands of in situ or real-time data analysis. Existing indexing methods generally capture information about each individual data record; however, when reading a data record, the I/O system typically has to access a whole block or page of data. In this work, we postulate that indexing blocks instead of individual data records could significantly reduce index size and index building time without increasing the I/O time for accessing the selected data records. Our experiments using multiple real datasets on a supercomputer show that the block index can reduce query time by a factor of 2 to 50 over other existing methods, including SciDB and FastQuery. Moreover, the size of the block index is almost negligible compared to the data size, and index building can reach the peak I/O speed.
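
The idea of indexing blocks rather than records can be illustrated with a small sketch. The abstract does not say what summary the authors store per block, so the code below assumes one simple possibility: a (min, max) pair for each fixed-size block. The function names `build_block_index` and `query_range` and the in-memory list standing in for a data file are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def build_block_index(values: List[float], block_size: int) -> List[Tuple[float, float]]:
    """Return one (min, max) summary per fixed-size block of `values`.

    Assumption: a min/max summary per block; the paper may store something else.
    """
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((min(block), max(block)))
    return index


def query_range(values: List[float], index: List[Tuple[float, float]],
                block_size: int, lo: float, hi: float) -> Tuple[List[int], int]:
    """Return positions i with lo <= values[i] <= hi and the number of blocks read."""
    hits = []
    blocks_read = 0
    for b, (bmin, bmax) in enumerate(index):
        # Skip blocks whose value range cannot overlap the query range.
        if bmax < lo or bmin > hi:
            continue
        blocks_read += 1
        start = b * block_size
        # A candidate block is scanned in full, mirroring block-granularity I/O.
        for offset, v in enumerate(values[start:start + block_size]):
            if lo <= v <= hi:
                hits.append(start + offset)
    return hits, blocks_read
```

In this sketch the index holds one entry per block instead of one per record, so its size shrinks by roughly the block size, while a query still reads only the blocks that may contain matches.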

Cited By

  • (2017) Empress. Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, pages 19-24. DOI: 10.1145/3149393.3149403. Online publication date: 12-Nov-2017
  • (2017) Optimizing the query performance of block index through data analysis and I/O modeling. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-10. DOI: 10.1145/3126908.3126934. Online publication date: 12-Nov-2017
  • (2017) Apply Block Index Technique to Scientific Data Analysis and I/O Systems. Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 865-871. DOI: 10.1109/CCGRID.2017.37. Online publication date: 14-May-2017

Published In

CCGRID '16: Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing
May 2016
784 pages
ISBN: 9781509024520

Publisher

IEEE Press

Author Tags

  1. block index and query
  2. parallel i/o
  3. scientific data

Qualifiers

  • Research-article

Conference

CCGrid '16
