DOI: 10.1145/3126908.3126934
research-article
Open access

Optimizing the query performance of block index through data analysis and I/O modeling

Published: 12 November 2017

Abstract

Indexing has become an efficient tool that lets scientists directly access the most relevant data records. However, building and storing indexes is expensive in both time and space under traditional approaches such as R-trees and bitmaps. We recently began to address this issue with the idea of a "block index", and our previous work showed promising results when comparing it against other well-known solutions, including ADIOS, SciDB, and FastBit. In this work, we further improve the technique from both theoretical and implementation perspectives. Driven by an extensive effort in characterizing scientific datasets and modeling I/O systems, we present a theoretical model for analyzing the query performance of the block index with respect to a given block size configuration. We also introduce three optimization techniques that achieve a 2.3x reduction in query time compared to the original implementation.
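To make the block index idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. The abstract does not say what is stored per block, so the sketch assumes the common min/max summary used by block-range indexes. The names `BlockSummary`, `build_block_index`, and `range_query` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BlockSummary:
    # Assumed per-block summary: record range plus min/max of the values.
    start: int   # index of the first record in the block
    stop: int    # one past the index of the last record in the block
    lo: float    # smallest value in the block
    hi: float    # largest value in the block


def build_block_index(values: Sequence[float], block_size: int) -> List[BlockSummary]:
    """Summarize each fixed-size block of `values` by its min and max."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append(BlockSummary(start, start + len(block), min(block), max(block)))
    return index


def range_query(
    values: Sequence[float], index: List[BlockSummary], lo: float, hi: float
) -> Tuple[List[int], int]:
    """Return (matching record indices, number of blocks actually scanned)."""
    hits: List[int] = []
    blocks_read = 0
    for b in index:
        # Skip blocks whose [lo, hi] range cannot overlap the query range.
        if b.hi < lo or b.lo > hi:
            continue
        blocks_read += 1
        for i in range(b.start, b.stop):
            if lo <= values[i] <= hi:
                hits.append(i)
    return hits, blocks_read


if __name__ == "__main__":
    data = [1, 2, 3, 4, 10, 11, 12, 13, 5, 6, 7, 8]
    idx = build_block_index(data, block_size=4)
    print(range_query(data, idx, 10, 12))
```

In this sketch, `block_size` controls the trade-off: smaller blocks can skip more data but make the index larger, while larger blocks shrink the index but force more records to be scanned per candidate block. That is one plausible reading of why the abstract analyzes query performance with respect to block size configuration.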

Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
General Chair: Bernd Mohr
Program Chair: Padma Raghavan
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. I/O system
  2. indexing
  3. modeling
  4. performance analysis
  5. scientific data

Qualifiers

  • Research-article

Conference

SC '17
Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 59
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 04 Oct 2024

Cited By

  • (2023) I/O Access Patterns in HPC Applications: A 360-Degree Survey. ACM Computing Surveys 56:2 (1-41). DOI: 10.1145/3611007. Online publication date: 15-Sep-2023
  • (2022) EMPRESS: Accelerating Scientific Discovery through Descriptive Metadata Management. ACM Transactions on Storage 18:4 (1-49). DOI: 10.1145/3523698. Online publication date: 27-Sep-2022
  • (2020) Parallel Query Service for Object-centric Data Management Systems. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (406-415). DOI: 10.1109/IPDPSW50202.2020.00076. Online publication date: May-2020
  • (2020) Predicting and Comparing the Performance of Array Management Libraries. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (906-915). DOI: 10.1109/IPDPS47924.2020.00097. Online publication date: May-2020
  • (2020) Design and Implementation of the Tianhe-2 Data Storage and Management System. Journal of Computer Science and Technology 35:1 (27-46). DOI: 10.1007/s11390-020-9799-4. Online publication date: 17-Jan-2020
  • (2019) IndexIt: Enhancing Data Locating Services for Parallel File Systems. 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (1011-1019). DOI: 10.1109/HPCC/SmartCity/DSS.2019.00145. Online publication date: Aug-2019
  • (2019) UniIndex: An index and query middleware for parallel file systems. Concurrency and Computation: Practice and Experience 32:9. DOI: 10.1002/cpe.5609. Online publication date: 19-Dec-2019
