DOI: 10.1145/1735688.1735705
research-article

Best-effort semantic document search on GPUs

Published: 14 March 2010

Abstract

Semantic indexing is a popular technique for accessing and organizing large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation on a 2.4 GHz Intel Xeon processor. We attribute this modest speedup to a mismatch between the workload characteristics of semantic indexing and the architectural features of GPUs. Unlike the regular numerical computations that have been ported to GPUs with great success, our semantic indexing algorithm (the recently proposed Supervised Semantic Indexing algorithm, SSI) has two distinguishing characteristics: the amount of parallelism in each training instance is data-dependent, and each iteration involves the product of a dense matrix with a sparse vector, which results in random memory access patterns. As a result, the baseline GPU implementation significantly under-utilizes the hardware resources (processing elements and memory bandwidth) of the GPU platform. However, the SSI algorithm also has characteristics that we collectively refer to as its "forgiving nature". These allow novel optimizations that do not strive to keep each training iteration numerically equivalent to the sequential implementation. In particular, we consider best-effort computing techniques, such as dependency relaxation and computation dropping, to alter the workload characteristics of SSI so that it better exploits the architectural features of the GPU. We also show that realizing dependency relaxation and computation dropping on a GPU is quite different from implementing them on a multicore CPU, largely because of the GPU's distinct architectural features.
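The workload described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and the "keep the largest entries" dropping policy are all assumptions made for illustration. It shows why a dense-matrix-by-sparse-vector product has data-dependent work and scattered memory accesses, and what skipping part of that work might look like.

```python
# Illustrative sketch only; every name and the dropping policy are hypothetical.

def dense_times_sparse(W, q):
    """Multiply a dense matrix W (list of rows) by a sparse vector q.

    q maps column index -> nonzero value. Only the columns named in q are
    read, so the amount of work depends on len(q) (data-dependent), and the
    columns touched are scattered across W (irregular memory access).
    """
    y = [0.0] * len(W)
    for col, val in q.items():
        for row in range(len(W)):
            y[row] += W[row][col] * val
    return y


def drop_small_terms(q, keep):
    """Computation dropping under an assumed policy: keep only the `keep`
    entries of q with the largest magnitude and skip the rest."""
    ranked = sorted(q.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:keep])


W = [[1.0, 2.0, 0.0, 4.0],
     [0.0, 1.0, 3.0, 1.0]]
q = {1: 0.5, 3: 2.0}  # sparse vector with two nonzero entries

exact = dense_times_sparse(W, q)
approx = dense_times_sparse(W, drop_small_terms(q, keep=1))
```

Here `approx` differs from `exact` because one term was skipped; whether such a deviation is acceptable depends on how forgiving the training procedure is, which is the property the abstract relies on.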
Our new techniques dramatically increase the amount of parallel workload, leading to much higher performance on the GPU. By optimizing data transfers between the CPU and GPU and by reducing GPU kernel invocation overheads, we achieve further performance gains. We evaluated our new GPU-accelerated implementation of semantic document search on a database of over 1.8 million documents from Wikipedia. With these performance-enhancing strategies, our GPU implementation on a 128-core Tesla C870 is 5.5X faster than a baseline parallel implementation on the same GPU. Compared to a baseline parallel TBB implementation on a dual-socket quad-core Intel Xeon multicore CPU (8 cores), the enhanced GPU implementation is 11X faster. Compared to a parallel implementation on the same multicore CPU that also uses dependency relaxation and computation dropping, our enhanced GPU implementation is 5X faster.
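As a rough cross-check, the three reported speedups can be combined into two further ratios. This is a derivation, not a figure stated in the abstract, and it assumes all three speedups were measured on the same workload; the symbols $T$ (running time) and their subscripts are introduced here for illustration only.

```latex
\[
\frac{T_{\text{CPU,base}}}{T_{\text{GPU,base}}}
  = \frac{T_{\text{CPU,base}} / T_{\text{GPU,enh}}}{T_{\text{GPU,base}} / T_{\text{GPU,enh}}}
  = \frac{11}{5.5} = 2.0,
\qquad
\frac{T_{\text{CPU,base}}}{T_{\text{CPU,relaxed}}}
  = \frac{T_{\text{CPU,base}} / T_{\text{GPU,enh}}}{T_{\text{CPU,relaxed}} / T_{\text{GPU,enh}}}
  = \frac{11}{5} = 2.2
\]
```

Under that assumption, the baseline GPU implementation would be about 2X faster than the baseline 8-core TBB implementation, and the best-effort techniques would give about 2.2X on the CPU against 5.5X on the GPU.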

Published In

GPGPU-3: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
March 2010
124 pages
ISBN: 9781605589350
DOI: 10.1145/1735688
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CUDA
  2. GPGPU
  3. best-effort computing
  4. dependency relaxation
  5. document search
  6. supervised semantic indexing

Qualifiers

  • Research-article

Conference

GPGPU-3

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%

Article Metrics

  • Downloads (last 12 months): 1
  • Downloads (last 6 weeks): 1

Reflects downloads up to 18 Aug 2024.

Cited By

  • (2021) An Adaptive Application Framework with Customizable Quality Metrics. ACM Transactions on Design Automation of Electronic Systems 27:2 (1-33). DOI: 10.1145/3477428. Online publication date: 2-Nov-2021
  • (2018) A language extension set to generate adaptive versions automatically. Oil & Gas Science and Technology – Revue d’IFP Energies nouvelles 73 (52). DOI: 10.2516/ogst/2018049. Online publication date: 14-Nov-2018
  • (2018) Opportunities and challenges in search interaction. Communications of the ACM 61:12 (36-38). DOI: 10.1145/3195180. Online publication date: 20-Nov-2018
  • (2018) Approximate Communication. ACM Computing Surveys 51:1 (1-32). DOI: 10.1145/3145812. Online publication date: 10-Jan-2018
  • (2018) Analysis and Classification of Shape-Changing Interfaces for Design and Application-based Research. ACM Computing Surveys 51:1 (1-32). DOI: 10.1145/3143559. Online publication date: 4-Jan-2018
  • (2018) Embeddability in the 3-Sphere Is Decidable. Journal of the ACM 65:1 (1-49). DOI: 10.1145/3078632. Online publication date: 23-Jan-2018
  • (2017) TinyLFU. ACM Transactions on Storage 13:4 (1-31). DOI: 10.1145/3149371. Online publication date: 17-Nov-2017
  • (2016) Input responsiveness: using canary inputs to dynamically steer approximation. ACM SIGPLAN Notices 51:6 (161-176). DOI: 10.1145/2980983.2908087. Online publication date: 2-Jun-2016
  • (2016) Input responsiveness: using canary inputs to dynamically steer approximation. Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (161-176). DOI: 10.1145/2908080.2908087. Online publication date: 2-Jun-2016
  • (2016) A Survey of Techniques for Approximate Computing. ACM Computing Surveys 48:4 (1-33). DOI: 10.1145/2893356. Online publication date: 18-Mar-2016
