Keyword: CUBLAS : Search

article

Out-of-core implementation for accelerator kernels on heterogeneous clouds

The Journal of Supercomputing (JSCO), Volume 74, Issue 2, Pages 551–568, https://doi.org/10.1007/s11227-017-2141-4

Cloud environments today are increasingly featuring hybrid nodes containing multicore CPU processors and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to ...

article

GPU parallelization of the sequential matrix diagonalization algorithm and its application to high-dimensional data

The Journal of Supercomputing (JSCO), Volume 73, Issue 8, Pages 3603–3634, https://doi.org/10.1007/s11227-017-1961-6

This paper presents the parallelization on a GPU of the sequential matrix diagonalization (SMD) algorithm, a method for diagonalizing polynomial covariance matrices, which is the most recent technique for polynomial eigenvalue decomposition. We first ...

article

Auto-tuned Krylov methods on cluster of graphics processing unit

International Journal of Computer Mathematics (IJOCM), Volume 92, Issue 6, Pages 1222–1250, https://doi.org/10.1080/00207160.2014.930137

article

From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resolution of extremely large dense linear systems stored on an array of solid state devices

Manuel Carcenac

The Journal of Supercomputing (JSCO), Volume 68, Issue 1, Pages 365–413, https://doi.org/10.1007/s11227-013-1043-3

This paper presents an efficient algorithmic approach to the GPU-based parallel resolution of dense linear systems of extremely large size. A formal transformation of the code of Gauss method allows us to develop for matrix calculations the concept of ...

article

A CUDA implementation of the Continuous Space Language Model

The Journal of Supercomputing (JSCO), Volume 68, Issue 1, Pages 65–86, https://doi.org/10.1007/s11227-013-1023-7

The training phase of the Continuous Space Language Model (CSLM) was implemented in the NVIDIA hardware/software architecture Compute Unified Device Architecture (CUDA). A detailed explanation of the CSLM algorithm is provided. Implementation was ...

article

Speeding up solving of differential matrix Riccati equations using GPGPU computing and MATLAB

Concurrency and Computation: Practice & Experience (CCOMP), Volume 24, Issue 12, Pages 1334–1348, https://doi.org/10.1002/cpe.1835

In this work, we developed a parallel algorithm to speed up the resolution of differential matrix Riccati equations using a backward differentiation formula algorithm based on a fixed-point method. The role and use of differential matrix Riccati ...

Article

Iterative Methods for Sparse Linear Systems on Graphics Processing Unit

HPCC '12: Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, Pages 836–842, https://doi.org/10.1109/HPCC.2012.118

Many engineering and science problems require a computational effort to solve large sparse linear systems. Krylov subspace based iterative solvers have been widely used in that direction. Iterative Krylov methods involve linear algebra operations such ...

research-article

Open Access

A unified optimizing compiler framework for different GPGPU architectures

ACM Transactions on Architecture and Code Optimization (TACO), Volume 9, Issue 2, Article No.: 9, Pages 1–33, https://doi.org/10.1145/2207222.2207225

This article presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and ...

Article

Forecasting High Dimensional Volatility Using Conditional Restricted Boltzmann Machine on GPU

IPDPSW '12: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, Pages 1979–1986, https://doi.org/10.1109/IPDPSW.2012.258

Forecasting the volatility of multivariate asset return is an important issue in financial econometric analysis, where the volatility is represented by a conditional covariance matrix (CCM). Traditional models for predicting CCM such as GARCH(1, 1) ...

article

Neville elimination on multi- and many-core systems: OpenMP, MPI and CUDA

The Journal of Supercomputing (JSCO), Volume 58, Issue 2, Pages 215–225, https://doi.org/10.1007/s11227-009-0360-z

This paper describes several parallel algorithmic variations of the Neville elimination. This elimination solves a system of linear equations making zeros in a matrix column by adding to each row an adequate multiple of the preceding one. The parallel ...

article

CUDA-enabled implementation of a neural network algorithm for handwritten digit recognition

Optical Memory and Neural Networks (SPOMNN), Volume 20, Issue 2, Pages 98–106, https://doi.org/10.3103/S1060992X11020032

Using a convolutional neural network as an example, we discuss specific aspects of implementing a learning algorithm of pattern recognition on the GPU graphics card using NVIDIA CUDA architecture. The training time of the neural network on a video-...

Article

Accelerating Image Retrieval Using Factorial Correspondence Analysis on GPU

CAIP '09: Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns, Pages 565–572, https://doi.org/10.1007/978-3-642-03767-2_69

We are interested in the intensive use of Factorial Correspondence Analysis (FCA) for large-scale content-based image retrieval. Factorial Correspondence Analysis is a useful method for analyzing textual data, and we adapt it to images using the SIFT ...

research-article

Accelerating phase unwrapping and affine transformations for optical quadrature microscopy using CUDA

GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, Pages 28–37, https://doi.org/10.1145/1513895.1513899

Optical Quadrature Microscopy (OQM) is a process which uses phase data to capture information about the sample being studied. OQM is part of an imaging framework developed by the Optical Science Laboratory at Northeastern University. In one particular ...


