CUDA Programming, GPU Computing Research Papers

Graphics Processing Units (GPUs) are specialized coprocessors that were initially conceived for the purpose of accelerating vector operations, such as graphics rendering. Writing and configuring efficient algorithms for GPU devices is... more

g-Spike, a parallel algorithm for solving general nonsymmetric tridiagonal systems for the GPU, and its CUDA implementation are described. The solver is based on the Spike framework, applying Givens rotations and QR factorization without... more

Computational epidemiology is the development and use of computational models that aim to understand the spread of diseases from a dynamic point of view. The computational models are able to simulate the behavior of an... more

CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by Nvidia which provides the ability of using GPUs to run computationally intensive programs. This presentation provides a brief overview of CUDA,... more

Volume rendering is an important area of study in computer graphics, due to its application in areas such as medicine, physics simulations, the oil and gas industries, and others. The most widely used method today for volume rendering is ray... more

The present paper discusses digital filter banks in the context of wideband radio monitoring tasks including DFT-modulated filter banks and the weighted overlap-add (WOLA) algorithm. Filter bank software-hardware implementations are... more

This thesis focuses on the development, implementation and optimization of pattern-matching algorithms in two different, yet closely related research fields: malicious code detection in intrusion detection systems and digital forensics (with a special focus on the data recovery process and the metadata collection stages it involves). The thesis introduces the motivation for the work, then presents the related work and the main achievements obtained, and finally discusses a few conclusions and future research directions.

The four main chapters of this thesis present the main contributions of our work and address the following topics. We present an efficient storage mechanism for hybrid CPU/GPU-based systems and compare it with other known approaches to date. We then propose an innovative, highly parallel approach to the fast construction of very large Aho-Corasick and Commentz-Walter pattern-matching automata on hybrid CPU/GPU-based systems, and compare it to existing sequential approaches. Next, we propose a new heuristic for profiling malicious behavior based on system-call analysis, using the Aho-Corasick algorithm, and discuss a new hybrid compression mechanism for this automaton, based on dynamic programming, that reduces the storage space it requires. Finally, we propose an efficient new method for collecting metadata and assisting the human operator or automated tools used in the data recovery process as part of computer forensic investigations.

The research and models obtained in this thesis extend the existing literature in the field of intrusion detection systems (malicious code detection in particular) by presenting: an innovative heuristic for behavioral analysis of code in executable files through system-call interception; a novel and highly efficient approach to storing pattern-matching automata in hybrid CPU/GPU-based systems, which serves as the base for an innovative model for the fast, GPU-accelerated construction of such very large automata (for both the Aho-Corasick and Commentz-Walter algorithms); and a new hybrid compression technique applied to the Aho-Corasick automaton using a dynamic programming approach, which reduces storage space significantly.
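For readers unfamiliar with the Aho-Corasick automaton mentioned in the abstract above, the following is a minimal sequential sketch in Python. It is not the thesis's hybrid CPU/GPU construction or its compressed storage format; the function names `build_automaton` and `search`, the list-of-dicts layout, and the sample patterns are assumptions made purely for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto = [{}]        # goto[state][char] -> next state (the trie edges)
    output = [set()]   # patterns recognised on entering each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pattern)

    # Breadth-first pass: fail[s] points to the state for the longest
    # proper suffix of s that is also a prefix of some pattern.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Return (start_index, pattern) for every match found in text."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in sorted(output[state]):
            matches.append((i - len(pattern) + 1, pattern))
    return matches


if __name__ == "__main__":
    # Hypothetical patterns; real signatures would be byte or system-call sequences.
    automaton = build_automaton(["he", "she", "his", "hers"])
    print(search("ushers", automaton))
```

The text is scanned once regardless of how many patterns are loaded, which is why the automaton suits signature matching over large pattern sets.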

The Graphics Processing Unit (GPU) has become an integral part of mainstream computing. The advancement and evolution of the GPU have led to significant performance improvements in many algorithms used in day-to-day life. Being powerful... more

Abstract: With technological development in the medical industry, the volume of data to be processed is expanding rapidly, and computation time is increasing due to factors such as 3D and 4D treatment planning, the increasing sophistication of MRI pulse sequences and the growing complexity of algorithms. The graphics processing unit (GPU) addresses these problems through features such as high computational throughput, high memory bandwidth, support for floating-point arithmetic and low cost. Compute Unified Device Architecture (CUDA) is a popular GPU programming model introduced by NVIDIA for parallel computing. This review paper briefly discusses the need for GPU CUDA computing in medical image analysis. The GPU performance of existing algorithms is analyzed and the computational gain is discussed. A few open issues, hardware configurations and optimization principles of existing methods are discussed. The survey summarizes a few optimization techniques for medical imaging algorithms on the GPU. Finally, the limitations and future scope of GPU programming are discussed.

1. Introduction

Computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET) and ultrasound are well-known medical imaging modalities that produce 2D, 3D and 4D medical images, which guide diagnosis and treatment planning. Medical image processing and analysis become computationally expensive as the dimensionality of medical imaging data increases [1]. A conventional CPU with a limited number of cores is not sufficient to process data of this size. The GPU is a newer technology capable of solving computational problems across engineering and medical fields. In the medical industry, the GPU is better suited to processing higher-dimensional data. GPU computation has provided a large advantage over central processing units (CPUs) in computation speed.

The GPU is a highly parallel, multithreaded, many-core processor with high memory bandwidth, which it brings to bear on computational problems [2]. The main reason for the evolution of powerful GPUs is the constant demand for greater realism in computer games. During the past few decades, the computational performance of GPUs has increased much more quickly than that of conventional CPUs. Hence the GPU plays a major role in modern industrial research and development. GPUs have already achieved significant speedups (2x-1000x) over CPU implementations in various fields [3] [4] [5]. The GPU is well suited to executing the same program on different data elements, which is called data parallelism. Data parallelism maps data elements to the parallel threads available in the GPU [6]. It gives high gains when the processing of each data element is independent of the others. The prime areas of data parallelism are 3D rendering, stereo vision, pattern recognition, image and video processing, and medical industry applications.

A large performance gap exists between the GPU and the general-purpose multi-core CPU. An architectural-level comparison of the CPU and GPU is given in Fig. 1. The design of a CPU is optimized for sequential programming. It makes use of sophisticated control logic to allow instructions from a single thread of execution to execute in parallel or even out of their sequential order while maintaining the appearance of sequential execution. Modern CPU microprocessors typically have four large processor cores designed to deliver strong sequential code performance, but this is not enough to process very large data. A basic GPU model has a large number of processor cores, ALUs, control units and various types of memory. In general, heterogeneous CPU and GPU computation is preferable to a standalone CPU or GPU implementation. Dependent processes are best run on the CPU, while independent processes can be accelerated by the GPU. A GPU running a large number of threads gives better performance.

This paper reviews the implications of the GPU programming model for medical image analysis and illustrates some applications with examples. The general framework of the medical image analysis pipeline is given in Fig. 2. The computational complexities of all these fields are increasing
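The mapping of data elements to parallel threads described in the introduction above can be illustrated with a small sketch. The Python below emulates a one-dimensional CUDA-style launch sequentially; it is not GPU code and is not taken from the reviewed paper. The names `launch`, `scale_kernel` and `scale`, and the block size of 4, are assumptions chosen for illustration.

```python
def launch(kernel, num_blocks, threads_per_block, *args):
    """Sequentially emulate a 1D grid launch.

    On a real GPU the (block, thread) pairs would run concurrently;
    here they run one after another, which gives the same result only
    because every thread works on an independent data element.
    """
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, src, dst, factor):
    """Body executed by each emulated thread: one thread, one element."""
    i = block_idx * block_dim + thread_idx   # global element index
    if i < len(src):                         # guard threads past the end
        dst[i] = src[i] * factor


def scale(src, factor, threads_per_block=4):
    """Multiply every element of src by factor using the emulated launch."""
    dst = [0] * len(src)
    # Round up so that a partially filled last block is still launched.
    num_blocks = (len(src) + threads_per_block - 1) // threads_per_block
    launch(scale_kernel, num_blocks, threads_per_block, src, dst, factor)
    return dst


if __name__ == "__main__":
    print(scale([1, 2, 3, 4, 5], 2))
```

Because no thread reads a value written by another thread, the order of execution does not matter, which is the independence property the text identifies as the source of high gains.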

With the newly available deep learning techniques, new class of problems can be tackled. This is for example the case when dealing with classification or regression problem with an input space of infinite dimension. The mapping input... more

By Randall BALESTRIERO and Romain Cosentino
Topics: Artificial Intelligence, Computer Vision, Reinforcement Learning, Machine Learning

A flow diagram is a graphical presentation of an energy or chemical system with its components and their interconnections through mass and energy streams. An automatic drawing algorithm of flow diagrams has been developed and presented in... more

This paper presents a parallel implementation of the hybrid BiCGStab(2) (bi-conjugate gradient stabilized) iterative method in a GPU (graphics processing unit) for solution of large and sparse linear systems. This implementation uses the... more

This GPU book teaches both CUDA and CPU Parallel Programming using pThreads.

In this paper, the performance of the Cyclic Reduction (CR) algorithm for solving tridiagonal systems is improved with the aid of efficient global memory transactions on Graphics Processing Units (GPU). To achieve maximum memory... more

This paper focuses on the anticipatory enhancement of methods of detecting stealth software. Cyber security detection tools are insufficiently powerful to reveal the most recent cyber-attacks which use malware. In this paper, we will... more

By Igor Korkin and Ivan Nesterow
Topics: Parallel Computing, Parallel Programming, GPU Computing, Parallel Processing

This paper presents a Graphics Processing Unit (GPU) based parallel implementation for the All Pairs Shortest Paths problem. The implementation is based on the Floyd-Warshall algorithm and takes full advantage of the highly multithreaded... more

By Lauro de Paula, Wanderley De, Wellington Martins, and walid jradi
Topics: Parallel Algorithms, Parallel Computing, Graph Theory, Parallel Programming
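As background for the All Pairs Shortest Paths entry above, here is a minimal sequential sketch of the Floyd-Warshall algorithm in Python. It does not reproduce the paper's GPU implementation; the function name `floyd_warshall`, the adjacency-matrix input and the sample graph are assumptions for illustration.

```python
INF = float("inf")


def floyd_warshall(weights):
    """Return the matrix of shortest path lengths between all vertex pairs.

    weights[i][j] is the edge weight from i to j, INF if there is no
    edge, and 0 on the diagonal. Negative cycles are not handled.
    """
    n = len(weights)
    dist = [row[:] for row in weights]   # work on a copy
    for k in range(n):                   # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


if __name__ == "__main__":
    graph = [
        [0, 3, INF, 7],
        [8, 0, 2, INF],
        [5, INF, 0, 1],
        [2, INF, INF, 0],
    ]
    for row in floyd_warshall(graph):
        print(row)
```

For a fixed k, the updates over all (i, j) pairs do not depend on one another, so a common parallelization assigns one thread per (i, j) pair and keeps only the outer loop over k sequential. Whether the paper uses exactly this decomposition is not stated in the excerpt.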

We present a fast and accurate 3D hand tracking method which relies on RGB-D data. The method follows a model based approach using a hierarchical particle filter variant to track the model’s state. The filter estimates the probability... more

Sequence alignment is fundamental in bioinformatics and is used to analyze sequences. Sequence alignment uses dynamic programming with complexity O(mn), where m and n are the sequence lengths, so aligning long sequences takes a long time. Performing the alignment in parallel can speed up the process. A parallel implementation requires a data partitioning scheme, such as rowwise or columnwise partitioning. Using the GPU is faster than using the CPU because GPU kernels are multithreaded; GPU CUDA is a technology that can handle both graphics and arithmetic. The aim of this study is to apply the global pairwise alignment algorithm on a parallel graphics processing unit using rowwise and columnwise data partitioning schemes. The study was divided into three stages: the first stage performs global alignment with the Needleman-Wunsch algorithm, the second stage parallelizes it with the rowwise scheme, and the third is the analysis. The rowwise scheme using CUDA achieved a highest speed-up of 4.86 times, while the columnwise scheme achieved a highest speed-up of 6.67 times.
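The O(mn) dynamic program behind the Needleman-Wunsch algorithm named in the abstract above can be sketched as follows. This is a plain sequential Python version that returns only the optimal alignment score; it is not the study's CUDA code, and the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell depends on its upper-left, upper and left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[rows - 1][cols - 1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

The table has (m+1)(n+1) cells, each filled in constant time, which gives the O(mn) cost. The dependency of every cell on its left and upper neighbours is what any partitioning scheme for a parallel version has to respect.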

By Bastian Rdp and Bastian Dwi Putra
Topics: Bioinformatics, Parallel Computing, Parallel Programming, GPU Computing

Fluid simulation has recently become possible to do in real-time utilizing modern programmable GPUs. In this paper, a comparison is made between more traditional CFD methods which solve the Navier-Stokes PDEs and the alternative discrete... more

This short paper presents a collection of implementations of lightweight GPU decompression algorithms within a FOSS library, Giddy, the first to be published to offer such functionality. As the use of compression is important in... more

The analysis and the understanding of object manipulation scenarios based on computer vision techniques can be greatly facilitated if we can gain access to the full articulation of the manipulating hands and the 3D pose of the manipulated... more

In general, the Discrete Fourier Transform is used in image processing to convert a time- or spatial-domain signal into a frequency-domain signal or, conversely, to convert a frequency-domain signal into a... more

This work aims to study how an efficient deep packet inspection (DPI) algorithm may be implemented using the CUDA (Compute Unified Device Architecture) enabled graphics processing unit (GPU) boards found in personal... more

A new step-by-step comprehensive MR physics simulator (MRISIMUL) of the Bloch equations is presented. The aim was to develop a magnetic resonance imaging (MRI) simulator that makes no assumptions with respect to the underlying pulse... more

Usage of multiple unmanned aerial vehicles (UAV) in a certain mission makes flight route planning more complicated and slower. In order to obtain better performance, in the literature, most of the researchers propose using evolutionary... more

Recent years have seen the rise of GPUs being used not only for graphics computing but also more broadly for parallel computing. This completely new field comes with dedicated GPUs (GPGPU, General-Purpose GPU) that are not... more

The rapid evolution of CUDA GPU architecture and the new heterogeneous platforms that break the hegemony of x86 offer opportunities for performance optimizations, but also pose challenges for scalable heterogeneous parallelization of the... more

Wideband frequency monitoring using DFT-modulated filter banks is considered. Uniform and non-uniform filter bank implementations are described, including the direct implementation with full modulation, critically... more

This article proposes a new parallelization of the particle swarm optimization (PSO) algorithm using multiple GPUs. Two implementations are proposed, based on NVIDIA's CUDA architecture and the P2P connection through the... more

It is well known that non-binary LDPC codes outperform the BER performance of binary LDPC codes for the same code length. The superior BER performance of non-binary codes comes at the expense of more complex decoding algorithms that... more

By Joao Andrade and G. Falcão
Topics: Parallel Algorithms, Parallel Computing, GPGPU (General Purpose GPU) Programming, Non-binary LDPC

A new strategy is proposed for implementing computationally intensive high-throughput decoders based on the long length irregular LDPC codes adopted in the DVB-S2 standard. It is supported on manycore graphics processing unit (GPU)... more

Molecular Dynamics (MD) is one of the processes that require High Performance Computing environments to complete their jobs. In the preparation of virtual screening experiments, MD is one of the important processes particularly for tropical... more

This paper presents a computational performance comparison between some iterative methods used for linear systems solution. The goal is to show that the use of parallel processing provided by a Graphics Processing Unit (GPU) may be more... more

The availability of Internet, line-of-sight and satellite identification and surveillance information as well as low-power, low-cost embedded systems-on-a-chip and a wide range of visible to long-wave infrared cameras prompted Embry Riddle Aeronautical University to collaborate with the University of Alaska Arctic Domain Awareness Center (ADAC) in summer 2016 to prototype a camera system we call the SDMSI (Software-Defined Multi-spectral Imager). The concept for the camera system from the start has been to build a sensor node that is drop-in-place for simple roof, marine, pole-mount, or buoy-mounts. After several years of component testing, the integrated SDMSI is now being tested, first on a roof-mount at Embry Riddle Prescott. The roof-mount testing demonstrates simple installation for the high spatial, temporal and spectral resolution SDMSI. The goal is to define and develop software and systems technology to complement satellite remote sensing and human monitoring of key resources such as drones, aircraft and marine vessels in and around airports, roadways, marine ports and other critical infrastructure. The SDMSI was installed at Embry Riddle Prescott in fall 2016, and continuous recordings of long-wave infrared and visible images have been assessed manually and compared to salient object detection, which automatically records only frames containing objects of interest (e.g. aircraft and drones). It is imagined that ultimately users of the SDMSI can pair with it via wireless to browse salient images. Further, both ADS-B (Automatic Dependent Surveillance-Broadcast) and S-AIS (Satellite Automatic Identification System) data are envisioned to be used by the SDMSI to form expectations for observing in future tests. This paper presents the preliminary results of several experiments and compares human review with smart image processing in terms of the receiver-operator characteristic. The system design and software are open architecture, such that other researchers are encouraged to construct and participate in sharing results and networking identical or improved versions of the SDMSI for safety, security and drop-in-place scientific image sensor networking.

Nomenclature
ADS-B = Automatic Dependent Surveillance – Broadcast, aviation identification and tracking
AIFC = Arctic Information Fusion Concept, an ADAC sensor network prototype
CBONS = Community Based Observing Network System, or human field monitoring
CUDA = Compute Unified Device Architecture, GP-GPU acceleration
DMM = Digital Multi-Meter, used for current monitoring and power use analysis
EO/IR = Electro-Optical / Infrared instrumentation
GPGPU = General Purpose Graphics Processing Unit
LWIR = Long Wave Infrared, typically 8-15 micron wavelength electromagnetic radiation
MWIR = Medium Wave Infrared, typically 3-8 micron wavelength electromagnetic radiation
NAS = National Airspace
NIR = Near Infrared, typically 0.75-1.4 micron wavelength electromagnetic radiation
OpenCV = Open Computer Vision, an open source library in C/C++
Panchromatic = Visible and part of VNIR electromagnetic radiation in 0.45-0.8 micron range
PCIe = Peripheral Component Interconnect Express, a device interface bus
ROC = Receiver operator Characteristic, compares true positive and false positive rates
S-AIS = Satellite Automatic Identification System – automatic marine tracking service
SDMSI = Software Defined Multi-Spectral Imager
SOD = Salient Object Detector
SWIR = Short Wave Infrared, typically 1.4-3 micron wavelength electromagnetic radiation
TAP = Trans Alaska Pipeline
UAS = Unoccupied Aerial System
USB3 = Universal Serial Bus, Revision 3, operating at 5 gigabits per second (625MB/sec)
USCG = US Coast Guard
VNIR = Visible and Near Infrared, typically 0.4-1 micron wavelength range

Introduction
The purpose of the research presented in this paper is to evaluate the hypothesis that pole-mount cameras on buoys, buildings or towers, and marine vessels can improve situational awareness for the agencies and organizations that manage campuses, ports, airports and other critical infrastructure where drone, aircraft and marine vessels cooperate, compared to the use of satellite remote sensing and human monitoring. The assertion is that a multi-spectral imaging system defined by software, providing concurrent visible and infrared image collection and processing, can also be defined and improved through software upgrades over time to perform better than security camera continuous monitoring or occasional satellite imaging. Finally, the expected result is better spatial, temporal, and spectral resolution observing of key areas of interest in regions that are hard to monitor, such as Alaska and the Arctic, compared to the methods currently employed. This hypothesis has been initially tested at Embry Riddle by monitoring shared airspace traffic including drones, aircraft and wildlife to test whether the concept of a smart SDMSI might also have value for aerial surveys and surveillance as well as marine environments. The SDMSI system design that has been prototyped, built and tested in Arizona is shown in Figure 1 below. The camera system includes a Tegra K1 SoC (4 processor
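The nomenclature above defines the ROC as a comparison of true positive and false positive rates. As a small worked illustration, the Python sketch below computes one ROC operating point from per-frame ground truth and detector decisions. The function name `roc_point` and the sample frame labels are hypothetical and are not data from the paper.

```python
def roc_point(labels, detections):
    """Return (true_positive_rate, false_positive_rate) for one detector setting.

    labels[i] is truthy if frame i really contains an object of interest;
    detections[i] is truthy if the detector flagged frame i.
    """
    pairs = list(zip(labels, detections))
    true_pos = sum(1 for truth, flagged in pairs if truth and flagged)
    false_pos = sum(1 for truth, flagged in pairs if not truth and flagged)
    positives = sum(1 for truth in labels if truth)
    negatives = len(labels) - positives
    # Rates are undefined when a class is absent; report 0.0 in that case.
    tpr = true_pos / positives if positives else 0.0
    fpr = false_pos / negatives if negatives else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Hypothetical human review (truth) versus detector output for six frames.
    truth = [1, 1, 0, 0, 1, 0]
    flagged = [1, 0, 0, 1, 1, 0]
    print(roc_point(truth, flagged))
```

Sweeping the detector's threshold and plotting the resulting points traces out the full ROC curve.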

Introduction to CUDA and STREAM.

By Juan F. González
Topics: CUDA, CUDA Programming, GPU Computing, CUDA Research

This paper presents a comparison between two parallel architectures: Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). Some works in the literature have presented a computational performance comparison of... more

Overview
CUDA stands for "Compute Unified Device Architecture", a software platform provided free of charge by NVIDIA. It enables users to control GPUs by writing programs in a language akin to C++. All CUDA software can be downloaded from CUDA Zone. CUDA cores are similar in nature to the cores in a CPU. You might have a dual- or quad-core CPU, and that is the closest analog to CUDA cores that most people will have any experience with. However, CUDA cores tend to be more specialized for stream processing, as opposed to generalized like a CPU core, so there is less logic to duplicate, making it easier to fit more CUDA cores onto a GPU. That said, CUDA's primary use seems to be allowing the GPU to be used for more general-purpose tasks, like encoding videos, or even hardware acceleration of video decoding. For gaming, I would look more at things like memory bus bandwidth, as well as memory and GPU clock speeds. If you can find a card that also has more CUDA cores than a very similar card, then even better, but I'd put CUDA cores a ways down the list for gaming. Even just considering GPUs, while NVIDIA is the most popular, it's not the only game in town; there are also AMD (formerly ATI) GPUs. They use OpenCL, which is maintained by the Khronos Group. It is a different API for GPU-like (stream, SIMD) processing. OpenCL is a standard, and you can run OpenCL code on NVIDIA hardware, but it may not run as fast as CUDA code. Using CUDA allows the programmer to take advantage of the massive parallel computing power of an NVIDIA graphics card in order to do general-purpose computation. Before continuing, it's worth talking about this for a little bit longer.

Multicore CPU and GPU
CPUs like the Intel Core 2 Duo and AMD Opteron are good at doing one or two tasks at a time, and doing those tasks very quickly. Graphics cards, on the other hand, are good at doing a massive number of tasks at the same time, and doing those tasks relatively quickly. To put this into perspective, suppose you have a 20-inch monitor with a standard resolution of 1,920 x 1,200. An NVIDIA graphics card has the computational ability to calculate the color of 2,304,000 different pixels, many times a second. In order to accomplish this feat, graphics cards use dozens, even hundreds, of ALUs. Fortunately, NVIDIA's ALUs are fully programmable, which enables us to harness an unprecedented amount of computational power in the programs that we write. As stated previously, CUDA lets the programmer take advantage of the hundreds of ALUs inside a graphics processor, which together are much more powerful than the handful of ALUs available in any CPU. However, this does put a limit on the types of applications that are well suited to CUDA.
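The pixel count quoted above can be checked with a few lines of arithmetic, along with how many thread blocks a CUDA-style launch would need to cover such an image with one thread per pixel. The Python sketch below is illustrative only: the function name `grid_for_image` is hypothetical and the 16 x 16 block size is an assumed, commonly used choice, not something specified in the text.

```python
def grid_for_image(width, height, block_w=16, block_h=16):
    """Return the pixel count and the 2D grid needed for one thread per pixel."""
    # Round up so that partially filled blocks at the edges are still launched.
    blocks_x = (width + block_w - 1) // block_w
    blocks_y = (height + block_h - 1) // block_h
    return {
        "pixels": width * height,
        "grid": (blocks_x, blocks_y),
        "threads_launched": blocks_x * blocks_y * block_w * block_h,
    }


if __name__ == "__main__":
    info = grid_for_image(1920, 1200)
    print(info["pixels"])            # 2304000
    print(info["grid"])              # (120, 75)
    print(info["threads_launched"])  # 2304000
```

At 1,920 x 1,200 both dimensions divide evenly by 16, so every launched thread has a pixel. For sizes that do not divide evenly, the surplus threads must be masked out with a bounds check inside the kernel.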

This paper proposes a parallel regression formulation to reduce the computational time of variable selection algorithms. The proposed strategy can be used for several forward algorithms in order to select uncorrelated variables that... more

This paper presents a comparison between two architectures for parallel computing: Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). Some works in the literature have presented a computational performance... more

Research based on RNA secondary structure has been conducted for decades. Accurately finding minimal free energy (MFE) of the secondary structure of RNA is one of the prime research areas. There are different approaches to find the MFE,... more

By Rifat Hossain, Ishtiak Morshed, Md Asif Rezwan Shishir, and Tomal Mahdi
Topics: Information Systems, Bioinformatics, Computer Science, Information Technology

Graphics Processing Units (GPUs) are becoming popular accelerators in modern High-Performance Computing (HPC) clusters. Installing GPUs on each node of the cluster is not efficient, resulting in high costs and power consumption as well as... more

By Javier Prades, Blesson Varghese, and Carlos Reano
Topics: Risk Management and Insurance, Computer Science, Parallel Algorithms, Distributed Computing

Models are useful to represent abstractions of software and hardware processes. The Bulk Synchronous Parallel (BSP) is a bridging model for parallel computation that allows algorithmic analysis of programs on parallel computers using... more

Solving banded linear systems is an important computational kernel in many applications. Over the years, a variety of algorithms and modules in numerical libraries for this task have been proposed that are suitable for computer systems... more
