Abstract. Many image processing algorithms rely on the nearest neighbor (NN) or the k-nearest-neighbor (kNN) search problem. Several methods have been proposed to reduce the computation time, for instance using space partitioning. However, these methods are very slow in ...
Solving a linear system with thousands to tens of thousands of unknowns takes a very long time in serial fashion. Furthermore, linear systems discretised from Partial Differential Equations (PDEs) are typically solved by iterative methods. By parallelising the Jacobi iterative method using OpenACC, we review and compare OpenACC's capabilities in accelerating the method through a compiler-directives approach, as opposed to CUDA's approach. We implemented OpenACC in two distinct domains: a manycore environment with an Nvidia GeForce GTX 980 as testbed hardware, and a multicore environment in which 4 AMD Opteron 6272 CPUs are clustered on a single machine for a total of 64 cores. This research project shows the great potential of the Jacobi iterative method implemented with OpenACC, yielding very rewarding results: the highest speedup is up to 82x on the GPU with Unified Memory (UM) enabled and 55x on the CPU with all 64 cores fully utilized, for 25,000 and 2,500 unknowns respectively.
Counting cycles in a graph is an NP-complete problem. This work minimizes the execution time to solve the problem compared with traditional serial, CPU-based approaches, and reduces the hardware resources needed to a single commodity GPU. We developed an algorithm to approximate the number of cycles in an undirected graph, utilizing a modern parallel computing paradigm, CUDA (Compute Unified Device Architecture) from NVIDIA, and the capabilities of the massively parallel multi-threaded processor known as the Graphics Processing Unit (GPU). The algorithm views the graph from a combinatorial perspective rather than the traditional adjacency matrix/list view. Its design philosophy is that each thread performs simple computation procedures in finite loop iterations executed in polynomial time. The algorithm is divided into two stages: the first extracts a unique number of vertex combinations for a given cycle length using combinatorial formulas and then examines whether a given combination can form a cycle; the second approximates the number of exchanges (swaps) between vertices for each thread to check the possibility of cycle existence. An experiment was conducted to compare the results of the proposed algorithm against a distributed serial algorithm based on Donald Johnson's backtracking algorithm.
Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks are major threats to the internet and serious cyber-crimes. The rapid increase in the number of people using the internet and the development of technology have given birth to new viruses and worms that can exploit our systems, and attackers use the latest techniques to perform DoS attacks. Numerous tools can launch a DoS attack from millions of compromised systems and bring down any system or network in a short period of time. Well-known countermeasures exist, such as puzzle-based defense mechanisms; however, an attacker can inflate their ability to solve the puzzles by using cheap and widely available GPUs. Software puzzles are effective against such resource-inflated attacks, but on their own they lack a strategy to avoid unfair delays during slower traffic. In this work, we propose a simple firewall architecture to prevent DoS attacks and GPU-inflated DoS attacks. The proposed architecture has three stages: (i) attack detection, (ii) rate limiting, and (iii) software puzzles. The traffic is analyzed, and under heavy traffic or an attack the rate-limiting technique is executed and a software puzzle is given to every request, preventing users from employing GPUs to solve the puzzles.
This paper presents a comparison between two architectures for parallel computing: Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). Some works in the literature have compared the computational performance of the two architectures, but no complete and recent paper clearly establishes which can be considered the more efficient. The goal of this work is to compare them at the hardware and software levels, in technological trends, and in ease of use, highlighting the one that may offer the best cost-effectiveness in general. To this end, we describe the main works that have used at least one of the architectures. We observed that choosing OpenCL may seem more obvious because it targets heterogeneous systems; nevertheless, we conclude that CUDA, although usable only on NVIDIA® graphics cards, has been the reference and the more widely used recently.
GPUs provide impressive computing power, but GPU programming can be challenging. Here, an experience in porting real-world earthquake code to Nvidia GPUs is described. Specifically, an annotation-based programming model, called Mint, and its accompanying source-to-source translator are used to automatically generate CUDA source code and simplify the exploration of performance tradeoffs.
Abstract. We introduce a real-time image processing technique using modern programmable Graphics Processing Units (GPUs) in this paper. The GPU is a SIMD (Single Instruction, Multiple Data) device that is inherently data-parallel. By utilizing NVIDIA's new GPU Programming ...
Genetic Programming is very efficient at problem solving compared to other approaches, but its performance becomes very slow as the size of the data increases. This paper proposes a model for multi-threaded Genetic Programming classification evaluation that uses the NVIDIA CUDA GPU programming model to parallelize the evaluation phase and reduce computational time. Three well-known Genetic Programming classification algorithms are evaluated using the proposed parallel evaluation model. Experimental results on UCI Machine Learning data sets compare the performance of the three classification algorithms in single- and multi-threaded Java, C, and CUDA GPU code. Results show that our proposal is much more efficient.
Graphical Processing Unit (GPU) programming languages are used extensively for general-purpose computations. However, GPU programming languages are at a level of abstraction suitable only for expert parallel programmers. This paper presents a new approach through which C or Java programmers can access these languages without having to focus on the technical or language-specific details. A prototype of the approach, named CUDACL, is introduced, through which a programmer can specify one or more parallel blocks in a file and execute them on a GPU. CUDACL also helps the programmer make CUDA or OpenCL kernel calls inside an existing program. Two scenarios have been successfully implemented to assess the usability and potential of the tool, which was created based on a detailed analysis of CUDA and OpenCL programs. Our evaluation of CUDACL against similar approaches shows its efficiency and effectiveness.
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems, but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of the underlying GPU architecture to users. Our method significantly improves the performance of applications with heavily imbalanced workloads and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning performance. Our evaluation reveals that our method achieves up to a 9x speedup over previous GPU algorithms and 12x over single-threaded CPU execution on irregular graphs...