Simulation of quantum computers on classical computers is a hard problem with high memory and computational requirements. Parallelization can alleviate this problem, allowing more qubits to be simulated at once or the same number of qubits to be simulated in less time. A promising approach is to exploit the high-performance computing capabilities provided by the latest graphics processing units. In this paper we present a parallel implementation of the QC-lib quantum computer simulator on the GPU using the CUDA programming model. The proposed scheme for partitioning the terms that describe the state of a quantum register takes advantage of the specific characteristics of the CUDA memory spaces and allows for an efficient parallelization of the general single-qubit operator. Experimental results indicate that very good speed-ups can be obtained relative to the sequential implementation.
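The general single-qubit update that the abstract describes can be sketched as follows. This is a minimal illustration in plain Python, not the QC-lib/CUDA code: each pair of basis-state indices differing only in the target bit is updated independently, which is exactly the property that lets the work be partitioned across GPU threads.

```python
def apply_single_qubit_gate(state, k, u):
    """state: list of 2**n complex amplitudes; u: 2x2 gate [[a, b], [c, d]]."""
    n = len(state)
    out = state[:]
    step = 1 << k
    for i0 in range(n):
        if i0 & step:          # visit each pair once, from its low member
            continue
        i1 = i0 | step
        a0, a1 = state[i0], state[i1]
        out[i0] = u[0][0] * a0 + u[0][1] * a1
        out[i1] = u[1][0] * a0 + u[1][1] * a1
    return out

# Example: Hadamard gate on qubit 0 of |00>
h = 0.5 ** 0.5
state = apply_single_qubit_gate([1, 0, 0, 0], 0, [[h, h], [h, -h]])
```

In a CUDA version, the loop over index pairs becomes the thread grid; the paper's contribution is how the amplitude array is partitioned over the GPU memory spaces.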
Computational epidemiology is the development and use of computational models that aim to understand the spread of diseases from a dynamic point of view. Such models can simulate the behavior of an epidemic and its effects on the population of a region, and support the development of strategies for disease control and prevention. Computational epidemiology proposes models that classify individuals as susceptible, infected, and recovered (SIR), as well as individual-based models (IBM). The SIR model assumes a homogeneous distribution of individuals in space and time and cannot predict the persistence or eradication of infectious diseases. The IBM reproduces the premises of the SIR model at the level of individual analysis. The computational cost of simulating the behavior of an epidemic grows with the size of the population considered. This work implements an IBM using parallel programming on GPUs (Graphics Processing Units) compatible with CUDA (Compute Unified Device Architecture). The goal is to gain performance compared to the traditional implementation. The results show that in some cases the reduction in computational time is around 170%, and that the growth rate with increasing population size is, in some cases, two times lower in the parallel implementation than in the sequential one.
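The compartmental SIR model that the abstract contrasts with the individual-based model can be sketched in a few lines. This is the textbook discrete-time formulation with hypothetical parameter values, not the paper's implementation: beta is the transmission rate and gamma the recovery rate.

```python
def sir_step(s, i, r, beta, gamma, dt=1.0):
    """One discrete-time step of the SIR compartment model."""
    n = s + i + r
    new_inf = beta * s * i / n * dt   # susceptibles becoming infected
    new_rec = gamma * i * dt          # infected becoming recovered
    return s - new_inf, i + new_inf - new_rec, r + new_rec

# Illustrative run: 1000 individuals, 10 initially infected
s, i, r = 990.0, 10.0, 0.0
for _ in range(100):
    s, i, r = sir_step(s, i, r, beta=0.3, gamma=0.1)
```

The IBM replaces these aggregate flows with per-individual state transitions, which is why its cost grows with population size and why a one-thread-per-individual GPU mapping pays off.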
CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by Nvidia which provides the ability to use GPUs to run computationally intensive programs. This presentation provides a brief overview of CUDA, parallelism on the GPU, and the syntax needed to run programs on the GPU.
Traditionally, student attendance has been a major concern for colleges, and faculty have to spend a considerable part of their lectures taking attendance manually. In this paper, we introduce a new way of attendance monitoring that makes use of the smartphones available to teachers. We suggest the use of the YOLO algorithm for face detection and a Siamese network for face recognition. This system automatically marks the attendance of students, thereby saving time and effort for faculty. The designed system is expected to be efficient and reliable, as Siamese networks have proven to achieve high accuracy in face recognition.
It is important for users and application developers to obtain the results of methods used in solving scientific and engineering problems rapidly. Parallel programming techniques have been developed alongside serial programming because the importance of performance in computer applications has been increasing day by day. Various methods such as the Gauss Elimination (GE) method, the Gauss-Jordan Elimination (GJE) method, and the Thomas method have been used to solve linear equation systems (LES). In this study, a performance comparison is made between Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) for n×n matrices via the GJE method. The GJE method is a variant of GE used for solving linear systems (Ax=B). Each step of the GJE solution algorithm is independent of the others, making the method well suited to parallel computing structures; therefore, it was chosen for this study. An application coded in the C programming language was developed using OpenMP and CUDA. OpenMP is an application program interface that allows parallel programming using compiler directives on the Central Processing Unit (CPU). CUDA is NVIDIA's parallel computing architecture, which enables significant increases in computing performance through the Graphics Processing Unit (GPU). The application was run on an Intel Core 2 Quad Q8200 2.33 GHz processor with a GeForce 9500 GT graphics card. It is observed that the application using the grid-block-thread structure and optimized with CUDA displays higher performance than OpenMP in terms of execution time.
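The independence the abstract attributes to GJE steps can be seen in a serial sketch. This is a plain-Python illustration (without the pivoting a production solver would need), not the paper's C/OpenMP/CUDA code: within each pivot step, every non-pivot row is updated independently, so those updates can be distributed over threads.

```python
def gauss_jordan(a, b):
    """Solve Ax = b by Gauss-Jordan elimination (no pivoting, for clarity).
    a: n x n list of lists, b: length-n list."""
    n = len(a)
    m = [row[:] + [bv] for row, bv in zip(a, b)]   # augmented matrix [A | b]
    for p in range(n):
        piv = m[p][p]
        m[p] = [v / piv for v in m[p]]             # normalise pivot row
        for r in range(n):                          # these row updates are
            if r != p:                              # mutually independent
                f = m[r][p]
                m[r] = [rv - f * pv for rv, pv in zip(m[r], m[p])]
    return [row[n] for row in m]

x = gauss_jordan([[2.0, 1.0], [1.0, 3.0]], [5.0, 10.0])
```

In the OpenMP version the inner row loop becomes a parallel for; in CUDA each row (or each element) maps to a thread in the grid-block-thread hierarchy.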
We propose an implementation of quad-tree based solid object coloring using Compute Unified Device Architecture (CUDA). There are numerous techniques in use for solid object coloring. One commonly used technique is the quad-tree, which has evolved from work in different fields. A quad-tree is a tree data structure in which each internal node has exactly four children. The quad-tree broadly follows the tree data structure commonly used in computer science: a normal tree looks like an upside-down tree, where a parent node at the top has one or more child nodes connected to it. The aim of this study is the coloring of a solid object using a screen-splitting method. With this method the screen is divided into squares, and each part is searched to determine whether one or more points of the object fall within it. According to the points found, the algorithm is applied and the object is colored by reducing the pixel size. We implemented our algorithm using Graphics Processing Unit (GPU) computing and compared its performance with a CPU implementation. The Nvidia CUDA library was used for the GPU computing. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. We tried our study on different systems with different GPUs and CPUs, and the computations were also evaluated for different solid objects. When we compared the results obtained from both systems, better performance was obtained with GPU computing. According to the results, the GPU computation ran approximately 20 times faster than the CPU computation.
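The screen-splitting step can be sketched as a recursive subdivision. This is an illustrative Python sketch, not the paper's CUDA code; the leaf labels and the point-set representation are assumptions made for the example. A square is split into four children until it contains no object points or reaches pixel size.

```python
def build_quadtree(points, x, y, size, min_size=1):
    """points: iterable of (px, py) object pixels; returns 'empty', 'filled',
    or a 4-tuple of child quadrants (NW, NE, SW, SE order)."""
    inside = [(px, py) for px, py in points
              if x <= px < x + size and y <= py < y + size]
    if not inside:
        return "empty"                 # no object point in this square
    if size <= min_size:
        return "filled"                # pixel-sized square containing the object
    h = size // 2
    return (build_quadtree(inside, x, y, h, min_size),
            build_quadtree(inside, x + h, y, h, min_size),
            build_quadtree(inside, x, y + h, h, min_size),
            build_quadtree(inside, x + h, y + h, h, min_size))

tree = build_quadtree({(0, 0)}, 0, 0, 4)
```

On the GPU, the four-way splits at each level are independent, so sibling squares can be tested and colored by separate threads.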
We present a fast and accurate 3D hand tracking method which relies on RGB-D data. The method follows a model-based approach, using a hierarchical particle filter variant to track the model's state. The filter estimates the probability density function of the state's posterior. As such, it has increased robustness to observation noise and compares favourably to existing methods that can be trapped in local minima, resulting in track losses. The data likelihood term is calculated by measuring the discrepancy between the rendered 3D model and the observations. Extensive experiments with real and simulated data show that hand tracking is achieved at a frame rate of 90 fps with less than 10 mm average error using a GPU implementation, thus comparing favourably to the state of the art in terms of both speed and tracking accuracy.
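The generic weight-and-resample loop underlying a particle filter can be sketched as follows. This uses a made-up 1D state and likelihood purely for illustration; in the tracker above, the likelihood would instead score a rendered 3D hand model against the RGB-D frame, which is the expensive, GPU-accelerated part.

```python
import random

def particle_filter_step(particles, observation, noise=0.1):
    """One predict-weight-resample cycle on a toy 1D state."""
    random.seed(0)  # fixed seed so the sketch is deterministic
    # 1. Diffuse particles (motion model)
    moved = [p + random.gauss(0.0, noise) for p in particles]
    # 2. Weight each particle by how well it explains the observation
    weights = [1.0 / (1e-9 + abs(observation - p)) for p in moved]
    # 3. Resample in proportion to weight
    total = sum(weights)
    return random.choices(moved, weights=[w / total for w in weights],
                          k=len(particles))

particles = particle_filter_step([0.0, 0.5, 1.0], observation=1.0)
```

The per-particle steps (diffusion, likelihood evaluation) are independent, which is why evaluating all hypotheses in parallel on the GPU yields the reported frame rates.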
The analysis and the understanding of object manipulation scenarios based on computer vision techniques can be greatly facilitated if we can gain access to the full articulation of the manipulating hands and the 3D pose of the manipulated objects. Currently, there exist methods for tracking hands in interaction with objects whose 3D models are known. There are also methods that can reconstruct 3D models of objects that are partially observable in each frame of a sequence. However, to the best of our knowledge, no method can track hands in interaction with unknown objects. In this paper we propose such a method. Experimental results show that hand tracking can be achieved with an accuracy that is comparable to the one obtained by methods that assume knowledge of the object models. Additionally, as a by-product, the proposed method delivers accurate 3D models of the manipulated objects.
As the majority of compression algorithms are implemented for CPU architectures, the primary focus of our work was to exploit the opportunities of GPU parallelism in audio compression. This paper presents an implementation of the Apple Lossless Audio Codec (ALAC) algorithm using NVIDIA's Compute Unified Device Architecture (CUDA) framework. The core idea was to identify the areas where data parallelism could be applied and to execute the identified parallel components on the Single Instruction Multiple Thread (SIMT) model of CUDA. The dataset was retrieved from the European Broadcasting Union's Sound Quality Assessment Material (SQAM). Faster execution of the algorithm led to reduced execution time when applied to audio coding for large audio files. This paper also presents the reduction in power usage achieved by running the parallel components on the GPU. Experimental results reveal that we achieve about 80-90% speedup through CUDA on the identified components over the CPU implementation, while saving CPU power consumption.
This paper presents a computational performance comparison between several iterative methods used for solving linear systems. The goal is to show that parallel processing on a Graphics Processing Unit (GPU) can make the fast solution of linear equation systems feasible, so that complex and sparse problems can be solved in a short time. To validate the approach, a GPU was employed through NVIDIA's Compute Unified Device Architecture (CUDA), and the computational performance of a parallelized BiCGStab(2) was compared with the Jacobi, Gauss-Seidel, and BiCGStab iterative methods in the solution of linear systems of varying sizes. There was a significant acceleration in tests with the parallelized code, which increases considerably as the systems grow. The results show that applying parallel processing to a robust and efficient method such as BiCGStab(2) is often necessary for simulations to be performed with quality and in acceptable time.
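Among the methods compared, Jacobi is the simplest to parallelize, and a short sketch shows why. This is an illustrative plain-Python version, not the paper's code: each component update uses only values from the previous iterate, so all n updates per sweep can run simultaneously, one GPU thread per unknown.

```python
def jacobi(a, b, iters=100):
    """Jacobi iteration for Ax = b; converges for diagonally dominant A."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # Every component below reads only the previous x, so the n
        # updates in one sweep are independent of each other.
        x = [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
             for i in range(n)]
    return x

x = jacobi([[2.0, 1.0], [1.0, 2.0]], [3.0, 3.0])   # converges toward [1, 1]
```

Gauss-Seidel, by contrast, uses freshly updated components within a sweep, which serializes the update order and makes it harder to map onto the GPU; Krylov methods like BiCGStab(2) parallelize at the level of their matrix-vector products and dot products.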
Temporal data mining algorithms are becoming increasingly important in many application domains, including computational neuroscience and especially the analysis of spike train data. While application scientists have been able to readily gather multi-neuronal datasets, analysis capabilities have lagged behind, due both to a lack of powerful algorithms and to a lack of access to powerful hardware platforms. The advent of GPU architectures such as Nvidia's GTX 280 offers a cost-effective option to bring these capabilities to the neuroscientist's desktop. Rather than port existing algorithms onto this architecture, we advocate the need for algorithm transformation, i.e., rethinking the design of the algorithm in a way that need not strictly mirror its serial implementation. We present a novel implementation of a frequent episode discovery algorithm by revisiting "in-the-large" issues such as problem decomposition as well as "in-the-small" issues such as da...
In this paper, we seek a new method for designing an iris recognition system. In this method, Haar wavelet features are first extracted from iris images. The advantage of using these features is their high-speed extraction, as well as being unique to each iris. The back-propagation neural network (BPNN) is then used as a classifier. In this system, parallel BPNN algorithms implemented on GPUs with the aid of CUDA are used to speed up the learning process. Finally, the system performance and the speedup obtained over the serial version of the algorithm are presented.
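One level of the 1D Haar transform used for this kind of feature extraction can be sketched as below. This is the standard textbook decomposition, not the paper's extraction pipeline: pairwise averages give the approximation coefficients and pairwise differences the detail coefficients, and every pair is processed independently, which suits a thread-per-pair GPU mapping.

```python
def haar_level(signal):
    """One level of the orthonormal 1D Haar transform (even-length input)."""
    s = 0.5 ** 0.5   # normalisation factor 1/sqrt(2)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

approx, detail = haar_level([4.0, 2.0, 5.0, 5.0])
```

For 2D iris images the transform is applied along rows and then columns; repeating it on the approximation band yields the multi-level features fed to the BPNN classifier.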
This paper presents a comparison between two parallel architectures: Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). Some works in the literature have presented computational performance comparisons of the two architectures. However, no complete and recent paper clearly highlights which architecture can be considered the more efficient. Thus, the goal is to make a comparison at the level of hardware, software, technological trends, and ease of use, highlighting the one that may present the best performance in general. To this end, we describe the main works that have used at least one of the architectures. It was observed that the choice of OpenCL may seem more obvious because it targets heterogeneous systems. Nevertheless, it was concluded that CUDA, although usable only on NVIDIA graphics cards, has been a reference and has been more widely used recently.
This paper contains an overview of various parallelization techniques to improve the performance of existing data mining algorithms and make them capable of handling large amounts of data. There are a variety of techniques to achieve parallelization in the data mining field; in this paper a brief introduction to a few of the popular techniques is presented. The second part of this paper contains information regarding various data mining algorithms that have been proposed by various authors based on these techniques. The introduction provides various results corresponding to a survey.
This paper presents a comparison between two architectures for parallel computing: Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). Some works in the literature have presented computational performance comparisons of the two architectures. However, no complete and recent paper clearly highlights which architecture can be considered the more efficient. The goal of this work is to make a comparison at the hardware and software levels, in technological trends, and in ease of use, highlighting the one that may be the most cost-effective in general. To that end, we describe the main works that have used at least one of the architectures. It was observed that the choice of OpenCL may seem more obvious because it targets heterogeneous systems. Nevertheless, we concluded that CUDA, although used only in graphics cards from NVIDIA®, has been a reference and has been more widely used recently.
Preventing users from accessing adult videos while allowing them to access good educational videos and other materials through a campus-wide network is a big challenge for organizations. Most existing web filtering systems are based on textual content or link analysis. As a result, potential users cannot access qualitative and informative video content that is available online. Adult content detection in video based on motion features or skin detection requires significant computing power and time. The judgment to identify pornographic videos is made by processing every chunk of the video, consisting of a specific number of frames, sequentially one after another. This solution is not feasible in real time when the user has already started watching the video and a decision about blocking must be made within a few seconds. In this paper, we propose a model where the user is allowed to start watching any video; at the backend, a porn detection process using extracted video and image features runs on distributed nodes with multiple GPUs (Graphics Processing Units). The video is processed on a parallel and distributed platform in the shortest time, and the decision about filtering the video is made in real time. A track record of blocked content and websites is also cached; for every new video download, the cache is checked to prevent repetitive content analysis. On-the-fly blocking is feasible due to the latest GPU architectures, CUDA (Compute Unified Device Architecture), and CUDA-aware MPI (Message Passing Interface). It is possible to achieve coarse-grained as well as fine-grained parallelism: video chunks are processed in parallel on distributed nodes, and the porn detection algorithm applied to the frames of video chunks can also achieve parallelism using GPUs on a single node. This ultimately results in blocking pornographic videos on the fly while allowing educational and informative videos.
Based on the code of the CS267 assignment from the University of Berkeley, the goal is to evaluate the performance of this particle simulation algorithm in serial, OpenMP, and CUDA versions. The results were compared and conclusions were drawn.
RSA is one of the most popular public-key cryptography algorithms, mainly used for digital signatures and encryption/decryption. It is based on the mathematical problem of factoring very large integers, which is a compute-intensive process that takes a very long time, as well as considerable power, to perform. Several researchers around the world are working to increase the speed and decrease the power consumption of the RSA algorithm while keeping its security intact. One popular technique that can be used to enhance the performance of RSA is parallel programming. In this paper we present a survey of various parallel implementations of the RSA algorithm involving a variety of hardware and software approaches.
Abstract: Image retrieval tools can help people make productive use of digital image collections; likewise, it has become essential to find efficient techniques for retrieving these images. Most image processing algorithms are inherently parallel, so multithreaded processors are well suited to such applications. In large image databases, image processing takes a long time to run on a single-core processor due to the single-threaded execution of the algorithms. The GPU is more common in most image processing applications because of its multithreaded execution, programmability, and low cost. In this paper we implement color-moment and texture-based image retrieval (entropy, standard deviation, and local range) in parallel using the CUDA programming model to run on GPUs. These features are used to retrieve, from a database, images that are similar to a query image. We evaluated our image retrieval framework using recall, precision, and average precision measures. Experimental results showed that the parallel implementation led to an average speedup of 144.67× over the serial implementation when running on an NVIDIA GeForce GT610M GPU. Additionally, the average precision and average recall of the proposed method are 61.968% and 55%, respectively. Keywords: CBIR, Color moments, CUDA, GPU. Title: Acceleration of Image Retrieval System Using CUDA Based Parallel Computing On GPU. Authors: Shingade Suraj, Patil Ganesh, Bhamare Sameer, Kumar Saurabh. International Journal of Computer Science and Information Technology Research, ISSN 2348-1196 (print), ISSN 2348-120X (online), Research Publish Journals.
The 2010 MEMOCODE Hardware/Software Co-design Challenge is to implement a Deep Packet Inspection architecture called CANSCID (Combined Architecture for Stream Categorization and Intrusion Detection). In this short paper, we present the design details of our submission, which utilizes a Graphics Processing Unit (GPU) to accelerate the parallel regular expression matching. The target line rate of 500 Mbps is met on all of the 25 mandatory and 10 optional patterns. The design was developed using the NVIDIA CUDA framework and tested on a Tesla GPU.
Unlike traditional physical sports, Esport games are played on wholly digital platforms. As a consequence, there exist rich data (in-game, audio, and video) about the events that take place in matches. These data offer viable linguistic resources for generating comprehensible text descriptions of matches, which could be used as the basis of novel text-based spectator experiences. We present a study that investigates whether users perceive text generated by the NLG system as an accurate recap of highlight moments. We also explore how the generated text supported viewer understanding of highlight moments in two scenarios: i) text as an alternative way to spectate a match, instead of viewing the main broadcast; and ii) text as an additional information resource to be consumed while viewing the main broadcast. Our study provides insights into the implications of the presentation strategies for the use of text in recapping highlight moments to Dota 2 spectators.
In computer graphics, image interpolation is the process of resizing a digital image. Interpolation is a non-trivial process that involves trade-offs among efficiency, smoothness, and sharpness. With bitmap graphics, as the size of an image is enlarged, the pixels that form the image become increasingly visible, making the image appear "soft" if pixels are averaged, or jagged if not. Image interpolation methods, however, often suffer from high computational costs and unnatural texture interpolation. This work gives an overview of image interpolation, its uses, and its techniques, and presents an implementation of image interpolation on images.
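Bilinear interpolation, one of the standard resizing methods such a survey covers, can be sketched in a few lines. This is the textbook formulation, not the paper's implementation: each output value is a weighted average of the four nearest source pixels, which is what produces the "soft" appearance the abstract mentions.

```python
def bilinear(img, x, y):
    """img: 2D list of grayscale values; (x, y): fractional sample point."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(img[0]) - 1)   # clamp at the right/bottom edges
    y1 = min(y0 + 1, len(img) - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx   # blend along x, top row
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx   # blend along x, bottom row
    return top * (1 - fy) + bot * fy                  # blend along y

v = bilinear([[0.0, 10.0], [20.0, 30.0]], 0.5, 0.5)   # centre of the 4 pixels
```

Nearest-neighbour interpolation skips the blending (hence jagged edges), while bicubic methods blend sixteen neighbours for smoother but costlier results.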
Graphics Processing Units (GPUs) allow massively parallel applications to run, offloading computationally intensive work from the Central Processing Unit (CPU). However, GPUs have a limited amount of memory. In this paper, a trie compression algorithm for massively parallel pattern matching is presented, demonstrating 85% lower space requirements than the original highly efficient parallel failure-less Aho-Corasick, while sustaining over 22 Gbps throughput. The presented algorithm takes advantage of compressed row storage matrices as well as shared and texture memory on the GPU.
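The failure-less variant of Aho-Corasick that the abstract builds on can be sketched with a plain trie. This is an illustrative Python sketch, not the paper's compressed GPU layout; the dict-of-children node representation is an assumption for readability (the paper instead flattens the trie into compressed-row-storage arrays). The key idea is that without failure links, every text offset is matched independently, one GPU thread per offset.

```python
def build_trie(patterns):
    """Build a trie; returns (children-per-node, terminal flags)."""
    nodes = [{}]                 # node 0 is the root
    terminal = [False]
    for p in patterns:
        cur = 0
        for ch in p:
            if ch not in nodes[cur]:
                nodes.append({})
                terminal.append(False)
                nodes[cur][ch] = len(nodes) - 1
            cur = nodes[cur][ch]
        terminal[cur] = True
    return nodes, terminal

def match_at(nodes, terminal, text, start):
    """Failure-less matching from one offset; in the parallel version,
    each offset is handled by its own GPU thread."""
    cur = 0
    for ch in text[start:]:
        if terminal[cur]:
            return True
        if ch not in nodes[cur]:
            return False
        cur = nodes[cur][ch]
    return terminal[cur]

nodes, term = build_trie(["he", "she"])
hits = [i for i in range(len("ushers")) if match_at(nodes, term, "ushers", i)]
```

Flattening the children maps into CSR-style index/value arrays is what lets the trie live in the GPU's limited memory as plain buffers, which is the compression the paper quantifies.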
The problem of variable selection is the selection of the attributes of a given sample that best contribute to the prediction of the property of interest. Traditional algorithms such as the Successive Projections Algorithm (APS) have been widely used for variable selection in multivariate calibration problems. Among bio-inspired algorithms, the Firefly Algorithm (AF) is a newly proposed method with potential application in several real-world problems, such as the variable selection problem. The main drawback of these tasks lies in their computational burden, which grows with the number of available variables. Recent improvements in Graphics Processing Units (GPUs) provide a powerful processing platform for these algorithms, and the use of GPUs often becomes necessary to reduce their computation time. In this context, this work proposes a GPU-based AF (AF-RLM) for variable selection using multiple linear regression models (RLM). Furthermore, we present two APS implementations, one using RLM (APS-RLM) and the other using sequential regressions (APS-RS). These implementations are aimed at improving the computational efficiency of the algorithms. The advantages of the parallel implementations are demonstrated in an example involving a large number of variables, in which gains in speedup were obtained. Additionally, we compare AF-RLM with APS-RLM and APS-RS. Based on the results obtained, we show that AF-RLM may be a relevant contribution to the variable selection problem.
This paper proposes a partial parallelization of the Successive Projections Algorithm (SPA), a variable selection technique designed for use with multiple linear regression. The implementation is aimed at improving the computational efficiency of SPA without changing the outcome of the algorithm. For this purpose, a new strategy for inverse matrix calculation is employed. The advantage of the proposed implementation is demonstrated in an example involving large matrices, in which gains in speedup were obtained.
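The core operation of SPA can be sketched as follows. This is a minimal plain-Python illustration of the projection step only, not the paper's implementation or its inverse-matrix strategy: at each iteration, every remaining candidate column is projected onto the subspace orthogonal to the most recently selected column, and these per-column projections are independent, which is what a parallel version exploits.

```python
def project_out(columns, selected):
    """Remove from each candidate column its component along `selected`.
    columns: list of equal-length lists; selected: the chosen column."""
    ss = sum(v * v for v in selected)          # squared norm of selected column
    out = []
    for col in columns:                         # each projection is independent
        coef = sum(c * s for c, s in zip(col, selected)) / ss
        out.append([c - coef * s for c, s in zip(col, selected)])
    return out

cols = project_out([[1.0, 1.0], [2.0, 0.0]], [1.0, 0.0])
# Each resulting column is orthogonal to [1, 0]
```

SPA then picks the projected column with the largest norm and repeats, so the loop over candidate columns is the natural unit of parallel work.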