ISSN: 2250-3676
Volume 2, Issue 1, pp. 92-100
1 Department of CSE, Lakireddy Bali Reddy College of Engineering, Mylavaram, India, krishna4474@gmail.com; 2 Director, Lakireddy Bali Reddy College of Engineering, Mylavaram, India
Abstract
Particle Swarm Optimization (PSO) is a simple but powerful optimization algorithm based on the social behavior of particles. PSO has become popular due to its simplicity and its effectiveness across a wide range of applications at low computational cost. The main objective of this paper is to implement parallel asynchronous and synchronous versions of PSO on the Graphics Processing Unit (GPU) and to compare their performance, in terms of execution time and speedup, with their sequential counterparts. We also present the implementation details and performance observations of the parallel PSO algorithms on the GPU using the Compute Unified Device Architecture (CUDA), a software platform from nVIDIA. We observed that the asynchronous version of the algorithm outperforms the other versions.
2. PSO OVERVIEW
This section presents four versions of the PSO algorithm along with their advantages and disadvantages. The basics of PSO are presented first, followed by the standard PSO (SPSO), and then the synchronous and asynchronous PSO algorithms.
V. KRISHNA REDDY* et al. [IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY
updating; it is chosen from among its left neighbor, its right neighbor and itself. We call this a neighborhood, or ring, topology, as shown in Figure 2.1(b) (assuming the swarm has a population of 12).

2) Inertia Weight and Constriction: In PSO, the inertia weight parameter w was designed to regulate the influence of the previous particle velocities on the optimization process. By adjusting the value of w, the swarm has a greater tendency to eventually constrict itself down to the area containing the best fitness and to explore that area in detail. Similar to the parameter w, SPSO introduced a new parameter, the constriction factor $\chi$, which is derived from the existing constants in the velocity update equation:

$\chi = \dfrac{2}{\left| 2 - \varphi - \sqrt{\varphi^2 - 4\varphi} \right|}$, where $\varphi = c_1 + c_2$, $\varphi > 4$,

and the velocity update formula in SPSO becomes:

$v_{id}(t+1) = \chi \left( v_{id}(t) + c_1 r_1 (p_{id}(t) - x_{id}(t)) + c_2 r_2 (l_{id}(t) - x_{id}(t)) \right)$ (3)

where $l_{id}$ is no longer the global best but the local best. Statistical tests have shown that, compared to PSO, SPSO returns better results while retaining the simplicity of PSO. The introduction of SPSO gives researchers a common ground to work from; SPSO can be used as a baseline of comparison for future developments and enhancements of PSO.
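As a minimal plain-Python illustration of the two ideas above (not the paper's CUDA code; the function names and the default $c_1 = c_2 = 2.05$ are our assumptions, chosen because they satisfy $\varphi > 4$):

```python
import math

def constriction_factor(c1=2.05, c2=2.05):
    """Clerc's constriction factor: chi = 2 / |2 - phi - sqrt(phi^2 - 4*phi)|,
    defined for phi = c1 + c2 > 4. Defaults are a common choice from the
    literature, not values taken from this paper."""
    phi = c1 + c2
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

def ring_local_best(fitness):
    """For each particle i, return the index of the best (lowest-fitness)
    particle among its left neighbour, itself and its right neighbour,
    i.e. the ring topology of Figure 2.1(b)."""
    n = len(fitness)
    return [min([(i - 1) % n, i, (i + 1) % n], key=lambda j: fitness[j])
            for i in range(n)]
```

With $c_1 = c_2 = 2.05$, the factor evaluates to roughly 0.7298, the value commonly seen in SPSO implementations.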
$v_{id}(t+1) = w\, v_{id}(t) + c_1 r_1 (p_{id}(t) - x_{id}(t)) + c_2 r_2 (g_{d}(t) - x_{id}(t))$ (1)

$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1)$ (2)
where i = 1, 2, ..., N, with N the number of particles in the swarm (the population), and d = 1, 2, ..., D, with D the dimension of the solution space. In Equations (1) and (2), the learning factors $c_1$ and $c_2$ are nonnegative constants, $r_1$ and $r_2$ are random numbers uniformly distributed in the interval [0, 1], and $v_{id} \in [-V_{max}, V_{max}]$, where $V_{max}$ is a chosen maximum velocity, a constant preset according to the objective function. If the velocity on one dimension exceeds the maximum, it is clamped to $V_{max}$. This parameter controls the convergence rate of the PSO and can prevent the method from diverging too quickly. The parameter w is the inertia weight, a constant in the interval [0, 1] used to balance the global and local search abilities. $x_{id}(t+1)$ is the position of the particle at time t+1, $p_{id}(t)$ is the best-fitness position reached by the particle up to time t (also termed the personal attractor), and $g_d(t)$ is the best-fitness point ever found by the whole swarm (the social attractor). Despite its simplicity, PSO is known to be quite sensitive to the choice of its parameters. Under certain conditions, though, it can be proved that the swarm reaches a state of equilibrium, where particles converge onto a weighted average of their personal best and global best positions.
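As an illustration only (not the paper's implementation; the parameter defaults echo values used later in the experiments), Equations (1) and (2) for a single particle can be written as:

```python
import random

def pso_update(x, v, pbest, gbest, w=0.729, c1=2.0, c2=2.0, vmax=10.0):
    """One PSO step for one particle: Eq. (1) updates each velocity
    component, clamped to [-vmax, vmax], then Eq. (2) updates the position.
    x, v, pbest are lists over dimensions; gbest is the swarm-best position."""
    for d in range(len(x)):
        r1, r2 = random.random(), random.random()
        v[d] = (w * v[d]
                + c1 * r1 * (pbest[d] - x[d])
                + c2 * r2 * (gbest[d] - x[d]))     # Eq. (1)
        v[d] = max(-vmax, min(vmax, v[d]))          # velocity clamping by Vmax
        x[d] = x[d] + v[d]                          # Eq. (2)
    return x, v
```

Note that when a particle already sits on both attractors, the stochastic terms vanish and only the inertia term remains, so the velocity simply decays by the factor w.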
The social attractor is only updated at the end of each generation, when the fitness values of all particles in the swarm are known. The sequence of steps for asynchronous PSO is shown in Figure 2.2.
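The difference between the two update policies can be sketched as follows (a simplified illustration of the attractor-update timing; the names and structure are ours, not the paper's code):

```python
def best_sync(swarm_x, fitness):
    """Synchronous PSO: every particle is evaluated first, and the social
    attractor is chosen only once, at the end of the generation."""
    scores = [fitness(x) for x in swarm_x]
    return min(range(len(scores)), key=scores.__getitem__)

def best_async(swarm_x, fitness):
    """Asynchronous PSO: the social attractor is refreshed right after each
    evaluation, so particle k already sees improvements found by 0..k-1."""
    best_i, best_f = None, float("inf")
    for i, x in enumerate(swarm_x):
        f = fitness(x)
        if f < best_f:
            best_i, best_f = i, f
        # particle i's velocity update would read (best_i, best_f) here,
        # mid-generation, rather than waiting for the full sweep
    return best_i
```

Both return the same attractor at the end of a generation; the difference is that in the asynchronous variant intermediate improvements are visible to particles evaluated later in the same sweep.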
not take full advantage of the GPU's power in evaluating the fitness function in parallel: the parallelization occurs only over the number of particles of a swarm and ignores the dimensions of the function. In our parallel implementations the thread parallelization is as fine-grained as possible [1]; in other words, all independent sequential parts of the code are allowed to run simultaneously in separate threads. However, the performance of an implementation does not depend only on the design choices, but also on the GPU architecture, the data access scheme and layout, and the programming model, which in this case is CUDA. Therefore, it seems appropriate to outline the CUDA architecture and introduce some of its terminology.
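To give the flavor of this fine-grained scheme without CUDA, the same particles-times-dimensions parallelism can be emulated with NumPy array operations (our sketch, not the paper's kernels; on a GPU the elementwise squares would map to individual threads and the sum to a per-block reduction):

```python
import numpy as np

def sphere_fitness_all(X):
    """Evaluate the Sphere function for the whole swarm at once.
    X has shape (N, D): the elementwise square is 'parallel' over both
    particles and dimensions; the axis-1 sum is the per-particle reduction."""
    return (X * X).sum(axis=1)
```

A coarse-grained scheme would instead loop over the N particles and evaluate each D-dimensional fitness serially, leaving the dimension-level parallelism unused.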
therefore access to global memory should be minimized within a kernel. The design and implementation issues of our algorithms are presented in the following sections.
In contrast to the synchronous version, all particle thread blocks must be executing simultaneously, i.e., no sequential scheduling of thread blocks to processing cores is employed, as there is no explicit point of synchronization of all particles. Two diagrams representing the parallel execution for both versions are shown in Figure 3.1. Having the swarm particles evolve independently not only makes the algorithm more biologically plausible, but also makes the swarm more reactive to newly discovered minima/maxima [1].

Figure 3.1: Asynchronous CUDA-PSO: particles run in parallel independently (top). Synchronous CUDA-PSO: particles synchronize at the end of each kernel (bottom).

The price to be paid is a limitation on the number of particles in a swarm, which must match the maximum number of thread blocks that a given GPU can keep executing in parallel. This is not such a relevant shortcoming, as one of PSO's nicest features is
its good search effectiveness; because of this, only a small number of particles (a few dozen) is usually enough for a swarm-based search to work, which compares very favorably with the number of individuals usually required by evolutionary algorithms to achieve good performance on high-dimensional problems. Also, parallel processing chips are currently scaling according to Moore's law, and GPUs are being equipped with more processing cores with every new model.
The following implementations of PSO have been compared:
1. The sequential synchronous SPSO version, modified to implement a two-nearest-neighbors ring topology (cpu-syn).
2. The sequential asynchronous PSO version, which uses a stochastic-star topology (cpu-asyn).
3. The synchronous three-kernel version of GPU-SPSO (gpu-syn).
4. The asynchronous one-kernel version of GPU-SPSO (gpu-asyn).
4. RESULTS
In this report, we compare the performance of the different parallel PSO implementations and the sequential PSO implementations on a classical benchmark comprising a set of functions often used to evaluate stochastic optimization algorithms. The goal was to compare the different parallel PSO implementations with one another and with the sequential implementations, in terms of execution time and speedup, while checking that the quality of results was not badly affected. All parameters of the algorithm were therefore kept equal in all tests, set to the standard values suggested in [7]: w = 0.729 and C1 = C2 = 2.000. Also, for the comparison to be as fair as possible, the SPSO was adapted by substituting its original stochastic-star topology with the same ring topology adopted in the parallel GPU-based versions. For the experiments, the parallel algorithms were developed using CUDA version 4.0. Tests were performed on a graphics card whose detailed specifications are given in Table 4.1. The sequential implementations were run on a PC powered by a 64-bit Intel(R) Core(TM) i3 CPU running at 2.27 GHz. Instead of transporting 2DN random numbers from CPU to GPU, we pass two random integers P1, P2 in [0, M - D*N] from CPU to GPU; then 2*D*N numbers can be drawn from array R starting at R(P1) and R(P2), respectively.

TABLE 4.1: MAJOR TECHNICAL FEATURES OF THE GPU USED FOR THE EXPERIMENTS.

Model Name: GeForce GTS250
GPU clock (GHz): 1.62
Streaming Multiprocessors: 16
CUDA cores: 128
Bus width (bit): 256
Memory (MB): 1024
Memory clock (MHz): 1000.0
CUDA compute capability: 1.1

TABLE 4.2: BENCHMARK TEST FUNCTIONS
Performance Metric:
Computational cost (C), also known as execution time, is defined as the processing time (in seconds) that the PSO algorithm consumes. Computational throughput (V) is defined as the inverse of the computational cost: $V = 1/C$. Speedup (S) measures the achieved execution-time improvement: it is a ratio that evaluates how fast the variant of interest is in comparison with the variant of reference:

$S = V_p / V_r$

where $V_p$ is the throughput of the parallel implementation under study and $V_r$ is the throughput of the reference implementation, i.e., the sequential implementation. The code was tested on the standard benchmark functions shown in Table 4.2, and the results are plotted in the graphs shown in Figure 4.1 (Sphere function), Figure 4.2 (Rastrigin function), Figure 4.3 (Rosenbrock function) and Figure 4.4 (Griewank function). For each function, the following comparisons have been made:
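The two metrics reduce to one-line formulas; as a small sketch (the function names are ours):

```python
def throughput(cost_s):
    """Computational throughput V = 1 / C, with C the execution time in seconds."""
    return 1.0 / cost_s

def speedup(parallel_cost_s, reference_cost_s):
    """S = V_parallel / V_reference, which equals T_reference / T_parallel."""
    return throughput(parallel_cost_s) / throughput(reference_cost_s)
```

For example, a parallel run of 2 s against a sequential run of 50 s yields a speedup of 25.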
Name | Equation | Bounds | Initial Bounds | Optimum
Sphere | $f(x) = \sum_{i=1}^{D} x_i^2$ | (-100, 100) | (50, 100) | $0.0^D$
Rastrigin | $f(x) = \sum_{i=1}^{D} [x_i^2 - 10\cos(2\pi x_i) + 10]$ | (-5.12, 5.12) | (2.56, 5.12) | $0.0^D$
Rosenbrock | $f(x) = \sum_{i=1}^{D-1} [100 (x_{i+1} - x_i^2)^2 + (x_i - 1)^2]$ | (-30, 30) | (15, 30) | $0.0^D$
Griewank | $f(x) = 1 + \sum_{i=1}^{D} x_i^2/4000 - \prod_{i=1}^{D} \cos(x_i/\sqrt{i})$ | (-600, 600) | (600, 600) | $0.0^D$
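The four benchmarks in Table 4.2 can be written directly from their standard definitions (plain-Python sketches for reference, not the paper's GPU code):

```python
import math

def sphere(x):
    return sum(xi * xi for xi in x)

def rastrigin(x):
    return sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) + 10.0
               for xi in x)

def rosenbrock(x):
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))

def griewank(x):
    s = sum(xi * xi for xi in x) / 4000.0
    p = math.prod(math.cos(xi / math.sqrt(i + 1)) for i, xi in enumerate(x))
    return 1.0 + s - p
```

All four reach their optimum value 0 at the origin, except Rosenbrock, whose optimum lies at (1, ..., 1).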
Figure 4.1: Execution time and speedup vs. swarm population size and problem dimensions for the Sphere function
Figure 4.2: Execution time and speedup vs. swarm population size and problem dimensions for the Rastrigin function
Figure 4.3: Execution time and speedup vs. swarm population size and problem dimensions for the Rosenbrock function
1. Keeping the problem dimension constant:
   a. Execution time vs. swarm population size
   b. Achieved speedup vs. swarm population size
2. Keeping the swarm population size constant:
   a. Execution time vs. problem dimension
   b. Achieved speedup vs. problem dimension
Figure 4.4: Execution time and speedup vs. swarm population size and problem dimensions for the Griewank function
In general, the asynchronous version was much faster than the synchronous version. The asynchronous version allows the social attractors to be updated immediately after evaluating each particle's fitness, which causes the swarm to move more promptly towards newly found optima. From Figures 4.1, 4.2, 4.3 and 4.4, it is clear that the GPU-asynchronous version takes less execution time than the others. The sequential algorithm shows unexpected behavior regarding execution time, which appears to be non-monotonic with problem dimension, exhibiting a surprising decrease as the problem dimension becomes larger. In fact, code optimization (or the hardware itself) might cause several multiplications to be set directly to zero without even being performed, as soon as the sum of the exponents of the two factors is below the precision threshold; a similar though opposite consideration can be made for additions and the sum of the exponents. One observation is that adding up terms all of comparable magnitude is much slower than adding the same number of terms on very different scales. It is also worth noticing that the execution-time graphs are virtually identical for the functions taken into consideration, which shows that GPUs are extremely effective at computing
arithmetic-intensive functions, mostly independently of the set of operators used, and that memory allocation issues are prevalent in determining performance. The asynchronous version of the GPU-SPSO algorithm was able to significantly reduce execution time with respect not only to the sequential versions but also to the synchronous version of GPU-SPSO. Depending on the degree of parallelization allowed by the fitness functions we considered, the asynchronous version of GPU-SPSO could reach speedups ranging from 25 (Rosenbrock, Griewank) to over 75 (Rastrigin) with respect to the sequential implementation, and often of more than one order of magnitude with respect to the corresponding GPU-based 3-kernel synchronous version. From the results, one can also notice that the best performances were obtained on the Rastrigin function. This is presumably a result of the presence of advanced math functions in its definition. In fact, GPUs have internal fast math functions which can provide good computation speed at the cost of slightly lower accuracy, which causes no problems in this case.
It was observed from the results that the asynchronous version of the GPU-SPSO algorithm was able to significantly reduce execution time with respect not only to the sequential versions but also to the synchronous version of GPU-SPSO. Depending on the degree of parallelization allowed by the fitness functions we considered, the asynchronous version of CUDA-PSO could reach speedups of up to about 75 (in the tests with the highest-dimensional Rastrigin functions) with respect to the sequential implementation, and often more than one order of magnitude with respect to the corresponding GPU-based 3-kernel synchronous version, sometimes showing a limited, possibly only apparent, decrease in search performance. Hence asynchronous GPU-SPSO is preferable for optimizing functions with more complex arithmetic, given the same swarm population, dimension and number of iterations; furthermore, a function with more complex arithmetic achieves a higher speedup. For problems with large swarm populations or higher dimensions, the asynchronous GPU-PSO will provide better performance. Since most display cards in current common PCs have GPU chips, more researchers can make use of this parallel asynchronous GPU-SPSO to solve their practical problems. Future work will include updating and extending this asynchronous GPU-SPSO to further applications of PSO and improving its performance. Other interesting developments may be offered by the availability of OpenCL, which will allow owners of GPUs other than nVIDIA's (as well as of multi-core CPUs, which are also supported) to implement parallel algorithms on their own computing architectures. The availability of shared code which allows for optimized code parallelization even on more traditional multi-core CPUs will make the comparison between GPU-based and multi-core CPU implementations easier (and, possibly, fairer), besides allowing for a possible optimized hybrid use of the computing resources in modern computers.
REFERENCES

[1] Mussi, L.; Nashed, Y.S.G.; Cagnoni, S., "GPU-based Asynchronous Particle Swarm Optimization", Proceedings of GECCO '11, 2011, pp. 1555-1562.
[2] Kennedy, J.; Eberhart, R., "Particle Swarm Optimization", IEEE International Conference on Neural Networks, Perth, WA, Australia, Nov. 1995, pp. 1942-1948.
[3] You Zhou; Ying Tan, "GPU-based Parallel Particle Swarm Optimization", IEEE Congress on Evolutionary Computation, 2009, pp. 1493-1500.
[4] nVIDIA Corporation, "nVIDIA CUDA Programming Guide 3.2", October 2010.
[5] Zhou, Y.; Tan, Y., "Particle swarm optimization with triggered mutation and its implementation based on GPU", Proceedings of the 12th Annual Genetic and Evolutionary Computation Conference, GECCO '10, 2010, pp. 1007-1014.
[6] Venter, G.; Sobieski, J., "A parallel particle swarm optimization algorithm accelerated by asynchronous evaluations", 6th World Congresses of Structural and Multidisciplinary Optimization, 2005.
[7] Bratton, D.; Kennedy, J., "Defining a Standard for Particle Swarm Optimization", IEEE Swarm Intelligence Symposium, April 2007, pp. 120-127.
[8] Wilkinson, B., "General Purpose Computing using GPUs: Developing a hands-on undergraduate course on CUDA programming", 42nd ACM Technical Symposium on Computer Science Education, SIGCSE 2010, Dallas, Texas, USA, March 9-12, 2010.