Abstract
General-purpose computing on graphics processing units (GPGPU) has been adopted to accelerate applications with long execution times in various problem domains. Tabu Search, a meta-heuristic optimization method, has been used to find suboptimal solutions to NP-hard problems within a reasonable time. In this paper, we investigate how to improve the performance of the Tabu Search algorithm on GPGPU, taking the permutation flow shop scheduling problem (PFSP) as the example for our study. In a recently proposed approach for solving PFSP by Tabu Search on GPU, all the job permutations are stored in global memory to eliminate branch divergence. Nevertheless, that algorithm requires a large amount of global memory space and performs many global memory accesses, which degrades system performance. We propose a new approach to address this problem. The main contribution of this paper is an efficient multiple-loop structure that generates most of each permutation on the fly, which decreases the size of the permutation table and significantly reduces the number of global memory accesses. Computational experiments on problems conforming to a benchmark suite for PFSP show that the best performance improvement of our approach is about 100% compared with the previous work.
Acknowledgements
We would like to thank the reviewers for their valuable comments and the National Science Council, Taiwan, for financially supporting this research under Contract No. MOST104-2221-E-018-007.
Appendix
This appendix argues that the maximum number of block areas is five for each permutation segment table.
The permutation table is constructed as follows. Each child permutation is generated by swapping two positions of the parent permutation, giving \(C_2^n = n(n-1)/2\) child permutations in total. These child permutations are placed into the permutation table in the order defined in Table 4.
The two indices, “From” and “To”, indicate the two positions of the parent permutation that are swapped to produce one child permutation, where \(1\le From\le n-1\) and \(2\le To\le n\). Note that “From” is smaller than “To” for every child permutation.
The permutation table can be divided into \((n-1)\) groups from left to right, where each group has the same “From” value. For instance, in the 7\(^{\mathrm{th}}\) group, the “From” index of each child permutation is 7. We illustrate the conceptual overview of the s\(^{\mathrm{th}}\) group in Fig. 19. There are exactly two shaded cells in each column, representing the two swapped positions.
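The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that the ordering of Table 4 is lexicographic in (From, To), which is consistent with the left-to-right grouping by “From” but is not confirmed here. The function names `swap_pairs` and `permutation_table` are hypothetical.

```python
from itertools import combinations


def swap_pairs(n):
    """Return all (From, To) pairs with 1 <= From < To <= n.

    Assumption: columns are ordered lexicographically by (From, To),
    so that all pairs sharing a "From" value are adjacent.
    """
    return list(combinations(range(1, n + 1), 2))


def permutation_table(parent):
    """Build one child permutation per column by swapping two positions."""
    table = []
    for frm, to in swap_pairs(len(parent)):
        child = list(parent)
        # Positions are 1-based in the text and 0-based in Python lists.
        child[frm - 1], child[to - 1] = child[to - 1], child[frm - 1]
        table.append(child)
    return table


parent = [1, 2, 3, 4, 5]
pairs = swap_pairs(len(parent))
table = permutation_table(parent)

# Number of columns in each of the (n - 1) groups, indexed by "From".
group_sizes = [sum(1 for frm, _ in pairs if frm == s)
               for s in range(1, len(parent))]

print(len(table))    # 10, i.e. n(n-1)/2 for n = 5
print(group_sizes)   # [4, 3, 2, 1]
```

Under this assumed ordering, the s\(^{\mathrm{th}}\) group holds \(n-s\) columns, so the groups shrink from left to right.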
The permutation table is then divided into segment tables, column by column from left to right. Each permutation segment table consists of 32 consecutive columns because the size of one warp is 32.
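The segmentation into 32-column tables can be illustrated as follows. This is a sketch under the same assumption of lexicographic (From, To) column order. The helper name `segment_groups` and the choice of \(n=20\) are hypothetical and serve only to show how one segment may fall within a single group or span several.

```python
from itertools import combinations

WARP_SIZE = 32  # columns per permutation segment table, as stated in the text


def segment_groups(n, warp_size=WARP_SIZE):
    """For each segment, list the distinct "From" values (groups) it covers."""
    # Assumed column order: lexicographic in (From, To).
    pairs = list(combinations(range(1, n + 1), 2))
    segments = []
    for start in range(0, len(pairs), warp_size):
        chunk = pairs[start:start + warp_size]  # last chunk may be shorter
        segments.append(sorted({frm for frm, _ in chunk}))
    return segments


segments = segment_groups(20)
print(len(segments))   # 6 segments cover the 190 columns
print(segments[0])     # [1, 2]
print(segments[-1])
```

With \(n=20\) the first segment covers all 19 columns of group 1 plus the first 13 columns of group 2, while the last segment is only partially filled and spans many of the short groups at the right end of the table.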
First, assume that one permutation segment table (PST) falls within only one group of child permutations. There are three cases, as shown in Fig. 20. Case 1 arises when the PST begins at the first column of the group; there are three block areas (BAs). Case 2 arises when the PST ends at the final column of the group; in general, there are five BAs. Case 3 covers the remaining cases, with four BAs. However, if the PST in Case 2 coincides with the s\(^{\mathrm{th}}\) group, there are only two BAs, because BA3 and BA5 do not exist and BA4 is merged into BA2.
Next, consider the cases in which one PST spans multiple groups of child permutations, as shown in Fig. 21. Case 5 contains the last columns of the s\(^{\mathrm{th}}\) group and the first column of the (s+1)\(^{\mathrm{th}}\) group, where several rows between Row s and Row n each contain the same value in every column of that row. There are four BAs in Case 5. Case 4 shows the general case in which one PST comprises multiple groups. If more than two groups form a PST, the rows with distinct data are merged into BA2, resulting in two BAs in total.
According to the above analysis, the maximum number of BAs is five.
Cite this article
Wei, KC., Sun, X., Chu, H. et al. Reconstructing permutation table to improve the Tabu Search for the PFSP on GPU. J Supercomput 73, 4711–4738 (2017). https://doi.org/10.1007/s11227-017-2041-7
DOI: https://doi.org/10.1007/s11227-017-2041-7