
Reducing branch divergence to speed up parallel execution of unit testing on GPUs

Published in The Journal of Supercomputing.

Abstract

Software testing is an essential phase of the software development life cycle. Unit testing is one of its most important forms, but executing unit tests is time-consuming and costly, so parallelizing test execution is beneficial and productive for programmers. GPUs are one platform for such parallelization. In GPU applications, threads execute in parallel in groups known as warps. Branch divergence degrades a warp's performance: when some threads take a branch, the remaining threads sit idle until the first set finishes executing. In this paper, we propose a novel algorithm that minimizes branch divergence when testing an application on a GPU. We arrange test inputs according to the warp size of the GPU machine, grouping test inputs with similar control flow paths into the same warp so that they execute in parallel with minimal divergence per warp. We validate and evaluate our algorithm on six benchmarks (57 programs in total). Our approach accelerates test execution by up to 3.8x and improves warp execution efficiency by up to 15x.
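The core idea in the abstract can be sketched as follows. This is a minimal, illustrative Python sketch, not the authors' implementation: the greedy packing strategy, the cosine-similarity metric over branch-outcome vectors, and the reduced warp size are all assumptions made for demonstration.

```python
# Illustrative sketch (NOT the paper's code): pack test inputs whose
# executions follow similar control-flow paths into the same warp, so
# fewer branches diverge within each warp. Each test input is described
# by a vector of branch outcomes (1 = taken, 0 = not taken).
from math import sqrt

WARP_SIZE = 4  # real NVIDIA warps hold 32 threads; 4 keeps the demo small


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def group_into_warps(path_vectors):
    """Greedily fill each warp with the inputs most similar to a seed input."""
    remaining = list(range(len(path_vectors)))
    warps = []
    while remaining:
        seed = remaining.pop(0)
        remaining.sort(
            key=lambda i: cosine_similarity(path_vectors[seed], path_vectors[i]),
            reverse=True,
        )
        warps.append([seed] + remaining[: WARP_SIZE - 1])
        remaining = remaining[WARP_SIZE - 1:]
    return warps


def warp_divergence(warp, path_vectors):
    """Count branches whose outcome differs across the threads of one warp."""
    return sum(
        1
        for branch in zip(*(path_vectors[i] for i in warp))
        if len(set(branch)) > 1
    )
```

With eight inputs that follow one of two control-flow paths, similarity grouping yields two divergence-free warps, whereas packing inputs in arrival order mixes the two paths and every branch diverges in both warps.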


Notes

  1. https://github.com/tbagies/GPU-BranchDivergence.



Acknowledgements

We would like to thank King Abdulaziz University, Jeddah, Saudi Arabia, for supporting this research through a Ph.D. scholarship.

Funding

The authors declare no funding.

Author information


Contributions

TB wrote the main manuscript text and prepared all figures and tables. AJ reviewed the manuscript.

Corresponding author

Correspondence to Taghreed Bagies.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Bagies, T., Le, W., Sheaffer, J. et al. Reducing branch divergence to speed up parallel execution of unit testing on GPUs. J Supercomput 79, 18340–18374 (2023). https://doi.org/10.1007/s11227-023-05375-0
