
Reducing branch divergence to speed up parallel execution of unit testing on GPUs

Published in The Journal of Supercomputing.

Abstract

Software testing is an essential phase of the software development life cycle. Unit testing is one of its most important forms, but executing unit tests is time-consuming and costly, so parallelizing test execution is beneficial and productive for programmers. GPUs are one platform for such parallelization. In GPU applications, threads execute in parallel in groups known as warps. Branch divergence degrades a warp's performance: when some threads take a branch, the remaining threads sit idle until the first set finishes executing. In this paper, we propose a novel algorithm that minimizes branch divergence when testing an application on a GPU. We arrange test inputs according to the warp size of the GPU machine, grouping test inputs with similar control flow paths into the same warp so that they execute in parallel with minimal divergence per warp. We validate and evaluate our algorithm on six benchmarks (57 programs in total). Our approach accelerates test execution by up to 3.8x and improves warp execution efficiency by up to 15x.
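The core idea in the abstract can be sketched as follows. This is a minimal, illustrative Python sketch, not the authors' implementation: the greedy packing strategy, the cosine-similarity metric over branch-outcome vectors, and the reduced warp size are all assumptions made for demonstration.

```python
# Illustrative sketch (NOT the paper's code): pack test inputs whose
# executions follow similar control-flow paths into the same warp, so
# fewer branches diverge within each warp. Each test input is described
# by a vector of branch outcomes (1 = taken, 0 = not taken).
from math import sqrt

WARP_SIZE = 4  # real NVIDIA warps hold 32 threads; 4 keeps the demo small


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def group_into_warps(path_vectors):
    """Greedily fill each warp with the inputs most similar to a seed input."""
    remaining = list(range(len(path_vectors)))
    warps = []
    while remaining:
        seed = remaining.pop(0)
        remaining.sort(
            key=lambda i: cosine_similarity(path_vectors[seed], path_vectors[i]),
            reverse=True,
        )
        warps.append([seed] + remaining[: WARP_SIZE - 1])
        remaining = remaining[WARP_SIZE - 1:]
    return warps


def warp_divergence(warp, path_vectors):
    """Count branches whose outcome differs across the threads of one warp."""
    return sum(
        1
        for branch in zip(*(path_vectors[i] for i in warp))
        if len(set(branch)) > 1
    )
```

With eight inputs that follow one of two control-flow paths, similarity grouping yields two divergence-free warps, whereas packing inputs in arrival order mixes the two paths and every branch diverges in both warps.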


Notes

  1. https://github.com/tbagies/GPU-BranchDivergence.



Acknowledgements

We would like to thank King Abdulaziz University, Jeddah, Saudi Arabia, for supporting this research through a Ph.D. scholarship.

Funding

The authors declare no funding.

Author information


Contributions

TB wrote the main manuscript text and prepared all figures and tables. AJ reviewed the manuscript.

Corresponding author

Correspondence to Taghreed Bagies.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Bagies, T., Le, W., Sheaffer, J. et al. Reducing branch divergence to speed up parallel execution of unit testing on GPUs. J Supercomput 79, 18340–18374 (2023). https://doi.org/10.1007/s11227-023-05375-0
