
Open Source Software Competition MM ’19, October 21–25, 2019, Nice, France

A Modern C++ Parallel Task Programming Library


Chun-Xun Lin∗, ECE Dept, UIUC, IL, clin99@illinois.edu
Tsung-Wei Huang∗, ECE Dept, University of Utah, UT, twh760812@gmail.com
Guannan Guo, ECE Dept, UIUC, IL, gguo4@illinois.edu
Martin D. F. Wong, ECE Dept, UIUC, IL, mdfwong@illinois.edu

∗Both authors contributed equally to this research.

ABSTRACT

In this paper we present Cpp-Taskflow, a C++ parallel programming library that enables users to quickly develop parallel applications using the task dependency graph model. Developers formulate their application as a task dependency graph and Cpp-Taskflow manages the task execution and concurrency control. The task graph model is expressive and composable: it can express both regular and irregular parallel patterns, and developers can quickly compose large programs from small parallel modules. Cpp-Taskflow has an intuitive and unified API set; users only need to learn the APIs to build and dispatch a task graph, and no complex parallel programming concepts are required. We have conducted experiments using both micro-benchmarks and real-world applications, and Cpp-Taskflow outperforms state-of-the-art parallel programming libraries in both runtime and coding effort. Cpp-Taskflow is open-source and has been used in both industrial and academic projects. Based on our users' feedback, we believe Cpp-Taskflow can greatly benefit the industry and research community through its ease of programming and inspire new research directions in multimedia system and software design.

KEYWORDS

Parallel programming; task parallelism; task dependency graph
ACM Reference Format:
Chun-Xun Lin, Tsung-Wei Huang, Guannan Guo, and Martin D. F. Wong. 2019. A Modern C++ Parallel Task Programming Library. In Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3343031.3350537

1 INTRODUCTION

Multicore processors are prevalent in desktops, laptops, tablets, and mobile devices. How to effectively utilize these computing resources to maximize software performance is a critical question that software developers must consider, especially when building complex parallel applications such as artificial intelligence, numerical simulation, machine learning, and multimedia big data analytics [1] [2] [3] [4] [5]. Writing parallel code is considered much more difficult than writing its sequential counterpart. Programmers have to pay extra attention to concurrency control to avoid unexpected behavior at runtime, for example, using locks to protect shared data or atomic variables to avoid data races. The situation becomes even more challenging when applications exhibit complex data or operation dependencies, which is typical in real-world problems. As a result, an efficient approach to writing parallel code is necessary.

In this paper, we present Cpp-Taskflow, a modern C++ task-based parallel programming library. Cpp-Taskflow was motivated by a real-world VLSI timing analysis project [6]. Cpp-Taskflow lets users express their parallelism using the intuitive task graph model. The task graph model is simple yet powerful, as it can represent both regular and irregular parallel patterns. It abstracts away complex concurrency management and allows users to focus on exploiting parallelism within their applications. Cpp-Taskflow provides well-designed APIs to keep the code concise and readable. We have a unified task graph construction interface for both static and dynamic parallelism, so users can learn the APIs quickly and use them to implement various parallel patterns. Cpp-Taskflow also supports visualization for program debugging and profiling: users can dump the task graph to inspect the program execution flow and view the thread activities in a Chrome browser. We have conducted experiments on a set of micro-benchmarks and a real-world application [7] against Intel Threading Building Blocks [8] and OpenMP [9]. Compared with these libraries, Cpp-Taskflow achieves comparable or faster runtime with fewer lines of code and better scalability. We understand each library has its own uniqueness and value, and it is up to users to decide which best suits their needs.

Cpp-Taskflow has been used in many industrial and academic projects [10]. We are committed to freely sharing our technical innovation to facilitate research in parallel computing, machine learning, and multimedia. We are working actively with our users to improve Cpp-Taskflow. The project is open-source and more details can be found in [10].
2 CPP-TASKFLOW

In Cpp-Taskflow, programming is centered around two classes: tf::Taskflow and tf::Executor. We explain how to use them in this section.
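As a brief orientation before the individual features, the following minimal end-to-end sketch is ours, assembled from the listings that follow rather than copied from the paper; the single-header include path is an assumption.

#include <taskflow/taskflow.hpp>  // assumption: Cpp-Taskflow's single-header include
#include <iostream>

int main() {
  tf::Taskflow taskflow;  // describes the task dependency graph
  tf::Executor executor;  // owns worker threads and runs graphs

  // Two independent tasks followed by a final task.
  auto [first, second, last] = taskflow.emplace(
    [] () { std::cout << "first\n";  },
    [] () { std::cout << "second\n"; },
    [] () { std::cout << "last\n";   }
  );
  last.gather(first, second);  // "last" runs only after both predecessors finish

  executor.run(taskflow);      // dispatch the graph
  executor.wait_for_all();     // block until all running graphs finish
  return 0;
}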


2.1 Task Dependency Graph

In Cpp-Taskflow, a task is a C++ callable object [7]. To create tasks, the first step is to create an object of the tf::Taskflow class. A taskflow object allows you to build a task dependency graph, where nodes are tasks and directed edges indicate dependencies. Listing 1 shows an example of adding three tasks via the emplace method, which can create multiple tasks at one time. After tasks are created, users can assign names to tasks and specify dependencies between tasks via the name and precede methods, respectively. A task A precedes a task B if task B can only run after task A completes its execution.

tf::Taskflow taskflow;

// Create a task
auto taskA = taskflow.emplace(
  [] () { std::cout << "Task A\n"; }
);

// Create two tasks at one time
auto [taskB, taskC] = taskflow.emplace(
  [] () { std::cout << "Task B\n"; },
  [] () { std::cout << "Task C\n"; }
);

// Name the tasks
taskA.name("taskA");

// Specify the dependency
taskA.precede(taskB, taskC);

Listing 1: Create a task dependency graph.
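Listings 2, 3, and 5 below also use a gather method that the text does not introduce explicitly; as a short aside (our reading, inferred from how those listings use it, with illustrative task names), gather declares dependencies in the opposite direction of precede:

// A.precede(B, C): A must finish before B and C start.
// D.gather(B, C):  D starts only after B and C finish.
taskA.precede(taskB, taskC);
taskD.gather(taskB, taskC);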


2.2 Dynamic Tasking

Another powerful feature of Cpp-Taskflow is dynamic tasking, which enables a task to create and dispatch a task dependency graph at runtime to obtain dynamic parallelism. Listing 2 shows an example: task B spawns a task dependency graph of three tasks. A task that requires dynamic parallelism takes an additional argument of type tf::Subflow and uses the emplace method to create a new task dependency graph. By default, the new task dependency graph joins its parent task. However, users can make it run independently by calling the detach method; a detached task dependency graph joins the end of its parent's task dependency graph. Figure 1 shows the spawned task dependency graphs in joined and detached modes, respectively. Dynamic tasking empowers users to parallelize frequently used computing patterns such as recursive and nested flows.

tf::Taskflow flow;

// Create three tasks
auto [taskA, taskC, taskD] = flow.emplace(
  [] () { std::cout << "Task A\n"; },
  [] () { std::cout << "Task C\n"; },
  [] () { std::cout << "Task D\n"; }
);

// Create a task with subflow
auto taskB = flow.emplace(
  [] (auto &subflow) {
    std::cout << "Task B\n";
    // Spawn a new task dependency graph
    auto [B1, B2, B3] = subflow.emplace(
      [] () { std::cout << "Task B1\n"; },
      [] () { std::cout << "Task B2\n"; },
      [] () { std::cout << "Task B3\n"; }
    );
    B3.gather(B1, B2);

    // Detach the new task dependency graph
    subflow.detach();
  });

// Specify the dependency
taskA.precede(taskB, taskC);
taskD.gather(taskB, taskC);

Listing 2: An example of dynamic tasking.

Figure 1: Comparison of joined and detached subflows: (a) a joined subflow; (b) a detached subflow.
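As a rough sketch of the recursive pattern mentioned above (ours, not from the paper), the following program counts the nodes of a binary tree by letting each task spawn a subflow for its children. The Node type, the visit function, the counter, and the include path are illustrative assumptions; the code relies only on the default join behavior described in the text.

#include <taskflow/taskflow.hpp>  // assumption: Cpp-Taskflow's single-header include
#include <atomic>
#include <iostream>

struct Node {
  Node* left  = nullptr;
  Node* right = nullptr;
};

std::atomic<int> visited{0};

// Each task counts one node and recursively spawns tasks for its children.
// The spawned subflow joins its parent task by default; call subflow.detach()
// instead if the traversal does not need to synchronize with the parent.
void visit(Node* node, tf::Subflow& subflow) {
  if (node == nullptr) return;
  ++visited;
  subflow.emplace([n = node->left ] (tf::Subflow& sf) { visit(n, sf); });
  subflow.emplace([n = node->right] (tf::Subflow& sf) { visit(n, sf); });
}

int main() {
  Node root, a, b;  // a small tree: root -> {a, b}
  root.left  = &a;
  root.right = &b;

  tf::Taskflow taskflow;
  taskflow.emplace([&root] (tf::Subflow& sf) { visit(&root, sf); });

  tf::Executor executor;
  executor.run(taskflow);
  executor.wait_for_all();
  std::cout << "visited " << visited << " nodes\n";
  return 0;
}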

2.3 Composition

A useful feature of the task dependency graph is composability. Users can use the composed_of method to compose several task dependency graphs into a larger and more complex one. The composed_of method returns a module task, and users can use the precede method to add dependencies between module tasks and other tasks. Listing 3 shows an example of task dependency graph composition.

tf::Taskflow fA;

// Create four tasks
auto [fA1, fA2, fA3, fA4] = fA.emplace(
  [] () { std::cout << "Task fA1\n"; },
  ...
);
fA1.precede(fA2, fA3);
fA4.gather(fA2, fA3);

tf::Taskflow fB;

// Create three tasks
auto [fB1, fB2, fB3] = fB.emplace(
  [] () { std::cout << "Task fB1\n"; },
  ...
);

auto moduleA = fB.composed_of(fA);

fB1.precede(moduleA, fB2);
moduleA.precede(fB3);
fB2.precede(fB3);

Listing 3: An example of task dependency graph composition.

Figure 2: The task dependency graphs of the example in Listing 3.
2.4 Execution

After creating task dependency graphs, the next step is to dispatch them to an executor object of type tf::Executor. An executor object manages thread construction and destruction and provides several methods to execute task dependency graphs through an efficient work-stealing algorithm. Table 1 summarizes the execution methods and Listing 4 demonstrates their usage.

Method        Description
run           Execute a graph once
run_n         Execute a graph multiple times
run_until     Execute a graph until a condition is met
wait_for_all  Wait until all running graphs finish

Table 1: Summary of execution methods.

tf::Taskflow tf;

// Add tasks to tf
...

tf::Executor executor;

executor.run(tf);       // Run the flow once
executor.run_n(tf, 6);  // Run the flow six times
// Run the flow until the number becomes 0
executor.run_until(tf, [number = 4] () mutable {
  return number-- == 0;
});

Listing 4: Demonstration of different execution methods.
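As a small supplementary sketch of ours (not from the paper), one executor can drive several taskflows at once, and wait_for_all from Table 1 then blocks until every dispatched graph has finished; the taskflow names and include path are illustrative.

#include <taskflow/taskflow.hpp>  // assumption: Cpp-Taskflow's single-header include
#include <iostream>

int main() {
  tf::Taskflow flowA, flowB;
  flowA.emplace([] () { std::cout << "A\n"; });
  flowB.emplace([] () { std::cout << "B\n"; });

  tf::Executor executor;
  executor.run(flowA);       // dispatched asynchronously
  executor.run_n(flowB, 3);  // run flowB three times
  executor.wait_for_all();   // block until all running graphs finish
  return 0;
}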
2.5 Debugging and Profiling

Debugging a parallel program is very challenging due to its non-deterministic nature. Cpp-Taskflow supports the visualization of task dependency graphs to let users inspect the task execution flow. Users can use the name method to assign a taskflow object a name and the dump method to export the object's task graph in DOT format [11]. Listing 5 shows an example of naming and dumping a task dependency graph. Figure 3 shows the task dependency graphs of the two taskflow objects.

tf::Taskflow fA;

// Naming the taskflow object
fA.name("Taskflow_A");

// Add seven tasks
auto [A1, A2, A3, A4, A5, A6, A7] = fA.emplace(
  [] () { std::cout << "A1\n"; },
  ...
);

// Specify dependency
A1.precede(A3, A4);
A2.precede(A5);
A6.gather(A3, A5);
A4.precede(A7);

// Dump the task dependency graph
std::cout << fA.dump() << std::endl;

tf::Taskflow fB;

// Add five tasks
auto [B1, B2, B3, B4, B5] = fB.emplace(
  [] () { std::cout << "B1\n"; },
  ...
);

// Compose taskflow A
auto moduleA1 = fB.composed_of(fA);

// Specify dependency
B1.precede(B2, moduleA1);
B2.precede(B3, B4);
B5.gather(B3, B4, moduleA1);

std::cout << fB.dump() << std::endl;

Listing 5: Visualization of a task dependency graph.

Figure 3: The task dependency graphs of two taskflow objects in Listing 5.

Profiling is very important when developers analyze their application's performance. Cpp-Taskflow allows users to record thread activities and visualize them in a Chrome browser. To enable profiling, users create an observer of type tf::ExecutorObserver through the executor's make_observer method. An observer records each task's start time (via the on_entry method) and end time (via the on_exit method) during execution. The observer can dump the recorded timestamps into a JSON file, and users can visualize the execution timeline by loading the JSON file in the chrome://tracing developer tool. Listing 6 shows how to create an observer to monitor the thread activities. Figure 4 displays the task execution timeline in a Chrome browser.


tf::Taskflow taskflow;
tf::Executor executor;

// Create an observer
auto observer = executor.make_observer<tf::ExecutorObserver>();

// Add tasks and dispatch the flow to execution
...

// Dump the timestamps to a JSON file
std::ofstream ofs("timestamps.json");
observer->dump(ofs);

Listing 6: Use an observer to monitor the thread activities.

Figure 4: Thread activities displayed in chrome://tracing.

3 A MACHINE LEARNING APPLICATION

Machine learning has been successfully applied to several multimedia topics such as image classification and speech recognition [4] [5]. We demonstrate applying Cpp-Taskflow to parallelize a machine learning application on the MNIST dataset [12] and compare its performance and coding effort with OpenMP [9]. The MNIST dataset contains images of handwritten digits and is widely used to test the effectiveness of machine learning algorithms. In this demonstration, we build a 5-layer deep neural network (DNN) to classify those images. We adopt the task pipeline strategy proposed by [7] to parallelize the DNN training. Each batch starts with a task for forward propagation, followed by a sequence of gradient calculation and weight update tasks for each layer. We pipeline the gradient calculation and weight update tasks between successive layers to enable parallelism within each batch. Next, we create tasks for data shuffling per epoch, allocating additional data storage so that a shuffle task can start early and prepare the data for later epochs.

We compare the implementations of OpenMP [9] and Cpp-Taskflow. OpenMP is the most popular parallel programming library in high-performance computing, and we use OpenMP's task depend clause to implement this parallelization strategy. For Cpp-Taskflow, we implement the application using the taskflow object and use the executor's run method for execution. We run all implementations on a machine with an Intel Xeon W-2175 processor and 128 GB of memory, and we launch 10 threads in this experiment. The operating system is Ubuntu 18.04 and the OpenMP version is 4.5 (201511). The learning rate is set to 0.0001, and for each number of epochs we run five times and take the average. During the experiment we use the system command taskset to bind the process to the first 10 cores. Figure 5 plots the runtime of both implementations and Table 2 shows the code complexity measured by Lizard [13]. The Cpp-Taskflow implementation is slightly faster than the OpenMP implementation. Regarding code complexity, the OpenMP implementation is 35% longer than the Cpp-Taskflow implementation in lines of code.

Figure 5: DNN training runtime (seconds) of the OpenMP and Cpp-Taskflow (taskflow object) implementations versus the number of training epochs on MNIST with 10 cores.

Table 2: Code complexity [13] of the two implementations.

Library    Total NLOC    Avg Token    Avg CCN
OpenMP     93            1058         11
Taskflow   60            600          11

NLOC: lines of code. CCN: cyclomatic complexity number.
Note: because Lizard treats compiler directives (lines starting with #) as comments, we remove the # when measuring the OpenMP implementation.
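To make the per-layer pipeline described above concrete, the following is a much simplified sketch of ours, not the paper's implementation: one forward task precedes a chain of per-layer gradient tasks, each gradient task precedes its layer's weight-update task, and the absence of an edge between an update and the next gradient is what lets the two overlap. The tf::Task handle name, the task bodies, the include path, and the single-batch scope are illustrative assumptions.

#include <taskflow/taskflow.hpp>  // assumption: Cpp-Taskflow's single-header include
#include <iostream>
#include <vector>

int main() {
  constexpr int num_layers = 5;  // the paper's DNN has 5 layers

  tf::Taskflow taskflow;

  // One training batch: forward propagation first ...
  auto forward = taskflow.emplace([] () { std::cout << "forward pass\n"; });

  // ... then, from the last layer down to the first, a gradient task and a
  // weight-update task per layer (backpropagation order).
  std::vector<tf::Task> grad, update;
  for (int l = num_layers - 1; l >= 0; --l) {
    grad.push_back(taskflow.emplace([l] () {
      std::cout << "gradient of layer " << l << "\n";
    }));
    update.push_back(taskflow.emplace([l] () {
      std::cout << "weight update of layer " << l << "\n";
    }));
    grad.back().precede(update.back());  // an update needs its own gradient
    if (grad.size() == 1) {
      forward.precede(grad.back());      // backpropagation starts after the forward pass
    } else {
      grad[grad.size() - 2].precede(grad.back());  // gradients flow layer by layer
    }
    // There is no edge from update.back() to the next layer's gradient task,
    // so this layer's weight update overlaps with the gradient calculation of
    // the layer below; that is the within-batch pipeline.
  }

  tf::Executor executor;
  executor.run(taskflow);
  executor.wait_for_all();
  return 0;
}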
4 AVAILABILITY

Cpp-Taskflow is open-source on GitHub [10] under the MIT license. The API documentation, tutorials, and cookbook are also available on GitHub. We have presented Cpp-Taskflow at CppCon, the premier C++ developer conference, and the video is available on YouTube [14].

5 ACKNOWLEDGEMENT

Cpp-Taskflow is supported by NSF Grant CCF-1718883 and DARPA Grant FA 8650-18-2-7843.

REFERENCES

[1] W. Zhu, P. Cui, Z. Wang, and G. Hua. Multimedia big data computing. IEEE MultiMedia, 22(3):96–c3, July 2015.
[2] Z. Wang, S. Mao, L. Yang, and P. Tang. A survey of multimedia big data. China Communications, 15(1):155–176, Jan 2018.
[3] Samira Pouyanfar, Yimin Yang, Shu-Ching Chen, Mei-Ling Shyu, and S. S. Iyengar. Multimedia big data analytics: A survey. ACM Comput. Surv., 51(1):10:1–10:34, January 2018.
[4] Jitao Sang, Jun Yu, Ramesh Jain, Rainer Lienhart, Peng Cui, and Jiashi Feng. Deep learning for multimedia: Science or technology? In Proceedings of the 26th ACM International Conference on Multimedia, MM '18, pages 1354–1355, New York, NY, USA, 2018. ACM.
[5] Y. Yan, M. Chen, M. Shyu, and S. Chen. Deep learning for imbalanced multimedia data classification. In 2015 IEEE International Symposium on Multimedia (ISM), pages 483–488, Dec 2015.
[6] T.-W. Huang and Martin D. F. Wong. OpenTimer: A high-performance timing analysis tool. In IEEE/ACM ICCAD, pages 895–902, 2015.
[7] T.-W. Huang, C.-X. Lin, Guannan Guo, and Martin D. F. Wong. Cpp-Taskflow: Fast task-based parallel programming using modern C++. In IEEE IPDPS, pages 974–983, 2019.
[8] Intel Threading Building Blocks. [Online]. Available: https://www.threadingbuildingblocks.org.
[9] OpenMP. [Online]. Available: https://www.openmp.org/.
[10] Cpp-Taskflow. [Online]. Available: https://github.com/cpp-taskflow/cpp-taskflow.
[11] The DOT Language. [Online]. Available: https://www.graphviz.org/.
[12] MNIST. [Online]. Available: http://yann.lecun.com/exdb/mnist/.
[13] Lizard. [Online]. Available: http://www.lizard.ws/.
[14] Cpp-Taskflow lightning talk. [Online]. Available: https://www.youtube.com/watch?v=ho9bqIJkvkc.

