0% found this document useful (0 votes)

201 views

Patterns of Parallel Programming

Patterns of Parallel Programming in Visual Basic

Uploaded by

jayanthan85

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

201 views

Patterns of Parallel Programming

Patterns of Parallel Programming in Visual Basic

Uploaded by

jayanthan85

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 117

PATTERNS OF PARALLEL PROGRAMMING

UNDERSTANDING AND APPLYING PARALLEL PATTERNS WITH THE .NET FRAMEWORK 4 AND VISUAL BASIC

Stephen Toub Parallel Computing Platform Microsoft Corporation

Abstract This document provides an in-depth tour of support in the Microsoft .NET Framework 4 for parallel programming. This includes an examination of common parallel patterns and how theyre implemented without and with this new support, as well as best practices for developing parallel components utilizing parallel patterns.

Last Updated: July 1, 2010

This material is provided for informational purposes only. Microsoft makes no warranties, express or implied. 2010 Microsoft Corporation.

TABLE OF CONTENTS
Introduction ...................................................................................................................................................................3 Delightfully Parallel Loops .............................................................................................................................................4 Fork/Join ......................................................................................................................................................................35 Passing Data .................................................................................................................................................................48 Producer/Consumer ....................................................................................................................................................52 Aggregations ................................................................................................................................................................66 MapReduce ..................................................................................................................................................................74 Dependencies ..............................................................................................................................................................76 Data Sets of Unknown Size ..........................................................................................................................................87 Speculative Processing ................................................................................................................................................93 Laziness ........................................................................................................................................................................96 Shared State ..............................................................................................................................................................104 Conclusion .................................................................................................................................................................117

Patterns of Parallel Programming

Page 2

INTRODUCTION
Patterns are everywhere, yielding software development best practices and helping to seed new generations of developers with immediate knowledge of established directions on a wide array of problem spaces. Patterns represent successful (or in the case of anti-patterns, unsuccessful) repeated and common solutions developers have applied time and again in particular architectural and programming domains. Over time, these tried and true practices find themselves with names, stature, and variations, helping further to proliferate their application and to jumpstart many a project. Patterns dont just manifest at the macro level. Whereas design patterns typically cover architectural structure or methodologies, coding patterns and building blocks also emerge, representing typical ways of implementing a specific mechanism. Such patterns typically become ingrained in our psyche, and we code with them on a daily basis without even thinking about it. These patterns represent solutions to common tasks we encounter repeatedly. Of course, finding good patterns can happen only after many successful and failed attempts at solutions. Thus for new problem spaces, it can take some time for them to gain a reputation. Such is where our industry lies today with regards to patterns for parallel programming. While developers in high-performance computing have had to develop solutions for supercomputers and clusters for decades, the need for such experiences has only recently found its way to personal computing, as multi-core machines have become the norm for everyday users. As we move forward with multi-core into the manycore era, ensuring that all software is written with as much parallelism and scalability in mind is crucial to the future of the computing industry. This makes patterns in the parallel computing space critical to that same future. In general, a multi-core chip refers to eight or fewer homogeneous cores in one microprocessor package, whereas a manycore chip has more than eight possibly heterogeneous cores in one microprocessor package. In a manycore system, all cores share the resources and services, including memory and disk access, provided by the operating system. The Manycore Shift, (Microsoft Corp., 2007) In the .NET Framework 4, a slew of new support has been added to handle common needs in parallel programming, to help developers tackle the difficult problem that is programming for multi-core and manycore. Parallel programming is difficult for many reasons and is fraught with perils most developers havent had to experience. Issues of races, deadlocks, livelocks, priority inversions, two-step dances, and lock convoys typically have no place in a sequential world, and avoiding such issues makes quality patterns all the more important. This new support in the .NET Framework 4 provides support for key parallel patterns along with building blocks to help enable implementations of new ones that arise. To that end, this document provides an in-depth tour of support in the .NET Framework 4 for parallel programming, common parallel patterns and how theyre implemented without and with this new support, and best practices for developing parallel components in this brave new world. This document only minimally covers the subject of asynchrony for scalable, I/O-bound applications: instead, it focuses predominantly on applications of CPU-bound workloads and of workloads with a balance of both CPU and I/O activity. This document also does not cover Visual F# in Visual Studio 2010, which includes language-based support for several key parallel patterns.

Patterns of Parallel Programming

Page 3

DELIGHTFULLY PARALLEL LOOPS

Arguably the most well-known parallel pattern is that befitting Embarrassingly Parallel algorithms. Programs that fit this pattern are able to run well in parallel because the many individual operations being performed may operate in relative independence, with few or no dependencies between operations such that they can be carried out in parallel efficiently. Its unfortunate that the embarrassing moniker has been applied to such programs, as theres nothing at all embarrassing about them. In fact, if more algorithms and problem domains mapped to the embarrassing parallel domain, the software industry would be in a much better state of affairs. For this reason, many folks have started using alternative names for this pattern, such as conveniently parallel, pleasantly parallel, and delightfully parallel, in order to exemplify the true nature of these problems. If you find yourself trying to parallelize a problem that fits this pattern, consider yourself fortunate, and expect that your parallelization job will be much easier than it otherwise could have been , potentially even a delightful activity. A significant majority of the work in many applications and algorithms is done through loop control constructs. Loops, after all, often enable the application to execute a set of instructions over and over, applying logic to discrete entities, whether those entities are integral values, such as in the case of a for loop, or sets of data, such as in the case of a for each loop. Many languages have built-in control constructs for these kinds of loops, Microsoft Visual C# and Microsoft Visual Basic being among them, the former with for and foreach keywords, and the latter with For and For Each keywords. For problems that may be considered delightfully parallel, the entities to be processed by individual iterations of the loops may execute concurrently: thus, we need a mechanism to enable such parallel processing.

IMPLEMENTING A PARALLEL LOOPING CONSTRUCT

As delightfully parallel loops are such a predominant pattern, its really important to understand the ins and outs of how they work, and all of the tradeoffs implicit to the pattern. To understand these concepts further, well build a simple parallelized loop using support in the .NET Framework 3.5, prior to the inclusion of the more comprehensive parallelization support introduced in the .NET Framework 4. First, we need a signature. To parallelize a for loop, well implement a method that takes three para meters: a lower-bound, an upper-bound, and a delegate for the loop body that accepts as a parameter an integral value to represent the current iteration index (that delegate will be invoked once for each iteration). Note that we have several options for the behavior of these parameters. With Visual Basic and C#, the vast majority of for loops are written in a manner similar to the following:

Visual Basic
For i As Integer = 0 To upperBound ' ... loop body here Next

C#
for (int i = 0; i < upperBound; i++) { // ... loop body here }

Contrary to what a cursory read may tell you, these two loops are not identical: the Visual Basic loop will execute one more iteration than will the C# loop. This is because Visual Basic treats the supplied upper-bound as inclusive, Patterns of Parallel Programming Page 4

whereas we explicitly specified it in C# to be exclusive through our use of the less-than operator. For our purposes here, well follow suit to the C# implementation, and well have the upper -bound parameter to our parallelized loop method represent an exclusive upper-bound:

Visual Basic
Public Shared Sub MyParallelFor( ByVal inclusiveLowerBound As Integer, ByVal exclusiveUpperBound As Integer, ByVal body As Action(Of Integer)) End Sub

Our implementation of this method will invoke the body of the loop once per element in the range [inclusiveLowerBound,exclusiveUpperBound), and will do so with as much parallelization as it can muster. To accomplish that, we first need to understand how much parallelization is possible. Wisdom in parallel circles often suggests that a good parallel implementation will use one thread per core. After all, with one thread per core, we can keep all cores fully utilized. Any more threads, and the operating system will need to context switch between them, resulting in wasted overhead spent on such activities; any fewer threads, and theres no chance we can take advantage of all that the machine has to offer, as at least one core will be guaranteed to go unutilized. This logic has some validity, at least for certain classes of problems. But the logic is also predicated on an idealized and theoretical concept of the machine. As an example of where this notion may break down, to do anything useful threads involved in the parallel processing need to access data, and accessing data requires trips to caches or main memory or disk or the network or other stores that can cost considerably in terms of access times; while such activities are in flight, a CPU may be idle. As such, while a good parallel implementation may assume a default of one-thread-per-core, an open mindedness to other mappings can be beneficial. For our initial purposes here, however, well stick with the one-thread-per core notion. With the .NET Framework, retrieving the number of logical processors is achieved using the System.Environment class, and in particular its ProcessorCount property. Under the covers, .NET retrieves the corresponding value by delegating to the GetSystemInfo native function exposed from kernel32.dll. This value doesnt necessarily correlate to the number of physical processors or even to the number of physical cores in the machine. Rather, it takes into account the number of hardware threads available. As an example, on a machine with two sockets, each with four cores, each with two hardware threads (sometimes referred to as hyperthreads), Environment.ProcessorCount would return 16. Starting with Windows 7 and Windows Server 2008 R2, the Windows operating system supports greater than 64 logical processors, and by default (largely for legacy application reasons), access to these cores is exposed to applications through a new concept known as processor groups. The .NET Framework does not provide managed access to the processor group APIs, and thus Environment.ProcessorCount will return a value capped at 64 (the maximum size of a processor group), even if the machine has a larger number of processors. Additionally, in a 32-bit process, ProcessorCount will be capped further to 32, in order to map well to the 32-bit mask used to represent processor affinity (a requirement that a particular thread be scheduled for execution on only a specific subset of processors). Patterns of Parallel Programming Page 5

Once we know the number of processors we want to target, and hence the number of threads, we can proceed to create one thread per core. Each of those threads will process a portion of the input range, invoking the supplied Action<int> delegate for each iteration in that range. Such processing requires another fundamental operation of parallel programming, that of data partitioning. This topic will be discussed in greater depth later in this document; suffice it to say, however, that partitioning is a distinguishing concept in parallel implementations, one that separates it from the larger, containing paradigm of concurrent programming. In concurrent programming, a set of independent operations may all be carried out at the same time. In parallel programming, an operation must first be divided up into individual sub-operations so that each sub-operation may be processed concurrently with the rest; that division and assignment is known as partitioning. For the purposes of this initial implementation, well use a simple partitioning scheme: statically dividing the input range into one range per thread. Here is our initial implementation:

Visual Basic
Public Shared ByVal ByVal ByVal Sub MyParallelFor( inclusiveLowerBound As Integer, exclusiveUpperBound As Integer, body As Action(Of Integer))

' Determine the number of iterations to be processed, the number of ' cores to use, and the approximate number of iterations to process ' in each thread. Dim size = exclusiveUpperBound inclusiveLowerBound Dim numProcs = Environment.ProcessorCount Dim range = size \ numProcs ' Use a thread for each partition. Create them all, ' start them all, wait on them all. Dim threads = New List(Of Thread)(numProcs) For p = 0 To numProcs 1 Dim start = p * range + inclusiveLowerBound Dim [end] = If((p = numProcs - 1), exclusiveUpperBound, start + range) threads.Add(New Thread(Sub() For i = start To [end] 1 body(i) Next i End Sub)) Next p For Each thread In threads thread.Start() Next thread For Each thread In threads thread.Join() Next thread End Sub

There are several interesting things to note about this implementation. One is that for each range, a new thread is utilized. That thread exists purely to process the specified partition, and then it terminates. This has several positive and negative implications. The primary positive to this approach is that we have dedicated threading resources for this loop, and it is up to the operating system to provide fair scheduling for these threads across the system. This positive, however, is typically outweighed by several significant negatives. One such negative is the cost of a thread. By default in the .NET Framework 4, a thread consumes a megabyte of stack space, whether or not that space is used for currently executing functions. In addition, spinning up a new thread and tearing one Patterns of Parallel Programming Page 6

down are relatively costly actions, especially if compared to the cost of a small loop doing relatively few iterations and little work per iteration. Every time we invoke our loop implementation, new threads will be spun up and torn down. Theres another, potentially more damaging impact: oversubscription. As we move forward in the world of multicore and into the world of manycore, parallelized components will become more and more common, and its quite likely that such components will themselves be used concurrently. If such components each used a loop like the above, and in doing so each spun up one thread per core, wed have two components each fighting for the machines resources, forcing the operating system to spend more time context switching between components . Context switching is expensive for a variety of reasons, including the need to persist details of a threads execution prior to the operating system context switching out the thread and replacing it with another. Potentially more importantly, such context switches can have very negative effects on the caching subsystems of the machine. When threads need data, that data needs to be fetched, often from main memory. On modern architectures, the cost of accessing data from main memory is relatively high compared to the cost of running a few instructions over that data. To compensate, hardware designers have introduced layers of caching, which serve to keep small amounts of frequently-used data in hardware significantly less expensive to access than main memory. As a thread executes, the caches for the core on which its executing tend to fill with data appropriate to that threads execution, improving its performance. When a thread gets context switched out, the caches will shift to containing data appropriate to that new thread. Filling the caches requires more expensive trips to main memory. As a result, the more context switches there are between threads, the more expensive trips to main memory will be required, as the caches thrash on the differing needs of the threads using them. Given these costs, oversubscription can be a serious cause of performance issues. Luckily, the new concurrency profiler views in Visual Studio 2010 can help to identify these issues, as shown here:

In this screenshot, each horizontal band represents a thread, with time on the x-axis. Green is execution time, red is time spent blocked, and yellow is time where the thread could have run but was preempted by another thread. The more yellow there is, the more oversubscription there is hurting performance. To compensate for these costs associated with using dedicated threads for each loop, we can resort to pools of threads. The system can manage the threads in these pools, dispatching the threads to access work items queued for their processing, and then allowing the threads to return to the pool rather than being torn down. This addresses many of the negatives outlined previously. As threads arent constantly being created and torn down, the cost of their life cycle is amortized over all the work items they process. Moreover, the manager of the thread pool can enforce an upper-limit on the number of threads associated with the pool at any one time, placing a limit on the amount of memory consumed by the threads, as well as on how much oversubscription is allowed. Ever since the .NET Framework 1.0, the System.Threading.ThreadPool class has provided just such a thread pool, and while the implementation has changed from release to release (and significantly so for the .NET Framework 4), Patterns of Parallel Programming Page 7

the core concept has remained constant: the .NET Framework maintains a pool of threads that service work items provided to it. The main method for doing this is the Shared QueueUserWorkItem. We can use that support in a revised implementation of our parallel For loop:

Visual Basic
Public Shared Sub MyParallelFor( ByVal inclusiveLowerBound As Integer, ByVal exclusiveUpperBound As Integer, ByVal body As Action(Of Integer)) ' Determine the number of iterations to be processed, the number of ' cores to use, and the approximate number of iterations to process in ' each thread. Dim size = exclusiveUpperBound - inclusiveLowerBound Dim numProcs = Environment.ProcessorCount Dim range = size \ numProcs ' Keep track of the number of threads remaining to complete. Dim remaining = numProcs Using mre As New ManualResetEvent(False) ' Create each of the threads. For p As Integer = 0 To numProcs - 1 Dim start = p * range + inclusiveLowerBound Dim [end] = If((p = numProcs - 1), exclusiveUpperBound, start + range) ThreadPool.QueueUserWorkItem(Sub() For i = start To [end] - 1 body(i) Next i If Interlocked.Decrement(remaining) = 0 Then mre.Set() End Sub) Next p ' Wait for all threads to complete. mre.WaitOne() End Using End Sub

This removes the inefficiencies in our application related to excessive thread creation and tear down, and it minimizes the possibility of oversubscription. However, this inefficiency was just one problem with the implementation: another potential problem has to do with the static partitioning we employed. For workloads that entail the same approximate amount of work per iteration, and when running on a relatively quiet machine (meaning a machine doing little else besides the target workload), static partitioning represents an effective and efficient way to partition our data set. However, if the workload is not equivalent for each iteration, either due to the nature of the problem or due to certain partitions completing more slowly due to being preempted by other significant work on the system, we can quickly find ourselves with a load imbalance. The pattern of a loadimbalance is very visible in the following visualization as rendered by the concurrency profiler in Visual Studio 2010.

Patterns of Parallel Programming

Page 8

In this output from the profiler, the x-axis is time and the y-axis is the number of cores utilized at that time in the applications executions. Green is utilization by our application, yellow is utilization by another application, red is utilization by a system process, and grey is idle time. This trace resulted from the unfortunate assignment of different amounts of work to each of the partitions; thus, some of those partitions completed processing sooner than the others. Remember back to our assertions earlier about using fewer threads than there are cores to do work? Weve now degraded to that situation, in that for a portion of this loops execution, we were executing with fewer cores than were available. By way of example, lets consider a parallel loop from 1 to 12 (inclusive on both ends), where each iteration does N seconds of work with N defined as the loop iteration value (that is, iteration #1 will require 1 second of computation, iteration #2 will require two seconds, and so forth). All in all, this loop will require ((12*13)/2) == 78 seconds of sequential processing time. In an ideal loop implementation on a dual core system, we could finish this loops processing in 39 seconds. This could be accomplished by having one core process iterations 6, 10, 11, and 12, with the other core processing the rest of the iterations.

However, with the static partitioning scheme weve employed up until this point, one core will be assigned the range [1,6] and the other the range [7,12].

Patterns of Parallel Programming

Page 9

As such, the first core will have 21 seconds worth of work, leaving the latter core 57 seconds worth of work. Since the loop isnt finished until all iterations have been processed, our loops processing time is limited by the maximum processing time of each of the two partitions, and thus our loop completes in 57 seconds instead of the aforementioned possible 39 seconds. This represents an approximate 50 percent decrease in potential performance, due solely to an inefficient partitioning. Now you can see why partitioning has such a fundamental place in parallel programming. Different variations on static partitioning are possible. For example, rather than assigning ranges, we could use a form of round-robin, where each thread has a unique identifier in the range [0,# of threads), and where each thread processes indices from the loop where the index mod the number of threads matches the threads identifier. For example, with the iteration space [0,12) and with four threads, thread #0 would process iteration values 0, 3, 6, and 9; thread #1 would process iteration values 1, 4, 7, and 10; and so on. If we were to apply this kind of round-robin partitioning to the previous example, instead of one thread taking 21 seconds and the other taking 57 seconds, one thread would require 36 seconds and the other 42 seconds, resulting in a much smaller discrepancy from the optimal runtime of 38 seconds.

To do the best static partitioning possible, you need to be able to accurately predict ahead of time how long all the iterations will take. Thats rarely feasible, resulting in a need for a more dynamic partitioning, where the system can adapt to changing workloads quickly. We can address this by shifting to the other end of the partitioning tradeoffs spectrum, with as much load-balancing as possible.

Fully Static

Spectrum of Partitioning Tradeoffs

Fully Dynamic

Less Synchronization

More Load-Balancing

To do that, rather than pushing to each of the threads a given set of indices to process, we can have the threads compete for iterations. We employ a pool of the remaining iterations to be processed, which initially starts filled with all iterations. Until all of the iterations have been processed, each thread goes to the iteration pool, removes an iteration value, processes it, and then repeats. In this manner, we can achieve in a greedy fashion an approximation for the optimal level of load-balancing possible (the true optimum could only be achieved with a priori knowledge of exactly how long each iteration would take). If a thread gets stuck processing a particular long iteration, the other threads will compensate by processing work from the pool in the meantime. Of course, even with this scheme you can still find yourself with a far from optimal partitioning (which could occur if one thread happened to get stuck with several pieces of work significantly larger than the rest), but without knowledge of how much processing time a given piece of work will require, theres little more that can be done .

Patterns of Parallel Programming

Page 10

Heres an example implementation that takes load -balancing to this extreme. The pool of iteration values is maintained as a single integer representing the next iteration available, and the threads involved in the processing remove items by atomically incrementing this integer:

Visual Basic
Public Shared Sub MyParallelFor( ByVal inclusiveLowerBound As Integer, ByVal exclusiveUpperBound As Integer, ByVal body As Action(Of Integer)) ' Get the number of processors, initialize the number of remaining ' threads, and set the starting point for the iteration. Dim numProcs = Environment.ProcessorCount Dim remainingWorkItems = numProcs Dim nextIteration = inclusiveLowerBound Using mre As New ManualResetEvent(False) ' Create each of the work items. For p = 0 To numProcs - 1 ThreadPool.QueueUserWorkItem(Sub() Dim index As Integer index = Interlocked.Increment(nextIteration) - 1 Do While index < exclusiveUpperBound body(index) index = Interlocked.Increment(nextIteration) - 1 Loop If Interlocked.Decrement(remainingWorkItems) = 0 Then mre.Set() End Sub) Next p ' Wait for all threads to complete. mre.WaitOne() End Using End Sub

This is not a panacea, unfortunately. Weve gone to the other end of the spectrum, trading quality load -balancing for additional overheads. In our previous static partitioning implementations, threads were assigned ranges and were then able to process those ranges completely independently from the other threads. There was no need to synchronize with other threads in order to determine what to do next, because every thread could determine independently what work it needed to get done. For workloads that have a lot of work per iteration, the cost of synchronizing between threads so that each can determine what to do next is negligible. But for workloads that do very little work per iteration, that synchronization cost can be so expensive (relatively) as to overshadow the actual work being performed by the loop. This can make it more expensive to execute in parallel than to execute serially. Consider an analogy: shopping with some friends at a grocery store. You come into the store with a grocery list, and you rip the list into one piece per friend, such that every friend is responsible for retrieving the elements on his or her list. If the amount of time required to retrieve the elements on each list is approximately the same as on every other list, youve done a good job of partitioning the work amongst your team , and will likely find that your time at the store is significantly less than if you had done all of the shopping yourself. But now suppose that each list is not well balanced, with all of the items on one friends list spread out over the entire store, while all of the items on another friends list are concentrated in the same aisle. You could address Patterns of Parallel Programming Page 11

this inequity by assigning out one element at a time. Every time a friend retrieves a food item, he or she brings it back to you at the front of the store and determines in conjunction with you which food item to retrieve next. If a particular food item takes a particularly long time to retrieve, such as ordering a custom cut piece of meat at the deli counter, the overhead of having to go back and forth between you and the merchandise may be negligible. For simply retrieving a can from a shelf, however, the overhead of those trips can be dominant, especially if multiple items to be retrieved from a shelf were near each other and could have all been retrieved in the same trip with minimal additional time. You could spend so much time (relatively) parceling out work to your friends and determining what each should buy next that it would be faster for you to just grab all of the food items in your list yourself. Of course, we dont need to pick one extreme or the other . As with most patterns, there are variations on themes. For example, in the grocery store analogy, you could have each of your friends grab several items at a time, rather than grabbing one at a time. This amortizes the overhead across the size of a batch, while still having some amount of dynamism:

Visual Basic
Public Shared Sub MyParallelFor( ByVal inclusiveLowerBound As Integer, ByVal exclusiveUpperBound As Integer, ByVal body As Action(Of Integer)) ' Get the number of processors, initialize the number of remaining ' threads, and set the starting point for the iteration. Dim numProcs = Environment.ProcessorCount Dim remainingWorkItems = numProcs Dim nextIteration = inclusiveLowerBound Const batchSize = 3 Using mre As New ManualResetEvent(False) ' Create each of the work items. For p = 0 To numProcs - 1 ' In a real implementation, wed need to handle ' overflow on this arithmetic. ThreadPool.QueueUserWorkItem(Sub() Dim index As Integer index = Interlocked.Add(nextIteration, batchSize) - batchSize Do While index < exclusiveUpperBound Dim [end] As Integer = index + batchSize If [end] >= exclusiveUpperBound Then [end] = exclusiveUpperBound End If For i = index To [end] - 1 body(i) Next i index = Interlocked.Add(nextIteration, batchSize) - batchSize Loop If Interlocked.Decrement(remainingWorkItems) = 0 Then mre.Set() End Sub) Next p ' Wait for all threads to complete mre.WaitOne() End Using End Sub

Patterns of Parallel Programming

Page 12

No matter what tradeoffs you make between overheads and load-balancing, they are tradeoffs. For a particular problem, you might be able to code up a custom parallel loop algorithm mapping to this pattern that suits your particular problem best. That could result in quite a bit of custom code, however. In general, a good solution is one that provides quality results for most problems, minimizing overheads while providing sufficient load-balancing, and the .NET Framework 4 includes just such an implementation in the new System.Threading.Tasks.Parallel class.

PARALLEL.FOR
As delightfully parallel problems represent one of the most common patterns in parallel programming, its natural that when support for parallel programming is added to a mainstream library, support for delightfully parallel loops is included. The .NET Framework 4 provides this in the form of the Shared Parallel class in the new System.Threading.Tasks namespace in mscorlib.dll. The Parallel class provides just three methods, albeit each with several overloads. One of these methods is For, providing multiple signatures, one of which is almost identical to the signature for MyParallelFor shown previously:

Visual Basic
Public Shared Function [For]( ByVal fromInclusive As Integer, ByVal toExclusive As Integer, ByVal body As Action(Of Integer)) As ParallelLoopResult

As with our previous implementations, the For method accepts three parameters: an inclusive lower-bound, an exclusive upper-bound, and a delegate to be invoked for each iteration. Unlike our implementations, it also returns a ParallelLoopResult value type, which contains details on the completed loop; more on that later. Internally, the For method performs in a manner similar to our previous implementations. By default, it uses work queued to the .NET Framework ThreadPool to execute the loop, and with as much parallelism as it can muster, it invokes the provided delegate once for each iteration. However, Parallel.For and its overload set provide a whole lot more than this: Exception handling. If one iteration of the loop throws an exception, all of the threads participating in the loop attempt to stop processing as soon as possible (by default, iterations currently executing will not be interrupted, but the loop control logic tries to prevent additional iterations from starting). Once all processing has ceased, all unhandled exceptions are gathered and thrown in aggregate in an AggregateException instance. This exception type provides support for multiple inner exceptions, whereas most .NET Framework exception types support only a single inner exception. For more information about AggregateException, see http://msdn.microsoft.com/magazine/ee321571.aspx. Breaking out of a loop early. This is supported in a manner similar to the break keyword in C# and the Exit For construct in Visual Basic. Support is also provided for understanding whether the current iteration should abandon its work because of occurrences in other iterations that will cause the loop to end early. This is the primary reason for the ParallelLoopResult return value, shown in the Parallel.For signature, which helps a caller to understand if a loop ended prematurely, and if so, why. Long ranges. In addition to overloads that support working with Int32-based ranges, overloads are provided for working with Int64-based ranges. Thread-local state. Several overloads provide support for thread-local state. More information on this support will be provided later in this document in the section on aggregation patterns.

Patterns of Parallel Programming

Page 13

Configuration options. Multiple aspects of a loops execution may be controlled, including limiting the number of threads used to process the loop. Nested parallelism. If you use a Parallel.For loop within another Parallel.For loop, they coordinate with each other to share threading resources. Similarly, its ok to use two Parallel.For loops concurrently, as theyll work together to share threading resources in the underlying pool rather than both assuming they own all cores on the machine. Dynamic thread counts. Parallel.For was designed to accommodate workloads that change in complexity over time, such that some portions of the workload may be more compute-bound than others. As such, it may be advantageous to the processing of the loop for the number of threads involved in the processing to change over time, rather than being statically set, as was done in all of our implementations shown earlier. Efficient load balancing. Parallel.For supports load balancing in a very sophisticated manner, much more so than the simple mechanisms shown earlier. It takes into account a large variety of potential workloads and tries to maximize efficiency while minimizing overheads. The partitioning implementation is based on a chunking mechanism where the chunk size increases over time. This helps to ensure quality load balancing when there are only a few iterations, while minimizing overhead when there are many. In addition, it tries to ensure that most of a threads iterations are focused in the same region of the iteration space in order to provide high cache locality.

Parallel.For is applicable to a wide-range of delightfully parallel problems, serving as an implementation of this quintessential pattern. As an example of its application, the parallel programming samples for the .NET Framework 4 (available at http://code.msdn.microsoft.com/ParExtSamples) include a ray tracer. Heres a screenshot:

Ray tracing is fundamentally a delightfully parallel problem. Each individual pixel in the image is generated by firing an imaginary ray of light, examining the color of that ray as it bounces off of and through objects in the scene, and storing the resulting color. Every pixel is thus independent of every other pixel, allowing them all to be processed in parallel. Here are the relevant code snippets from that sample:

Visual Basic
Private Sub RenderSequential(ByVal scene As Scene, ByVal rgb() As Int32) Dim camera = scene.Camera For y = 0 To screenHeight - 1 Dim stride = y * screenWidth For x = 0 To screenWidth - 1 Dim color = TraceRay( New Ray(camera.Pos, GetPoint(x, y, camera)), scene, 0)

Patterns of Parallel Programming

Page 14

rgb(x + stride) = color.ToInt32() Next x Next y End Sub Private Sub RenderParallel(ByVal scene As Scene, ByVal rgb() As Int32) Dim camera = scene.Camera Parallel.For(0, screenHeight, Sub(y) Dim stride = y * screenWidth For x = 0 To screenWidth - 1 Dim color = TraceRay( New Ray(camera.Pos, GetPoint(x, y, camera)), scene, 0) rgb(x + stride) = Color.ToInt32() Next x End Sub) End Sub

Notice that there are very few differences between the sequential and parallel implementation, limited only to changing the C# for and Visual Basic For language constructs into the Parallel.For method call.

PARALLEL.FOREACH
A for loop is a very specialized loop. Its purpose is to iterate through a specific kind of data set, a data set made up of numbers that represent a range. The more generalized concept is iterating through any data set, and constructs for such a pattern exist in C# with the foreach keyword and in Visual Basic with the For Each construct. Consider the following for loop:

Visual Basic
For i As Integer = 0 To 9 ' ... Process i. Next i

Using the Enumerable class from LINQ, we can generate an IEnumerable<int> that represents the same range, and iterate through that range using a foreach:

Visual Basic
For Each i As Integer In Enumerable.Range(0, 10) ' ... Process i. Next i

We can accomplish much more complicated iteration patterns by changing the data returned in the enumerable. Of course, as it is a generalized looping construct, we can use a foreach to iterate through any enumerable data set. This makes it very powerful, and a parallelized implementation is similarly quite powerful in the parallel realm. As with a parallel for, a parallel for each represents a fundamental pattern in parallel programming. Implementing a parallel for each is similar in concept to implementing a parallel for. You need multiple threads to process data in parallel, and you need to partition the data, assigning the partitions to the threads doing the processing. In our dynamically partitioned MyParallelFor implementation, the data set remaining was represented by a single integer that stored the next iteration. In a for each implementation, we can store it as an IEnumerator<T> for the data set. This enumerator must be protected by a critical section so that only one thread at a time may mutate it. Here is an example implementation: Patterns of Parallel Programming Page 15

Visual Basic
public Shared Sub MyParallelForEach(Of T)( ByVal source As IEnumerable(Of T), ByVal body As Action(Of T)) Dim numProcs = Environment.ProcessorCount Dim remainingWorkItems = numProcs Using enumerator = source.GetEnumerator() Using mre As New ManualResetEvent(False) ' Create each of the work items. For p = 0 To numProcs - 1 ' Iterate until there's no more work. ' Get the next item under a lock, ' then process that item. ThreadPool.QueueUserWorkItem(Sub() Do Dim nextItem As T SyncLock enumerator If Not enumerator.MoveNext() Then Exit Do nextItem = enumerator.Current End SyncLock body(nextItem) Loop If Interlocked.Decrement(remainingWorkItems) = 0 Then mre.Set() End Sub) Next p ' Wait for all threads to complete. mre.WaitOne() End Using End Using End Sub

As with the MyParallelFor implementations shown earlier, there are lots of implicit tradeoffs being made in this implementation, and as with the MyParallelFor, they all come down to tradeoffs between simplicity, overheads, and load balancing. Taking locks is expensive, and this implementation is taking and releasing a lock for each element in the enumerable; while costly, this does enable the utmost in load balancing, as every thread only grabs one item at a time, allowing other threads to assist should one thread run into an unexpectedly expensive element. We could tradeoff some cost for some load balancing by retrieving multiple items (rather than just one) while holding the lock. By acquiring the lock, obtaining multiple items from the enumerator, and then releasing the lock, we amortize the cost of acquisition and release over multiple elements, rather than paying the cost for each element. This benefit comes at the expense of less load balancing, since once a thread has grabbed several items, it is responsible for processing all of those items, even if some of them happen to be more expensive than the bulk of the others. We can decrease costs in other ways, as well. For example, the implementation shown previously always uses the enumerators MoveNext/Current support, but it might be the case that the source input IEnumerable<T> also implements the IList<T> interface, in which case the implementation could use less costly partitioning, such as that employed earlier by MyParallelFor:

Visual Basic
Public Shared Sub MyParallelForEach(Of T)( ByVal source As IEnumerable(Of T), ByVal body As Action(Of T)) Dim sourceList = TryCast(source, IList(Of T))

Patterns of Parallel Programming

Page 16

If sourceList IsNot Nothing Then ' This assumes the IList(Of T) implementations indexer is safe ' for concurrent get access. MyParallelFor(0, sourceList.Count, Function(i) body(sourceList(i))) Else ' ... End If End Sub

As with Parallel.For, the .NET Framework 4s Parallel class provides support for this pattern, in the form of the ForEach method. Overloads of ForEach provide support for many of the same things for which overloads of For provide support, including breaking out of loops early, sophisticated partitioning, and thread count dynamism. The simplest overload of ForEach provides a signature almost identical to the signature shown above:

Visual Basic
Public Shared Function ForEach(Of TSource)( ByVal source As IEnumerable(Of TSource), ByVal body As Action(Of TSource)) As ParallelLoopResult End Function

As an example application, consider a Student record that contains a settable GradePointAverage property as well as a readable collection of Test records, each of which has a grade and a weight. We have a set of such student records, and we want to iterate through each, calculating each students grades based on the associated tests . Sequentially, the code looks as follows:

Visual Basic
For Each student In students student.GradePointAverage = student.Tests.Select(Function(test) test.Grade * test.Weight).Sum() Next student

To parallelize this, we take advantage of Parallel.ForEach:

Visual Basic
Parallel.ForEach(students, Sub(student) student.GradePointAverage = student.Tests.Select(Function(test) test.Grade * test.Weight).Sum() End Sub)

PROCESSING NON-INTEGRAL RANGES

The Parallel class in the .NET Framework 4 provides overloads for working with ranges of Int32 and Int64 values. However, for loops in languages like C# and Visual Basic can be used to iterate through non-integral ranges. Consider a type Node<T> that represents a linked list:

Visual Basic
Class Node(Of T) Public Prev As Node(Of T), [Next] As Node(Of T) Public Data As T End Class

Given an instance head of such a Node<T>, we can use a for loop to iterate through the list: Patterns of Parallel Programming Page 17

Visual Basic
Dim i As Node(Of T) = head Do While i IsNot Nothing ' ... Process node i. i = i.Next Loop

Parallel.For does not contain overloads for working with Node<T>, and Node<T> does not implement IEnumerable<T>, preventing its direct usage with Parallel.ForEach. To compensate, we can use C# iterators to create an Iterate method which will yield an IEnumerable<T> to iterate through the Node<T>:

Visual Basic
(Visual Basic does not provide an equivalent to C#s yield keyword.)

C#
public static IEnumerable<Node<T>> Iterate(Node<T> head) { for (Node<T> i = head; i != null; i = i.Next) yield return i; }

With such a method in hand, we can now use a combination of Parallel.ForEach and Iterate to approximate a Parallel.For implementation that does work with Node<T>:

Visual Basic
Parallel.ForEach(Iterate(head), Sub(i) ' ... Process node i. End Sub);

This same technique can be applied to a wide variety of scenarios. Keep in mind, however, that the IEnumerator<T> interface isnt thread-safe, which means that Parallel.ForEach needs to take locks when accessing the data source. While ForEach internally uses some smarts to try to amortize the cost of such locks over the processing, this is still overhead that needs to be overcome by more work in the body of the ForEach in order for good speedups to be achieved. Parallel.ForEach has optimizations used when working on data sources that can be indexed into, such as lists and arrays, and in those cases the need for locking is decreased (this is similar to the example implementation shown previously, where MyParallelForEach was able to use MyParallelFor in processing an IList<T>). Thus, even though there is both time and memory cost associated with creating an array from an enumerable, performance may actually be improved in some cases by transforming the iteration space into a list or an array, which can be done using LINQ. For example:

Visual Basic
Parallel.ForEach(Iterate(head).ToArray(), Sub(i) ' ... Process node i. End Sub);

The format of a for construct in C# and a For in Visual Basic may also be generalized into a generic Iterate method:

Visual Basic
(Visual Basic does not provide an equivalent to C#s yield keyword.)

Patterns of Parallel Programming

Page 18

C#
public static IEnumerable<T> Iterate<T>( Func<T> initialization, Func<T, bool> condition, Func<T, T> update) { for (T i = initialization(); condition(i); i = update(i)) yield return i; }

While incurring extra overheads for all of the delegate invocations, this now also provides a generalized mechanism for iterating. The Node<T> example can be re-implemented as follows:

Visual Basic
Parallel.ForEach(Iterate( Function() head, Function(i) i IsNot Nothing, Function(i) i.Next), Function(i) ' ... Process node i. End Function);

BREAKING OUT OF LOOPS EARLY

Exiting out of loops early is a fairly common pattern, one that doesnt go away when parallelism is introduced . To help simplify these use cases, the Parallel.For and Parallel.ForEach methods support several mechanisms for breaking out of loops early, each of which has different behaviors and targets different requirements.

PLANNED EXIT
Several overloads of Parallel.For and Parallel.ForEach pass a ParallelLoopState instance to the body delegate. Included in this types surface area are four members relevant to this discussion: methods Stop and Break, and properties IsStopped and LowestBreakIteration. When an iteration calls Stop, the loop control logic will attempt to prevent additional iterations of the loop from starting. Once there are no more iterations executing, the loop method will return successfully (that is, without an exception). The return type of Parallel.For and Parallel.ForEach is a ParallelLoopResult value type: if Stop caused the loop to exit early, the results IsCompleted property will return false.

Visual Basic
Dim loopResult = Parallel.For(0, N, Sub(i As Integer, loop As ParallelLoopState) ' ... If someCondition Then loop.Stop() Return End If ' ... End Sub) Console.WriteLine("Ran to completion: " & loopResult.IsCompleted)

For long running iterations, the IsStopped property enables one iteration to detect when another iteration has called Stop in order to bail earlier than it otherwise would:

Patterns of Parallel Programming

Page 19

Visual Basic
Dim loopResult = Parallel.For(0, N, Sub(i As Integer, loop As ParallelLoopState) ' ... If someCondition Then loop.Stop() Return End If ' ... While True If loop.IsStopped Then Return ' ... End While End Sub)

Break is very similar to Stop, except Break provides additional guarantees. Whereas Stop informs the loop control logic that no more iterations need be run, Break informs the control logic that no iterations after the current one need be run (for example, where the iteration number is higher or where the data comes after the current element in the data source), but that iterations prior to the current one still need to be run. It doesnt guarantee that iterations after the current one havent already run or started running, though it will try to avoid more starting after the current one. Break may be called from multiple iterations, and the lowest iteration from which Break was called is the one that takes effect; this iteration number can be retrieved from the ParallelLoopStates LowestBreakIteration property, a nullable value. ParallelLoopResult offers a similar LowestBreakIteration property. This leads to a decision matrix that can be used to interpret a ParallelLoopResult: IsCompleted == true o All iterations were processed. o If IsCompleted == true, LowestBreakIteration.HasValue will be false. IsCompleted == false && LowestBreakIteration.HasValue == false o Stop was used to exit the loop early IsCompleted == false && LowestBreakIteration.HasValue == true o Break was used to exit the loop early, and LowestBreakIteration.Value contains the lowest iteration from which Break was called.

Here is an example of using Break with a loop:

Visual Basic
Dim output = New TResult(N - 1) {} Dim loopResult = Parallel.For(0, N, Sub(i As Integer, [loop] As ParallelLoopState) If someCondition Then [loop].Break() Return End If output(i) = Compute(i) End Sub) Dim completedUpTo = N If (Not loopResult.IsCompleted) AndAlso loopResult.LowestBreakIteration.HasValue Then completedUpTo = loopResult.LowestBreakIteration.Value End If

Patterns of Parallel Programming

Page 20

Stop is typically useful for unordered search scenarios, where the loop is looking for something and can bail as soon as it finds it. Break is typically useful for ordered search scenarios, where all of the data up until some point in the source needs to be processed, with that point based on some search criteria.

UNPLANNED EXIT
The previously mentioned mechanisms for exiting a loop early are based on the body of the loop performing an action to bail out. Sometimes, however, we want an entity external to the loop to be able to request that the loop terminate; this is known as cancellation. Cancellation is supported in parallel loops through the new System.Threading.CancellationToken type introduced in the .NET Framework 4. Overloads of all of the methods on Parallel accept a ParallelOptions instance, and one of the properties on ParallelOptions is a CancellationToken. Simply set this CancellationToken property to the CancellationToken that should be monitored for cancellation, and provide that options instance to the loops invocation. The loop will monitor the token, and if it finds that cancellation has been requested, it will again stop launching more iterations, wait for all existing iterations to complete, and then throw an OperationCanceledException.

Visual Basic
Private _cts As New CancellationTokenSource() ' ... Dim options = New ParallelOptions With {.CancellationToken = _cts.Token} Try Parallel.For(0, N, options, Sub(i) '... End Sub) Catch oce As OperationCanceledException ' ... Handle loop cancellation. End Try

Stop and Break allow a loop itself to proactively exit early and successfully, and cancellation allows an external entity to the loop to request its early termination. Its also possible for something in the loops body to go wrong, resulting in an early termination of the loop that was not expected. In a sequential loop, an unhandled exception thrown out of a loop causes the looping construct to immediately cease. The parallel loops in the .NET Framework 4 get as close to this behavior as is possible while still being reliable and predictable. This means that when an exception is thrown out of an iteration, the Parallel methods attempt to prevent additional iterations from starting, though already started iterations are not forcibly terminated. Once all iterations have ceased, the loop gathers up any exceptions that have been thrown, wraps them in a System.AggregateException, and throws that aggregate out of the loop. As with Stop and Break, for cases where individual operations may run for a long time (and thus may delay the loops exit), it may be advantageous for iterations of a loop to be able to check whether other iterations have faulted. To accommodate that, ParallelLoopState exposes an IsExceptional property (in addition to the aforementioned IsStopped and LowestBreakIteration properties), which indicates whether another iteration has thrown an unhandled exception. Iterations may cooperatively check this property, allowing a long-running iteration to cooperatively exit early when it detects that another iteration failed.

Patterns of Parallel Programming

Page 21

While this exception logic does support exiting out of a loop early, it is not the recommended mechanism for doing so. Rather, it exists to assist in exceptional cases, cases where breaking out early wasnt an intentional part of the algorithm. As is the case with sequential constructs, exceptions should not be relied upon for control flow. Note, too, that this exceptions behavior isnt optional. In the face of unhandled exceptions, theres no way to tell the looping construct to allow the entire loop to complete execution, just as theres no built-in way to do that with a serial for loop. If you wanted that behavior with a serial for loop, youd likely end up writing code like the following:

Visual Basic
Dim exceptions = New Queue(Of Exception)() For i As Integer = 0 To N - 1 Try ' ... Loop body goes here. Catch exc As Exception exceptions.Enqueue(exc) End Try Next i If exceptions.Count > 0 Then Throw New AggregateException(exceptions)

If this is the behavior you desire, that same manual handling is also possible using Parallel.For:

Visual Basic
Dim exceptions = New ConcurrentQueue(Of Exception)() Parallel.For(0, N, Sub(i) Do Try ' ... Loop body goes here. Catch exc As Exception exceptions.Enqueue(exc) End Try End Sub) If Not exceptions.IsEmpty Then Throw New AggregateException(exceptions)

EMPLOYING MULTIPLE EXIT STRATEGIES

Its possible that multiple exit strategies could all be employed together, concurrently; were dealing with parallelism, after all. In such cases, exceptions always win: if unhandled exceptions have occurred, the loop will always propagate those exceptions, regardless of whether Stop or Break was called or whether cancellation was requested. If no exceptions occurred but the CancellationToken was signaled and either Stop or Break was used, theres a potential race as to whether the loop will notice the cancellation prior to exiting. If it does, the loop will exit with an OperationCanceledException. If it doesnt, it will exit due to the Stop/Break as explained previously. However, Stop and Break may not be used together. If the loop detects that one iteration called Stop while another called Break, the invocation of whichever method ended up being invoked second will result in an exception being thrown. This is enforced due to the conflicting guarantees provided by Stop and Break. For long running iterations, there are multiple properties an iteration might want to check to see whether it should bail early: IsStopped, LowestBreakIteration, IsExceptional, and so on. To simplify this, ParallelLoopState also

Patterns of Parallel Programming

Page 22

provides a ShouldExitCurrentIteration property, which consolidates all of those checks in an efficient manner. The loop itself checks this value prior to invoking additional iterations.

PARALLELENUMERABLE.FORALL
Parallel LINQ (PLINQ), exposed from System.Core.dll in the .NET Framework 4, provides a parallelized implementation of all of the .NET Framework standard query operators. This includes Select (projections), Where (filters), OrderBy (sorting), and a host of others. PLINQ also provides several additional operators not present in its serial counterpart. One such operator is AsParallel, which enables parallel processing of a LINQ-to-Objects query. Another such operator is ForAll. Partitioning of data has already been discussed to some extent when discussing Parallel.For and Parallel.ForEach, and merging will be discussed in greater depth later in this document. Suffice it to say, however, that to process an input data set in parallel, portions of that data set must be distributed to each thread partaking in the processing, and when all of the processing is complete, those partitions typically need to be merged back together to form the single output stream expected by the caller:

Visual Basic
Dim inputData As List(Of InputData) =... For Each o In inputData.AsParallel().Select(Function(i) New OutputData(i)) ProcessOutput(o) Next o

Both partitioning and merging incur costs, and in parallel programming, we strive to avoid such costs as theyre pure overhead when compared to a serial implementation. Partitioning cant be avoided if data must be processed in parallel, but in some cases we can avoid merging, such as if the work to be done for each resulting item can be processed in parallel with the work for every other resulting item. To accomplish this, PLINQ provides the ForAll operator, which avoids the merge and executes a delegate for each output element:

Visual Basic
Dim inputData As List(Of InputData) = ... inputData.AsParallel().Select(Function(i) New OutputData(i)).ForAll( Sub(o) ProcessOutput(o))

ANTI-PATTERNS
Superman has his kryptonite. Matter has its anti-matter. And patterns have their anti-patterns. Patterns prescribe good ways to solve certain problems, but that doesnt mean theyre not without potential pitfalls . There are several potential problems to look out for with Parallel.For, Parallel.ForEach, and ParallelEnumerable.ForAll.

SHARED DATA
The new parallelism constructs in the .NET Framework 4 help to alleviate most of the boilerplate code youd otherwise have to write to parallelize delightfully parallel problems. As you saw earlier, the amount of code necessary just to implement a simple and nave MyParallelFor implementation is vexing, and the amount of code required to do it well is reams more. These constructs do not, however, automatically ensure that your code is

Patterns of Parallel Programming

Page 23

thread-safe. Iterations within a parallel loop must be independent, and if theyre not independent, you must ensure that the iterations are safe to execute concurrently with each other by doing the appropriate synchronization.

ITERATION VARIANTS
In managed applications, one of the most common patterns used with a for/For loop is iterating from 0 inclusive to some upper bound (typically exclusive in C# and inclusive in Visual Basic). However, there are several variations on this pattern that, while not nearly as common, are still not rare.

DOWNWARD ITERATION
Its not uncommon to see loops iterating down from an upper-bound exclusive to 0 inclusive:

Visual Basic
For i = upperBound - 1 To 0 Step -1 '... Next i

Such a loop is typically (though not always) constructed due to dependencies between the iterations; after all, if all of the iterations are independent, why write a more complex form of the loop if both the upward and downward iteration have the same results? Parallelizing such a loop is often fraught with peril, due to these likely dependencies between iterations. If there are no dependencies between iterations, the Parallel.For method may be used to iterate from an inclusive lower bound to an exclusive upper bound, as directionality shouldnt matter: in the extreme case of parallelism, on a machine with upperBound number of cores, all iterations of the loop may execute concurrently, and direction is irrelevant. When parallelizing downward-iterating loops, proceed with caution. Downward iteration is often a sign of a less than delightfully parallel problem.

STEPPED ITERATION
Another pattern of a for loop that is less common than the previous cases, but still is not rare, is one involving a step value other than one. A typical for loop may look like this:

Visual Basic
For i = 0 To upperBound - 1 '... Next i

But its also possible for the update statement to increase the iterat ion value by a different amount: for example to iterate through only the even values between the bounds:

Visual Basic
For i = 0 To upperBound - 1 Step 2 '... Next i

Patterns of Parallel Programming

Page 24

Parallel.For does not provide direct support for such patterns. However, Parallel can still be used to implement such patterns. One mechanism for doing so is through an iterator approach like that shown earlier for iterating through linked lists:

Visual Basic
(Visual Basic does not provide an equivalent to C#s yield keyword.)

C#
private static IEnumerable<int> Iterate( int fromInclusive, int toExclusive, int step) { for (int i = fromInclusive; i < toExclusive; i += step) yield return i; }

A Parallel.ForEach loop can now be used to perform the iteration. For example, the previous code snippet for iterating the even values between 0 and upperBound can be coded as:

Visual Basic
Parallel.ForEach(Iterate(0, upperBound, 2), Sub(i) '... End Sub)

As discussed earlier, such an implementation, while straightforward, also incurs the additional costs of forcing the Parallel.ForEach to takes locks while accessing the iterator. This drives up the per-element overhead of parallelization, demanding that more work be performed per element to make up for the increased overhead in order to still achieve parallelization speedups. Another approach is to do the relevant math manually. Here is an implementation of a ParallelForWithStep loop that accepts a step parameter and is built on top of Parallel.For:

Visual Basic
Public Shared Sub ParallelForWithStep( ByVal fromInclusive As Integer, ByVal toExclusive As Integer, ByVal [step] As Integer, ByVal body As Action(Of Integer)) If [step] < 1 Then Throw New ArgumentOutOfRangeException("step") ElseIf [step] = 1 Then Parallel.For(fromInclusive, toExclusive, body) Else ' step > 1 Dim len = CInt(Fix(Math.Ceiling( (toExclusive - fromInclusive) / CDbl([step])))) Parallel.For(0, len, Function(i) body(fromInclusive + (i * [step]))) End If End Sub

This approach is less flexible than the iterator approach, but it also involves significantly less overhead. Threads are not bottlenecked serializing on an enumerator; instead, they need only pay the cost of a small amount of math plus an extra delegate invocation per iteration.

Patterns of Parallel Programming

Page 25

VERY SMALL LOOP BODIES

As previously mentioned, the Parallel class is implemented in a manner so as to provide for quality load balancing while incurring as little overhead as possible. There is still overhead, though. The overhead incurred by Parallel.For is largely centered around two costs: 1) Delegate invocations. If you squint at previous examples of Parallel.For, a call to Parallel.For looks a lot like a C# for loop or a Visual Basic For loop. Dont be fooled: its still a method call. One consequence of this is that the body of the Parallel.For loop is supplied to the method call as a delegate. Invoking a delegate incurs approximately the same amount of cost as a virtual method call. 2) Synchronization between threads for load balancing. While these costs are minimized as much as possible, any amount of load balancing will incur some cost, and the more load balancing employed, the more synchronization is necessary. For medium to large loop bodies, these costs are largely negligible. But as the size of the loops body decreases, the overheads become more noticeable. And for very small bodies, the loop can be completely dominated by this overheads cost. To support parallelization of very small loop bodies requires addressing both #1 and #2 above. One pattern for this involves chunking the input into ranges, and then instead of replacing a sequential loop with a parallel loop, wrapping the sequential loop with a parallel loop. The System.Concurrent.Collections.Partitioner class provides a Create method overload that accepts an integral range and returns an OrderablePartitioner<Tuple<Int32,Int32>> (a variant for Int64 instead of Int32 is also available):

Visual Basic
Public Shared Function Create( ByVal fromInclusive As Long, ByVal toExclusive As Long) As OrderablePartitioner(Of Tuple(Of Long, Long)) End Function

Overloads of Parallel.ForEach accept instances of Partitioner<T> and OrderablePartitioner<T> as sources, allowing you to pass the result of a call to Partitioner.Create into a call to Parallel.ForEach. For now, think of both Partitioner<T> and OrderablePartitioner<T> as an IEnumerable<T>. The Tuple<Int32,Int32> represents a range from an inclusive value to an exclusive value. Consider the following sequential loop:

Visual Basic
For i = from To [to] - 1 ' ... Process i. Next i

We could use a Parallel.For to parallelize it as follows:

Visual Basic
Parallel.For(from, [to], Sub(i) // ... Process i. End Sub);

Patterns of Parallel Programming

Page 26

Or, we could use Parallel.ForEach with a call to Partitioner.Create, wrapping a sequential loop over the range provided in the Tuple<Int32, Int32>, where the inclusiveLowerBound is represented by the tuples Item1 and where the exclusiveUpperBound is represented by the tuples Item2:

Visual Basic
Parallel.ForEach(Partitioner.Create(from, [to]), Sub(range) For i = range.Item1 To range.Item2 - 1 ' ... process i Next End Sub)

While more complex, this affords us the ability to process very small loop bodies by eschewing some of the aforementioned costs. Rather than invoking a delegate for each body invocation, were now amortizing the cost of the delegate invocation across all elements in the chunked range. Additionally, as far as the parallel loop is concerned, there are only a few elements to be processed: each range, rather than each index. This implicitly decreases the cost of synchronization because there are fewer elements to load-balance. While Parallel.For should be considered the best option for parallelizing for loops, if performance measurements show that speedups are not being achieved or that theyre smaller than expected, you can try an approach like the one shown using Parallel.ForEach in conjunction with Partitioner.Create.

TOO FINE-GRAINED, TOO COARSE GRAINED

The previous anti-pattern outlined the difficulties that arise from having loop bodies that are too small. In addition to problems that implicitly result in such small bodies, its also possible to end up in this situation by decomposing the problem to the wrong granularity. Earlier in this section, we demonstrated a simple parallelized ray tracer:

Visual Basic
Private Sub RenderParallel(ByVal scene As Scene, ByVal rgb() As Int32) Dim camera = scene.Camera Parallel.For(0, screenHeight, Sub(y) Dim stride = y * screenWidth For x = 0 To screenWidth - 1 Dim color = TraceRay( New Ray(camera.Pos, GetPoint(x, y, camera)), scene, 0) rgb(x + stride) = color.ToInt32() Next x End Sub) End Sub

Note that there are two loops here, both of which are actually safe to parallelize:

Patterns of Parallel Programming

Page 27

Dim color = TraceRay( New Ray(camera.Pos, GetPoint(x, y, camera)), scene, 0) rgb(x + stride) = color.ToInt32() End Sub) End Sub) End Sub

The question then arises: why and when someone would choose to parallelize one or both of these loops? There are multiple, competing principles. On the one hand, the idea of writing parallelized software that scales to any number of cores you throw at it implies that you should decompose as much as possible, so that regardless of the number of cores available, there will always be enough work to go around. This principle suggests both loops should be parallelized. On the other hand, weve already seen the performance implications that can result if theres not enough work inside of a parallel loop to warrant its parallelization , implying that only the outer loop should be parallelized in order to maintain a meaty body. The answer is that the best balance is found through performance testing. If the overheads of parallelization are minimal as compared to the work being done, parallelize as much as possible: in this case, that would mean parallelizing both loops. If the overheads of parallelizing the inner loop would degrade performance on most systems, think twice before doing so, as itll likely be best only to parallelize the outer loop. There are of course some caveats to this (in parallel programming, there are caveats to everything; there are caveats to the caveats). Parallelization of only the outer loop demands that the outer loop has enough work to saturate enough processors. In our ray tracer example, what if the image being ray traced was very wide and short, such that it had a small height? In such a case, there may only be a few iterations for the outer loop to parallelize, resulting in too coarse-grained parallelization, in which case parallelizing the inner loop could actually be beneficial, even if the overheads of parallelizing the inner loop would otherwise not warrant its parallelization. Another option to consider in such cases is flattening the loops, such that you end up with one loop instead of two. This eliminates the cost of extra partitions and merges that would be incurred on t he inner loops parallelization:

Visual Basic
Private Sub RenderParallel(ByVal scene As Scene, ByVal rgb() As Int32) Dim totalPixels = screenHeight * screenWidth Dim camera = scene.Camera Parallel.For(0, totalPixels, Sub(i) Dim y = i / screenWidth, x As Integer = i Mod screenWidth Dim color = TraceRay( New Ray(camera.Pos, GetPoint(x, y, camera)), scene, 0) rgb(i) = color.ToInt32() End Sub) End Sub

If in doing such flattening the body of the loop becomes too small (which given the cost of TraceRay in this example is unlikely), the pattern presented earlier for very small loop bodies may also be employed:

Visual Basic
Private Sub RenderParallel(ByVal scene As Scene, ByVal rgb() As Int32) Dim totalPixels = screenHeight * screenWidth Dim camera = scene.Camera Parallel.ForEach(Partitioner.Create(0, totalPixels), Sub(range) For i = range.Item1 To range.Item2 - 1 Dim y = i / screenWidth, x As Integer = i Mod screenWidth

Patterns of Parallel Programming

Page 28

Dim color = TraceRay( New Ray(camera.Pos, GetPoint(x, y, camera)), scene, 0) rgb(i) = color.ToInt32() Next i End Sub) End Sub

NON-THREAD-SAFE ILIST<T> IMPLEMENTATIONS

Both PLINQ and Parallel.ForEach query their data sources for several interface implementations. Accessing an IEnumerable<T> incurs significant cost, due to needing to lock on the enumerator and make virtual methods calls to MoveNext and Current for each element. In contrast, getting an element from an IList<T> can be done without locks, as elements of an IList<T> are independent. Thus, both PLINQ and Parallel.ForEach automatically use a sources IList<T> implementation if one is available. In most cases, this is the right decision. However, in very rare cases, an implementation of IList<T> may not be thread-safe for reading due to the get accessor for the lists indexer mutating shared state . There are two predominant reasons why an implementation might do this: 1. The data structures stores data in a non-indexible manner, such that it must traverse the data structure to find the requested index. In such a case, the data structure may try to amortize the cost of access by keeping track of the last element accessed, assuming that accesses will occur in a largely sequential manner, making it cheaper to start a search from the previously accessed element than starting from scratch. Consider a theoretical linked list implementation as an example. A linked list does not typically nd support direct indexing; rather, if you want to access the 42 element of the list, you need to start at the beginning, prior to the head, and move to the next element 42 times. As an optimization, the list could maintain a reference to the most recently accessed element. If you accessed element 42 and then nd element 43, upon accessing 42 the list would cache a reference to the 42 element, thus making access to 43 a single move next rather than 43 of them from the beginning. If the implementation doesnt take thread-safety into account, these mutations are likely not thread-safe. Loading the data structure is expensive. In such cases, the data can be lazy-loaded (loaded on first access) to defer or avoid some of the initialization costs. If getting data from the list forces initialization, then mutations could occur due to indexing into the list. There are only a few, obscure occurrences of this in the .NET Framework. One example is System.Data.Linq.EntitySet<TEntity>. This type implements IList<TEntity> with support for lazy loading, such that the first thing its indexers get accessor does is load the data into the EntitySet<TEntity> if loading hasnt already occurred. To work around such cases if you do come across them, you can force both PLINQ and Parallel.ForEach to use the IEnumerable<T> implementation rather than the IList<T> implementation. This can be achieved in two ways: 1) Use System.Collections.Concurrent.Partitioners Create method . There is an overload specific to IEnumerable<T> that will ensure this interface implementation (not one for IList<T>) is used. Partitioner.Create returns an instance of a Partitioner<T>, for which there are overloads on Parallel.ForEach and in PLINQ. Patterns of Parallel Programming Page 29

Visual Basic
' Will use IList(Of T) implementation if source implements it. Dim source as IEnumerable(Of T) = ... Parallel.ForEach(source, Sub(item) ' ... End Sub) ' Will use sources IEnumerable<T> implementation. Dim source as IEnumerable(Of T) = ... Parallel.ForEach(Partitioner.Create(source), Sub(item) ' ... End Sub)

2) Append onto the data source a call to Enumerable.Select. The Select simply serves to prevent PLINQ and Parallel.ForEach from finding the original sources IList<T> implementation.

Visual Basic
' Will use IList(Of T) implementation if source implements it. Dim source as IEnumerable(Of T) = ... Parallel.ForEach(source, Sub(item) ' ... End Sub) ' Will use sources IEnumerable<T> implementation. Dim source as IEnumerable(Of T) = ... Parallel.ForEach(source.Select(Function(t) t), Sub(item) ' ... End Sub)

PARALLEL.FOREACH OVER A PARALLELQUERY<T>

PLINQs ParallelEnumerable type operates in terms of ParallelQuery<T> objects. Such objects are returned from the AsParallel extension method, and all of PLINQs operators consume and generate instances of ParallelQuery<T>. ParallelQuery<T> is itself an IEnumerable<T>, which means it can be iterated over and may be consumed by anything that understands how to work with an IEnumerable<T>. Parallel.ForEach is one such construct that works with IEnumerable<T>. As such, it may be tempting to write code that follows a pattern similar to the following:

Visual Basic
Dim q = From d In data.AsParallel() ... Select d Parallel.ForEach(q, Sub(item) ' ... Process item. End Sub)

While this works correctly, it incurs unnecessary costs. In order for PLINQ to stream its output data into an IEnumerable<T>, PLINQ must merge the data being generated by all of the threads involved in query processing so that the multiple sets of data can be consumed by code expecting only one. Conversely, when accepting an input IEnumerable<T>, Parallel.ForEach must consume the single data stream and partition it into multiple data streams for processing in parallel. Thus, by passing a ParallelQuery<T> to a Parallel.ForEach, in the .NET Framework 4 the data from the PLINQ query will be merged and will then be repartitioned by the Parallel.ForEach. This can be costly. Patterns of Parallel Programming Page 30

PLINQ Query Partition Partition .. Partition IEnumerable<T>

Parallel.ForEach Partition Partition .. Partition

Instead, PLINQs ParallelEnumerable.ForAll method should be used. Rewriting the previous code as follows will avoid the spurious merge and repartition:

Visual Basic
Dim q = From d In data.AsParallel() ... Select d q.ForAll(Sub(item) ' ... Process item. End Sub)

This allows the output of all partitions to be processed in parallel, as discussed in the previous section on ParallelEnumerable.ForAll.

PLINQ Query Partition Partition .. Partition Action<TSource> Action<TSource> Action<TSource> Action<TSource>

THREAD AFFINITY IN ACCESSING SOURCE DATA

Both Parallel.ForEach and ParallelEnumerable.ForAll rely on each of the threads participating in the loop to pull data from the source enumerator. While both ForEach and ForAll ensure that the enumerator is accessed in a thread-safe manner (only one thread at a time will use MoveNext and Current, and will do so atomically with respect to other threads in the loop), its still the case that multiple threads may use MoveNext over time. In general, this shouldnt be a problem. However, in some rare cases the implementation of MoveNext may have thread affinity, meaning that for correctness purposes it should always be accessed from the same thread, and perhaps even from a specific thread. An example of this could be if MoveNext were accessing a user interface (UI) control in Windows Forms or Windows Presentation Foundation in order to retrieve its data, or if the control were pulling data from the object model of one of the Microsoft Office applications. While such thread affinity is not recommended, avoiding it may not be possible.

Patterns of Parallel Programming

Page 31

In such cases, the consuming implementation needs to change to ensure that the data source is only accessed by the thread making the call to the loop. That can be achieved with a producer/consumer pattern (many more details on that pattern are provided later in this document), using code similar in style to the following:

Visual Basic
Shared Sub ForEachWithEnumerationOnMainThread(Of T)( ByVal source As IEnumerable(Of T), ByVal body As Action(Of T)) Dim collectedData = New BlockingCollection(Of T)() Dim [loop] = Task.Factory.StartNew(Sub() Parallel.ForEach(collectedData.GetConsumingEnumerable(), body)) Try For Each item In source collectedData.Add(item) Next item Finally collectedData.CompleteAdding() End Try [loop].Wait() End Sub

The Parallel.ForEach executes in the background by pulling the data from a shared collection that is populated by the main thread enumerating the data source and copying its contents into the shared collection. This solves the issue of thread affinity with the data source by ensuring that the data source is only accessed on the main thread. If, however, all access to the individual elements must also be done only on the main thread, parallelization is infeasible.

PARALLEL LOOPS FOR I/O-BOUND WORKLOADS IN SCALABLE APPLICATIONS

It can be extremely tempting to utilize the delightfully parallel looping constructs in the .NET Framework 4 for I/Obound workloads. And in many cases, its quite reasonable to do so as a quick-and-easy approach to getting up and running with better performance. Consider the need to ping a set of machines. We can do this quite easily using the System.Net.NetworkInformation.Ping class, along with LINQ:

Visual Basic
Dim addrs = { addr1, addr2,..., addrN } Dim pings = From addr In addrs Select New Ping().Send(addr) For Each ping In pings Console.WriteLine("{0}: {1}", ping.Status, ping.Address) Next ping

By adding just a few characters, we can easily parallelize this operation using PLINQ:

Visual Basic
Dim pings = From addr In addrs.AsParallel() Select New Ping().Send(addr) For Each ping In pings Console.WriteLine("{0}: {1}", ping.Status, ping.Address) Next ping

Patterns of Parallel Programming

Page 32

Rather than using a single thread to ping these machines one after the other, this code uses multiple threads to do so, typically greatly decreasing the time it takes to complete the operation. Of course, in this case, the work Im doing is not at all CPU-bound, and yet by default PLINQ uses a number of threads equal to the number of logical processors, an appropriate heuristic for CPU-bound workloads but not for I/O-bound. As such, we can utilize PLINQs WithDegreeOfParallelism method to get the work done even faster by using more threads (assuming there are enough addresses being pinged to make good use of all of these threads):

Visual Basic
Dim pings = From addr In addrs.AsParallel()().WithDegreeOfParallelism(16) Select New Ping().Send(addr) For Each ping In pings Console.WriteLine("{0}: {1}", ping.Status, ping.Address) Next ping

For a client application on a desktop machine doing just this one operation, using threads in this manner typically does not lead to any significant problems. However, if this code were running in an ASP.NET application, it could be deadly to the system. Threads have a non-negligible cost, a cost measurable in both the memory required for their associated data structures and stack space, and in the extra impact it places on the operating system and its scheduler. When threads are doing real work, this cost is justified. But when threads are simply sitting around blocked waiting for an I/O operation to complete, theyre dead weight . Especially in Web applications, where thousands of users may be bombarding the system with requests, that extra and unnecessary weight can bring a server to a crawl. For applications where scalability in terms of concurrent users is at a premium, its imperative not to write code like that shown above, even though its really simple to write . There are other solutions, however. WithDegreeOfParallelism changes the number of threads required to execute and complete the PLINQ query, but it does not force that number of threads into existence. If the number is larger than the number of threads available in the ThreadPool, it may take some time for the ThreadPool thread-injection logic to inject enough threads to complete the processing of the query. To force it to get there faster, you can employ the ThreadPool.SetMinThreads method. The System.Threading.Tasks.Task class will be discussed later in this document. In short, however, note that a Task instance represents an asynchronous operation. Typically these are computationally-intensive operations, but the Task abstraction can also be used to represent I/O-bound operations and without tying up a thread in the process. As an example of this, the samples available at http://code.msdn.microsoft.com/ParExtSamples include extension methods for the Ping class that provide asynchronous versions of the Send method to return a Task<PingReply>. Using such methods, we can rewrite our previous method as follows:

Visual Basic
Dim pings = (From addr In addrs Select New Ping().SendTask(addr, Nothing)).ToArray() Task.WaitAll(pings) For Each ping As Task(Of PingReply) In pings Console.WriteLine("{0}: {1}", ping.Result.Status, ping.Result.Address) Next ping

This new solution will asynchronously send a ping to all of the addresses, but no threads (other than the main thread waiting on the results) will be blocked in the process; only when the pings complete will threads be utilized Patterns of Parallel Programming Page 33

briefly to process the results, the actual computational work. This results in a much more scalable solution, one that may be used in applications that demand scalability. Note, too, that taking advantage of Task.Factory.ContinueWhenAll (to be discussed later), the code can even avoid blocking the main iteration thread, as illustrated in the following example:

Visual Basic
Dim pings = (From addr In addrs Select New Ping().SendTask(addr, Nothing)).ToArray() Task.Factory.ContinueWhenAll(pings, Sub(t) Task.WaitAll(pings) For Each ping As Task(Of PingReply) In pings Console.WriteLine("{0}: {1}", ping.Result.Status, ping.Result.Address); Next ping End Sub)

The example here was shown utilizing the Ping class, which implements the Event-based Asynchronous Pattern (EAP). This pattern for asynchronous operation was introduced in the .NET Framework 2.0, and is based on .NET events that are raised asynchronously when an operation completes. A more prevalent pattern throughout the .NET Framework is the Asynchronous Programming Model (APM) pattern, which has existed in the .NET Framework since its inception. Sometimes referred to as the begin/end pattern, this pattern is based on a pair of methods: a begin method that starts the asynchronous operation, and an end method that joins with it, retrieving any results of the invocation or the exception from the operation . To help integrate with this pattern, the aforementioned Task class can also be used to wrap an APM invocation, which can again help with the scalability, utilizing the Task.Factory.FromAsync method. This support can then be used to build an approximation of asynchronous methods, as is done in the Task.Factory.Iterate extension method available in the samples at samples available at http://code.msdn.microsoft.com/ParExtSamples. For more information, see http://blogs.msdn.com/pfxteam/9809774.aspx. Through its asynchronous workflow functionality, F# in Visual Studio 2010 also provides first-class language support for writing asynchronous methods. For more information, see http://msdn.microsoft.com/en-us/library/dd233182(VS.100).aspx. The incubation language Axum, available for download at http://msdn.microsoft.com/en-us/devlabs/dd795202.aspx, also includes firstclass language support for writing asynchronous methods.

Patterns of Parallel Programming

Page 34

FORK/JOIN
The patterns employed for delightfully parallel loops are really a subset of a larger set of patterns centered around fork/join. In fork/join patterns, work is forked such that several pieces of work are launched asynchronously . That forked work is later joined with in order to ensure that all of the processing has completed, and potentially to retrieve the results of that processing if it wasnt utilized entirely for side-effecting behavior. Loops are a prime example of this: we fork the processing of loop iterations, and we join such that the parallel loop invocation only completes when all concurrent processing is done. The new System.Threading.Tasks namespace in the .NET Framework 4 contains a significant wealth of support for fork/join patterns. In addition to the Parallel.For, Parallel.ForEach, and PLINQ constructs already discussed, the .NET Framework provides the Parallel.Invoke method, as well as the new Task and Task<TResult> types. The new System.Threading.CountdownEvent type also helps with fork/join patterns, in particular for when dealing with concurrent programming models that dont provide built-in support for joins.

COUNTING DOWN
A primary component of fork/join pattern implementations is keeping track of how much still remains to be completed. We saw this in our earlier MyParallelFor and MyParallelForEach implementations, with the loop storing a count for the number of work items that still remained to be completed, and a ManualResetEvent that would be signaled when this count reached 0. Support for this pattern is codified into the new System.Threading.CountdownEvent type in the .NET Framework 4. Below is a code snippet from earlier for implementing the sample MyParallelFor, now modified to use CountdownEvent.

Visual Basic
Shared Sub MyParallelFor( ByVal fromInclusive As Integer, ByVal toExclusive As Integer, ByVal body As Action(Of Integer)) Dim numProcs As Integer = Environment.ProcessorCount Dim nextIteration As Integer = fromInclusive Using ce As New CountdownEvent(numProcs) For p As Integer = 0 To numProcs - 1 ThreadPool.QueueUserWorkItem(Sub() Dim index As Integer index = Interlocked.Increment(nextIteration) - 1 Do While index < toExclusive body(index) index = Interlocked.Increment(nextIteration) - 1 Loop ce.Signal() End Sub) Next p ce.Wait() End Using End Sub

Using CountdownEvent frees us from having to manage a count manually. Instead, the event is initialized with the expected number of signals, each thread signals the event when the thread completes its processing, and the main thread waits on the event for all signals to be received.

Patterns of Parallel Programming

Page 35

COUNTING UP AND DOWN

Counting down is often employed in parallel patterns, but so is incorporating some amount of counting up. If the remaining count represents the number of work items to be completed, and we end up adding more work items after setting the initial count, the count will need to be increased. Here is an example of implementing a MyParallelForEach that launches one asynchronous work item per element to be processed. Since we dont know ahead of time how many elements there will be, we add a count of 1 for each element before launching it, and when the work item completes we signal the event.

Visual Basic
Shared Sub MyParallelForEach(Of T)( ByVal source As IEnumerable(Of T), ByVal body As Action(Of T)) Using ce As New CountdownEvent(1) For Each item In source ce.AddCount(1) ThreadPool.QueueUserWorkItem(Sub(state) Try body(CType(state, T)) Finally ce.Signal() End Try End Sub, item) Next item ce.Signal() ce.Wait() End Using End Sub

Note that the event is initialized with a count of 1. This is a common pattern in these scenarios, as we need to ensure that the event isnt set prior to all work items completing . If the count instead started at 0, and the first work item started and completed prior to our adding count for additional elements, the CountdownEvent would transition to a set state prematurely. By initializing the count to 1, we ensure that the event has no chance of reaching 0 until we remove that initial count, which is done in the above example by calling Signal after all elements have been queued.

PARALLEL.INVOKE
As shown previously, the Parallel class provides support for delightfully parallel loops through the Parallel.For and Parallel.ForEach methods. Parallel also provides support for patterns based on parallelized regions of code, where every statement in a region may be executed concurrently. This support, provided through the Parallel.Invoke method, enables a developer to easily specify multiple statements that should execute in parallel, and as with Parallel.For and Parallel.ForEach, Parallel.Invoke takes care of issues such as exception handling, synchronous invocation, scheduling, and the like:

Visual Basic
Parallel.Invoke( Sub() ComputeMean(), Sub() ComputeMedian() Sub() ComputeMode())

Patterns of Parallel Programming

Page 36

Invoke itself follows patterns internally meant to help alleviate overhead. As an example, if you specify only a few delegates to be executed in parallel, Invoke will likely spin up one Task per element. However, if you specify many delegates, or if you specify ParallelOptions for how those delegates should be invoked, Invoke will likely instead choose to execute its work in a different manner. Looking at the signature for Invoke, we can see how this might happen:

Visual Basic
Shared Sub Invoke(ByVal ParamArray actions As Action())

Invoke is supplied with an array of delegates, and it needs to perform an action for each one, potentially in parallel. That sounds like a pattern to which ForEach can be applied, doesnt it? In fact, we could implement a MyParallelInvoke using the MyParallelForEach we previously coded:

Visual Basic
Shared Sub MyParallelInvoke(ByVal ParamArray actions As Action()) MyParallelForEach(actions, Sub(action) action() End Sub) End Sub

We could even use MyParallelFor:

Visual Basic
Shared Sub MyParallelInvoke(ByVal ParamArray actions As Action()) MyParallelFor(0, actions.Length, Sub(i) action(i)() End Sub) End Sub

This is very similar to the type of operation Parallel.Invoke will perform when provided with enough delegates. The overhead of a parallel loop is more than that of a few tasks, and thus when running only a few delegates, it makes sense for Invoke to simply use one task per element. But after a certain threshold, its more efficient to use a parallel loop to execute all of the actions, as the cost of the loop is amortized across all of the delegate invocations.

ONE TASK PER ELEMENT

Parallel.Invoke represents a prototypical example of the fork/join pattern. Multiple operations are launched in parallel and then joined with such that only when theyre all complete will the entire operation be considered complete. If we think of each individual delegate invocation from Invoke as being its own asynchronous operation, we can use a pattern of applying one task per element, where in this case the element is the delegate:

Visual Basic
Shared Sub MyParallelInvoke(ByVal ParamArray actions As Action()) Dim tasks = New Task(actions.Length - 1) {} For i = 0 To actions.Length - 1 tasks(i) = Task.Factory.StartNew(actions(i)) Next i Task.WaitAll(tasks) End Sub

Patterns of Parallel Programming

Page 37

This same pattern can be applied for variations, such as wanting to invoke in parallel a set of functions that return values, with the MyParallelInvoke method returning an array of all of the results. Here are several different ways that could be implemented, based on the patterns shown thus far (do note these implementations each have subtle differences in semantics, particularly with regards to what happens when an individual function fails with an exception):

Visual Basic
Approach #1: One Task per element Shared Function MyParallelInvoke(Of T)( ByVal ParamArray functions() As Func(Of T)) As T() Dim tasks = (From function In functions Select Task.Factory.StartNew(function)).ToArray() Task.WaitAll(tasks) Return tasks.Select(Function(t) t.Result).ToArray() End Function Approach #2: One Task per element, using parent/child Relationships Shared Function MyParallelInvoke(Of T)( ByVal ParamArray functions() As Func(Of T)) As T() Dim results = New T(functions.Length - 1) {} Task.Factory.StartNew(Sub() For i = 0 To functions.Length 1 Dim cur = i Task.Factory.StartNew( Sub() results(cur) = functions(cur)(), TaskCreationOptions.AttachedToParent) Next i End Sub).Wait() Return results End Function Approach #3: Using Parallel.For Shared Function MyParallelInvoke(Of T)( ByVal ParamArray functions() As Func(Of T)) As T() Dim results = New T(functions.Length - 1) {} Parallel.For(0, functions.Length, Sub(i) results(i) = functions(i)() End Sub) Return results End Function Approach #4: Using PLINQ Shared Function MyParallelInvoke(Of T)( ByVal ParamArray functions() As Func(Of T)) As T() Return functions.AsParallel().Select(Function(f) f()).ToArray(); End Function

As with the Action-based MyParallelInvoke, for just a handful of delegates the first approach is likely the most efficient. Once the number of delegates increases to a plentiful amount, however, the latter approaches of using Parallel.For or PLINQ are likely more efficient. They also allow you to easily take advantage of additional functionality built into the Parallel and PLINQ APIs. For example, placing a limit on the degree of parallelism employed with tasks directly requires a fair amount of additional code. Doing the same with either Parallel or PLINQ requires only minimal additions. For example, if I want to use at most two threads to run the operations, I can do the following:

Patterns of Parallel Programming

Page 38

Visual Basic
Shared Function MyParallelInvoke(Of T)( ByVal ParamArray functions() As Func(Of T)) As T() Dim results = New T(functions.Length - 1) {} Dim options = New ParallelOptions With {.MaxDegreeOfParallelism = 2} Parallel.For(0, functions.Length, options, Sub(i) results(i) = functions(i)()) End Sub) Return results End Function

For fork/join operations, the pattern of creating one task per element may be particularly useful in the following situations: 1) Additional work may be started only when specific subsets of the original elements have completed processing. As an example, in the Strassens matrix multiplication algorithm, two matrices are multiplied by splitting each of the matrices into four quadrants. Seven intermediary matrices are generated based on operations on the eight input submatrices. Four output submatrices that make up the larger output matrix are computed from the intermediary seven. These four output matrices each only require a subset of the previous seven, so while its correct to wait for all of the seven prior to computing the following four, some potential for parallelization is lost as a result. 2) All elements should be given the chance to run even if one invocation fails. With solutions based on Parallel and PLINQ, the looping and query constructs will attempt to stop executing as soon as an exception is encountered; this can be solved using manual exception handling with the loop, as demonstrated earlier, however by using Tasks, each operation is treated independently, and such custom code isnt needed.

RECURSIVE DECOMPOSITION
One of the more common fork/join patterns deals with forks that themselves fork and join. This recursive nature is known as recursive decomposition, and it applies to parallelism just as it applies to serial recursive implementations. Consider a Tree<T> binary tree data structure:

Visual Basic
Friend Class Tree(Of T) Public Data As T Public Left As Tree(Of T), Right As Tree(Of T) End Class

A tree walk function that executes an action for each node in the tree might look like the following:

Visual Basic
Shared Sub Walk(Of T)(ByVal root As Tree(Of T), ByVal action As Action(Of T)) If root Is Nothing Then Return action(root.Data) Walk(root.Left, action) Walk(root.Right, action) End Sub

Patterns of Parallel Programming

Page 39

Parallelizing this may be accomplished by fork/joining on at least the two recursive calls, if not also on the action invocation:

Visual Basic
Shared Sub Walk(Of T)(ByVal root As Tree(Of T), ByVal action As Action(Of T)) If root Is Nothing Then Return Parallel.Invoke( Sub() action(root.Data) Sub() Walk(root.Left, action) Sub() Walk(root.Right, action)) End Sub

The recursive calls to Walk themselves fork/join as well, leading to a logical tree of parallel invocations. This can of course also be done using Task objects directly:

Visual Basic
Shared Sub Walk(Of T)(ByVal root As Tree(Of T), ByVal action As Action(Of T)) If root Is Nothing Then Return Dim t1 = Task.Factory.StartNew(Sub() action(root.Data)) Dim t2 = Task.Factory.StartNew(Sub() Walk(root.Left, action)) Dim t3 = Task.Factory.StartNew(Sub() Walk(root.Right, action)) Task.WaitAll(t1, t2, t3) End Sub

We can see all of these Tasks in Visual Studio using the Parallel Tasks debugger window, as shown in the following screenshot:

We can further take advantage of parent/child relationships in order to see the associations between these Tasks in the debugger. First, we can modify our code by forcing all tasks to be attached to a parent, which will be the Task currently executing when the child is created. This is done with the TaskCreationOptions.AttachedToParent option:

Visual Basic
Shared Sub Walk(Of T)(ByVal root As Tree(Of T), ByVal action As Action(Of T)) If root Is Nothing Then Return Dim t1 = Task.Factory.StartNew( Sub() action(root.Data), TaskCreationOptions.AttachedToParent) Dim t2 = Task.Factory.StartNew(

Patterns of Parallel Programming

Page 40

Sub() Walk(root.Left, action), TaskCreationOptions.AttachedToParent) Dim t3 = Task.Factory.StartNew( Sub() Walk(root.Right, action), TaskCreationOptions.AttachedToParent) Task.WaitAll(t1, t2, t3) End Sub

Re-running the application, we can now see the following parent/child hierarchy in the debugger:

CONTINUATION CHAINING
The previous example of walking a tree utilizes blocking semantics, meaning that a particular level wont complete until its children have completed. Parallel.Invoke, and the Task Wait functionality on which its based, attempt whats known as inlining, where rather than simply blocking waiting for another thread to execute a Task, the waiter may be able to run the waitee on the current thread, thereby improving resource reuse, and improving performance as a result. Still, there may be some cases where tasks are not inlinable, or where the style of development is better suited towards a more asynchronous model. In such cases, task completions can be chained. As an example of this, well revisit the Walk method. Rather than returning void, the Walk method can return a Task. That Task can represent the completion of all child tasks. There are two primary ways to accomplish this. One way is to take advantage of Task parent/child relationships briefly mentioned previously. With parent/child relationships, a parent task wont be considered completed until all o f its children have completed.

Visual Basic
Shared Function Walk(Of T)( ByVal root As Tree(Of T), ByVal action As Action(Of T)) As Task Return Task.Factory.StartNew(Sub() If root Is Nothing Then Return Walk(root.Left, action) Walk(root.Right, action) action(root.Data) End Sub, TaskCreationOptions.AttachedToParent) End Function

Every call to Walk creates a new Task thats attached to its parent and immediately returns that Task. That Task, when executed, recursively calls Walk (thus creating Tasks for the children) and executes the relevant action. At

Patterns of Parallel Programming

Page 41

the root level, the initial call to Walk will return a Task that represents the entire tree of processing and that wont complete until the entire tree has completed. Another approach is to take advantage of continuations:

Visual Basic
Shared Function Walk(Of T)( ByVal root As Tree(Of T), ByVal action As Action(Of T)) As Task If root Is Nothing Then Return _completedTask Dim t1 As Task = Task.Factory.StartNew(Sub() action(root.Data)) Dim t2 As Task(Of Task) = Task.Factory.StartNew( Function() Walk(root.Left, action)) Dim t3 As Task(Of Task) = Task.Factory.StartNew( Function() Walk(root.Right, action)) Return Task.Factory.ContinueWhenAll(New Task() {t1, t2.Unwrap(), t3.Unwrap()}, Sub(tasks) Task.WaitAll(tasks)) End Function

As weve previously seen, this code uses a task to represent each of the three operations to be performed at each node: invoking the action for the node, walking the left side of the tree, and walking the right side of the tree. However, we now have a predicament, in that the Task returned for walking each side of the tree is actually a Task<Task> rather than simply a Task. This means that the result will be signaled as completed when the Walk call has returned, but not necessarily when the Task it returned has completed. To handle this, we can take advantage of the Unwrap method, which converts a Task<Task> into a Task, by unwrapping the internal Task into a top level Task that represents it (another overload of Unwrap handles unwrapping a Task<Task<TResult>> into a Task<TResult>). Now with our three tasks, we can employ the ContinueWhenAll method to create and return a Task that represents the total completion of this node and all of its descendants. In order to ensure exceptions are propagated correctly, the body of that continuation explicitly waits on all of the tasks; it knows theyre completed by this point, so this is simply to utilize the exception propagation logic in WaitAll. The parent-based approach has several advantages, including that the Visual Studio 2010 Parallel Tasks toolwindow can highlight the parent/child relationship involved, showing the task hierarchy graphically during a debugging session, and exception handling is simplified, as all exceptions will bubble up to the root parent. However, the continuation approach may have a memory benefit for deep hierarchies or longchains of tasks, since with the parent/child relationships, running children prevent the parent nodes from being garbage collected. To simplify this, you can consider codifying this into an extension method for easier implementation:

Visual Basic
<Extension()> Shared Function ContinueWhenAll( ByVal factory As TaskFactory, ByVal ParamArray tasks() As Task) As Task Return factory.ContinueWhenAll(tasks, Sub(completed) Task.WaitAll(completed)) End Function

With that extension method in place, the previous snippet may be rewritten as:

Patterns of Parallel Programming

Page 42

Visual Basic
Shared Function Walk(Of T)( ByVal root As Tree(Of T), ByVal action As Action(Of T)) As Task If root Is Nothing Then Return _completedTask Dim t1 = Task.Factory.StartNew(Sub() action(root.Data)) Dim t2 = Task.Factory.StartNew(Function() Walk(root.Left, action)) Dim t3 = Task.Factory.StartNew(Function() Walk(root.Right, action)) Return Task.Factory.ContinueWhenAll(t1, t2.Unwrap(), t3.Unwrap()) End Function

One additional thing to notice is the _completedTask returned if the root node is null. Both WaitAll and ContinueWhenAll will throw an exception if the array of tasks passed to them contains a null element. There are several ways to work around this, one of which is to ensure that a null element is never provided. To do that, we can return a valid Task from Walk even if there is no node to be processed. Such a Task should be already completed so that little additional overhead is incurred. To accomplish this, we can create a single Task using a TaskCompletionSource<TResult>, resolve the Task into a completed state, and cache it for all code that needs a completed Task to use:

Visual Basic
Private Shared _completedTask As Task = (Function() Dim tcs = New TaskCompletionSource(Of Object)() tcs.SetResult(Nothing) Return tcs.Task End Function)()

ANTI-PATTERNS FALSE SHARING

Data access patterns are important for serial applications, and theyre even more important for parallel applications. One serious performance issue that can arise in parallel applications occurs where unexpected sharing happens at the hardware level. For performance reasons, memory systems use groups called cache lines, typically of 64 bytes or 128 bytes. A cache line, rather than an individual byte, is moved around the system as a unit, a classic example of chunky instead of chatty communication. If multiple cores attempt to access two different bytes on the same cache line, theres no correctness sharing conflict, but only one will be able to have exc lusive access to the cache line at the hardware level, thus introducing the equivalent of a lock at the hardware level that wasnt otherwise present in the code. This can lead to unforeseen and serious performance problems. As an example, consider the following method, which uses a Parallel.Invoke to initialize two arrays to random values:

Visual Basic
Private Sub WithFalseSharing() Dim rand1 As New Random(), rand2 As New Random() Dim results1 = New Integer(19999999) {}, results2 = New Integer(19999999) {} Parallel.Invoke(Sub() For i = 0 To results1.Length - 1

Patterns of Parallel Programming

Page 43

results1(i) = rand1.[Next]() Next End Sub, Sub() For i = 0 To results2.Length - 1 results2(i) = rand2.[Next]() Next End Sub) End Sub

The code initializes two distinct System.Random instances and two distinct arrays, such that each thread involved in the parallelization touches its own non-shared state. However, due to the way these two Random instances were allocated, theyre likely on the same cache line in memory . Since every call to Next modifies the Random instances internal state, multiple threads will now be contending for the same cache line, leading to seriously impacted performance. Heres a version that addresses the issue:

Visual Basic
Private Sub WithoutFalseSharing() Dim results1(), results2() As Integer Parallel.Invoke(Sub() Dim rand1 As New Random() results1 = New Integer(19999999) {} For i = 0 To results1.Length - 1 results1(i) = rand1.Next() Next i End Sub, Sub() Dim rand2 As New Random() results2 = New Integer(19999999) {} For i = 0 To results2.Length - 1 results2(i) = rand2.Next() Next i End Sub) End Sub

On my dual-core system, when comparing the performance of these two methods, the version with false sharing typically ends up running slower than the serial equivalent, whereas the version without false sharing typically ends up running almost twice as fast as the serial equivalent. False sharing is a likely source for investigation if you find that parallelized code operating with minimal synchronization isnt obtaining the parallelized performance improvements you expected . More information is available in the MSDN Magazine article .NET Matters: False Sharing.

RECURSION WITHOUT THRESHOLDS

In a typical introductory algorithms course, computer science students learn about various algorithms for sorting, often culminating in quicksort. Quicksort is a recursive divide-and-conquer algorithm, where the input array to be sorted is partitioned into two contiguous chunks, one with values less than a chosen pivot and one with values greater than or equal to a chosen pivot. Once the array has been partitioned, the quicksort routine may be used recursively to sort each chunk. The recursion ends when the size of a chunk is one element, since one element is implicitly sorted.

Patterns of Parallel Programming

Page 44

Students learn that quicksort has an average algorithmic complexity of O(N log N), which for large values of N is 2 much faster than other algorithms like insertion sort which have a complexity of O(N ). They also learn, however, that big-O notation focuses on the limiting behavior of functions and ignores constants, because as the value of N grows, the constants arent relevant. Yet when N is small, those constants can actually make a difference. It turns out that constants involved in quicksort are larger than those involved in insertion sort, and as such, for small values of N, insertion sort is often faster than quicksort. Due to quicksorts recursive nature, even if the operation starts out operating on a large N, at some point in the recursion the value of N for that particular call is small enough that its actually better to use insertion sort. Thus, many quality implementations of quicksort wont stop the recursion when a chunk size is one, but rather will choose a higher value, and when that threshold is reached, the algorithm will switch over to a call to insertion sort to sort the chunk, rather than continuing with the recursive quicksort routine. As has been shown previously, quicksort is a great example for recursive decomposition with task-based parallelism, as its easy to recursively sort the left and right partitioned chunks in parallel, as shown in the following example:

Visual Basic
Shared Sub QuickSort(Of T As IComparable(Of T))( ByVal data() As T, ByVal fromInclusive As Integer, ByVal toExclusive As Integer) If toExclusive - fromInclusive <= THRESHOLD Then InsertionSort(data, fromInclusive, toExclusive) Else Dim pivotPos As Integer = Partition(data, fromInclusive, toExclusive) Parallel.Invoke( Sub() QuickSort(data, fromInclusive, pivotPos), Sub() QuickSort(data, pivotPos, toExclusive)) End If End Sub

Youll note, however, that in addition to the costs associated with the quicksort algorithm itself, we now have additional overheads involved with creating tasks for each half of the sort. If the computation is completely balanced, at some depth into the recursion we will have saturated all processors. For example, on a dual-core machine, the first level of recursion will create two tasks, and thus theoretically from that point forward were saturating the machine and theres no need to continue to bear the overhe ad of additional tasks. This implies that we now may benefit from a second threshold: in addition to switching from quicksort to insertion sort at some threshold, we now also want to switch from parallel to serial at some threshold. That threshold may be defined in a variety of ways. As with the insertion sort threshold, a simple parallel threshold could be based on the amount of data left to be processed:

Visual Basic
Shared Sub QuickSort(Of T As IComparable(Of T))( ByVal data() As T, ByVal fromInclusive As Integer, ByVal toExclusive As Integer) If toExclusive - fromInclusive <= THRESHOLD Then InsertionSort(data, fromInclusive, toExclusive) Else : Dim pivotPos = Partition(data, fromInclusive, toExclusive) If toExclusive - fromInclusive <= PARALLEL_THRESHOLD Then

Patterns of Parallel Programming

Page 45

' NOTE: PARALLEL_THRESHOLD is chosen to be greater than THRESHOLD. QuickSort(data, fromInclusive, pivotPos) QuickSort(data, pivotPos, toExclusive) Else Parallel.Invoke( Sub() QuickSort(data, fromInclusive, pivotPos), Sub() QuickSort(data, pivotPos, toExclusive)) End If End If End Sub

Another simple threshold may be based on depth. We can initialize the depth to the max depth we want to recur to in parallel, and decrement the depth each time we recur when it reaches 0, we fall back to serial.

Visual Basic
Shared Sub QuickSort(Of T As IComparable(Of T))( ByVal data() As T, ByVal fromInclusive As Integer, ByVal toExclusive As Integer, ByVal depth As Integer) If toExclusive - fromInclusive <= THRESHOLD Then InsertionSort(data, fromInclusive, toExclusive) Else Dim pivotPos = Partition(data, fromInclusive, toExclusive) If depth > 0 Then Parallel.Invoke( Sub() QuickSort(data, fromInclusive, pivotPos, depth - 1), Sub() QuickSort(data, pivotPos, toExclusive, depth - 1)) Else QuickSort(data, fromInclusive, pivotPos, 0) QuickSort(data, pivotPos, toExclusive, 0) End If End If End Sub

If you assume that the parallelism will be completely balanced due to equal work resulting from all partition operations, you might then base the initial depth on the number of cores in the machine:

Visual Basic
QuickSort(data, 0, data.Length, Math.Log(Environment.ProcessorCount, 2))

Alternatively, you might provide a bit of extra breathing room in case the problem space isnt perfectly balanced:

Visual Basic
QuickSort(data, 0, data.Length, Math.Log(Environment.ProcessorCount, 2) + 1)

Of course, the partitioning may result in very unbalanced workloads. And quicksort is just one example of an algorithm; many other algorithms that are recursive in this manner will frequently result in very unbalanced workloads. Another approach is to keep track of the number of outstanding work items, and only go parallel when the number of outstanding items is below a threshold. An example of this follows:

Visual Basic
Patterns of Parallel Programming Page 46

Friend Class Utilities Private Shared CONC_LIMIT = Environment.ProcessorCount * 2 Dim _invokeCalls = 0 Public Sub QuickSort(Of T As IComparable(Of T))( ByVal data() As T, ByVal fromInclusive As Integer, ByVal toExclusive As Integer) If toExclusive - fromInclusive <= THRESHOLD Then InsertionSort(data, fromInclusive, toExclusive) Else Dim pivotPos As Integer = Partition(data, fromInclusive, toExclusive) If Thread.VolatileRead(_invokeCalls) < CONC_LIMIT Then Interlocked.Increment(_invokeCalls) Parallel.Invoke( Sub() QuickSort(data, fromInclusive, pivotPos), Sub() QuickSort(data, pivotPos, toExclusive)) Interlocked.Decrement(_invokeCalls) Else QuickSort(data, fromInclusive, pivotPos) QuickSort(data, pivotPos, toExclusive) End If End If End Sub End Class

Here, were keeping track of the number of Parallel.Invoke calls active at any one time. When the number is below a predetermined limit, we recur using Parallel.Invoke; otherwise, we recur serially. This adds the additional expense of two interlocked operations per recursive call (and is only an approximation, as the _invokeCalls field is compared to the threshold outside of any synchronization), forcing synchronization where it otherwise wasnt needed, but it also allows for more load-balancing. Previously, once a recursive path was serial, it would remain serial. With this modification, a serial path through QuickSort may recur and result in a parallel path.

Patterns of Parallel Programming

Page 47

PASSING DATA
There are several common patterns in the .NET Framework for passing data to asynchronous work.

CLOSURES
Since support for them was added to C# and Visual Basic, closures represent the easiest way to pass data into background operations. By creating delegates that refer to state outside of their scope, the compiler transforms the accessed variables in a way that makes them accessible to the delegates, closing over those variables . This makes it easy to pass varying amounts of data into background work:

Visual Basic
Private data1 As Integer = 42 Private data2 As String = "The Answer to the Ultimate Question of " & "Life, the Universe, and Everything" Task.Factory.StartNew(Sub() Console.WriteLine(data2 & ": " & data1))

For applications in need of the utmost in performance and scalability, its important to keep in mind that under the covers the compiler may actually be allocating an object in which to store the variables (in the above example, data1 and data2) that are accessed by the delegate.

STATE OBJECTS
Dating back to the beginning of the .NET Framework, many APIs that spawn asynchronous work accept a state parameter and pass that state object into the delegate that represents the body of work. The ThreadPool.QueueUserWorkItem method is a quintessential example of this:

Visual Basic
Public Shared Function QueueUserWorkItem( ByVal callBack As WaitCallback, ByVal state As Object) As Boolean ... Public Delegate Sub WaitCallback(ByVal state As Object)

We can take advantage of this state parameter to pass a single object of data into the WaitCallback:

Visual Basic
ThreadPool.QueueUserWorkItem(Sub(state) Console.WriteLine(TryCast(state, String)) End Sub, data2)

The Task class in the .NET Framework 4 also supports this pattern:

Visual Basic
Task.Factory.StartNew(Sub(state) Console.WriteLine(TryCast(state, String)) End Sub, data2)

Note that in contrast to the closures approach, this typically does not cause an extra object allocation to handle the state, unless the state being supplied is a value type (value types must be boxed to supply them as the object state parameter). Patterns of Parallel Programming Page 48

To pass in multiple pieces of data with this approach, those pieces of data must be wrapped into a single object. In the past, this was typically a custom class to store specific pieces of information. With the .NET Framework 4, the new Tuple<> classes may be used instead:

Visual Basic
Dim data As Tuple(Of Integer, String) = Tuple.Create(data1, data2) Task.Factory.StartNew(Sub(state) Dim d = DirectCast(state, Tuple(Of Integer, String)) Console.WriteLine(d.Item2 & ": " & d.Item1) End Sub, data)

As with both closures and working with value types, this requires an object allocation to support the creation of the tuple to wrap the data items. The built-in tuple types in the .NET Framework 4 also support a limited number of contained pieces of data.

STATE OBJECTS WITH MEMBER METHODS

Another approach, similar to the former, is to pass data into asynchronous operations by representing the work to be done asynchronously as an instance method on a class. This allows data to be passed in to that method implicitly through the this reference.

Visual Basic
Class Work Public Data1 As Integer Public Data2 As String Public Sub Run() Console.WriteLine(Data1 & ": " & Data2) End Sub End Class ... Dim w As New Work() w.Data1 = 42 w.Data2 = "The Answer to the Ultimate Question of " & "Life, the Universe, and Everything" Task.Factory.StartNew(AddressOf w.Run)

As with the previous approaches, this approach requires an object allocation for an object (in this case, of class Work) to store the state. Such an allocation is still required if Work is a struct instead of a class; this is because the creation of a delegate referring to Work must reference the object on which to invoke the instance method Run, and that reference is stored as an object, thus boxing the struct. As such, which of these approaches you choose is largely a matter of preference. The closures approach typically leads to the most readable code, and it allows the compiler to optimize the creation of the state objects. For example, if the anonymous delegate passed to StartNew doesnt access any local state, the compiler may be able to avoid the object allocation to store the state, as it will already be stored as accessible instance or Shared fields.

ANTI-PATTERNS CLOSING OVER INAPPROPRIATELY SHARED DATA

Patterns of Parallel Programming Page 49

Consider the following code, and hazard a guess for what it outputs:

Visual Basic
Shared Sub Main() For i = 0 To 9 ThreadPool.QueueUserWorkItem(Sub() Console.WriteLine(i)) Next i Console.ReadLine() End Sub

If you guessed that this outputs the numbers 0 through 9 inclusive, youd likely be wrong. While that might be the output, more than likely this will actually output ten 10s. The reason for this has to do with the languages rules for scoping and how it captures variables into anonymous methods, which here were used to represent the work provided to QueueUserWorkItem. The variable i is shared by both the main thread queuing the work items and the ThreadPool threads printing out the value of i. The main thread is continually updating the value of i as it iterates from 0 through 9, and thus each output line will contain the value of i at whatever moment the Console.WriteLine call occurs on the background thread. (Note that unlike the C# compiler, the Visual Basic compiler kindly warns about this issue: warning BC42324: Using the iteration variable in a lambda expression may have unexpected results. Instead, create a local variable within the loop and assign it the value of the iteration variable.) This phenomenon isnt limited to parallel programming, though the prominence of anonymous methods and lambda expressions in the the .NET Framework parallel programming model does exacerbate the issue. For a serial example, consider the following code:

Visual Basic
Shared Sub Main() Dim actions = New List(Of Action)() For i = 0 To 9 actions.Add(Sub() Console.WriteLine(i)) Next i actions.ForEach(Sub(action) action()) End Sub

This code will reliably output ten 10s, as by the time the Action delegates are invoked, the value of i is already 10, and all of the delegates are referring to the same captured i variable. To address this issue, we can create a local copy of the iteration variable in scope inside the loop (as was recommended by the Visual Basic compiler). This will cause each anonymous method to gain its own variable, rather than sharing them with other delegates. The sequential code shown earlier can be fixed with a small alteration:

Visual Basic
Shared Sub Main() Dim actions = New List(Of Action)() For i = 0 To 9 Dim tmp = i actions.Add(Sub() Console.WriteLine(tmp)) Next i actions.ForEach(Sub(action) action()) End Sub

Patterns of Parallel Programming

Page 50

This will reliably print out the sequence 0 through 9 as expected. The parallel code can be fixed in a similar manner:

Visual Basic
Shared Sub Main() For i = 0 To 9 Dim tmp = i ThreadPool.QueueUserWorkItem(Sub() Console.WriteLine(tmp)) Next i Console.ReadLine() End Sub

This will also reliably print out the values 0 through 9, although the order in which theyre printed is not guaranteed. Another similar case where closure semantics can lead you astray is if youre in the habit of declaring your variables at the top of your function, and then using them later on. For example:

Visual Basic
Shared Sub Main(ByVal args() As String) Dim j As Integer Parallel.For(0, 10000, Sub(i) Dim total = 0 For j = 1 To 10000 total += j Next j End Sub) End Sub

Due to closure semantics, the j variable will be shared by all iterations of the parallel loop, thus wreaking havoc on the inner serial loop. To address this, the variable declarations should be moved as close to their usage as possible:

Visual Basic
Shared Sub Main(ByVal args() As String) Parallel.For(0, 10000, Sub(i) Dim total As Integer = 0 For j = 1 To 10000 total += j Next j End Sub) End Sub

Patterns of Parallel Programming

Page 51

PRODUCER/CONSUMER
The real world revolves around the producer/consumer pattern. Individual entities are responsible for certain functions, where some entities generate material that ends up being consumed by others. In some cases, those consumers are also producers for even further consumers. Sometimes there are multiple producers per consumer, sometimes there are multiple consumers per producer, and sometimes theres a many -to-many relationship. We live and breathe producer/consumer, and the pattern similarly has a very high value in parallel computing. Often, producer/consumer relationships are applied to parallelization when theres no ability to parallelize an individual operation, but when multiple operations may be carried out concurrently, with one having a dependency on the other. For example, consider the need to both compress and encrypt a particular file. This can be done sequentially, with a single thread reading in a chunk of data, compressing it, encrypting the compressed data, writing out the encrypted data, and then repeating the process for more chunks until the input file has been completely processed. Depending on the compression and encryption algorithms utilized, there may not be the ability to parallelize an individual compression or encryption, and the same data certainly cant be compressed concurrently with it being encrypted, as the encryption algorithm must run over the compressed data rather than over the uncompressed input. Instead, multiple threads may be employed to form a pipeline. One thread can read in the data. That thread can hand the read data off to another thread that compresses it, and in turn hands the compressed data off to a third thread. The third thread can then encrypt it, and pass it off to a fourth thread, which writes the encrypted data to the output file. Each processing agent, or actor, in this scheme is serial in nature, churning its input into output, and as long as the hand-offs between agents dont introduce any reordering operations, the output data from the entire process will emerge in the same order the associated data was input. Those hand-offs can be managed with the new BlockingCollection<> type, which provides key support for this pattern in the .NET Framework 4.

PIPELINES
Hand-offs between threads in a parallelized system require shared state: the producer needs to put the output data somewhere, and the consumer needs to know where to look to get its input data. More than just having access to a storage location, however, there is additional communication thats necessary . A consumer is often prevented from making forward progress until theres some data to be consumed . Additionally, in some systems, a producer needs to be throttled so as to avoid producing data much faster than consumers can consume it. In both of these cases, a notification mechanism must also be incorporated. Additionally, with multiple producers and multiple consumers, participants must not trample on each other as they access the storage location. We can build a simple version of such a hand-off mechanism using a Queue<T> and a SemaphoreSlim:

Visual Basic
Class BlockingQueue(Of T) Private _queue As New Queue(Of T)() Private _semaphore As New SemaphoreSlim(0, Integer.MaxValue) Public Sub Enqueue(ByVal data As T) If data Is Nothing Then Throw New ArgumentNullException("data") SyncLock _queue _queue.Enqueue(data) End SyncLock _semaphore.Release()

Patterns of Parallel Programming

Page 52

End Sub Public Function Dequeue() As T _semaphore.Wait() SyncLock _queue Return _queue.Dequeue() End SyncLock End Function End Class

Here we have a very simple blocking queue data structure. Producers call Enqueue to add data into the queue, which adds the data to an internal Queue<T> and notifies consumers using a semaphore that another element of data is available. Similarly, consumers use Dequeue to wait for an element of data to be available and then remove that data from the underlying Queue<T>. Note that because multiple threads could be accessing the data structure concurrently, a lock is used to protect the non-thread-safe Queue<T> instance. Another similar implementation makes use of Monitors notification capabilities instead of using a semaphore:

Visual Basic
Class BlockingQueue(Of T) Private _queue As New Queue(Of T)() Public Sub Enqueue(ByVal data As T) If data Is Nothing Then Throw New ArgumentNullException("data") SyncLock _queue _queue.Enqueue(data) Monitor.Pulse(_queue) End SyncLock End Sub Public Function Dequeue() As T SyncLock _queue Do While _queue.Count = 0 Monitor.Wait(_queue) Loop Return _queue.Dequeue() End SyncLock End Function End Class

Such implementations provide basic support for data hand-offs between threads, but they also lack several important things. How do producers communicate that there will be no more elements produced? With this blocking behavior, what if a consumer only wants to block for a limited amount of time before doing something else? What if producers need to be throttled, such that if the underlying Queue<T> is full theyre blocked from adding to it? What if you want to pull from one of several blocking queues rather than from a single one? What if semantics others than first-in-first-out (FIFO) are required of the underlying storage? What if producers and consumers need to be canceled? And so forth. All of these questions have answers in the new .NET Framework 4 System.Collections.Concurrent.BlockingCollection<T> type in System.dll. It provides the same basic behavior as shown in the nave implementation above, sporting methods to add to and take from the collection. But it also supports throttling both consumers and producers, timeouts on waits, support for arbitrary underlying data

Patterns of Parallel Programming

Page 53

structures, and more. It also provides built-in implementations of typical coding patterns related to producer/consumer in order to make such patterns simple to utilize. As an example of a standard producer/consumer pattern, consider the need to read in a file, transform each line using a regular expression, and write out the transformed line to a new file. We can implement that using a Task to run each step of the pipeline asynchronously, and BlockingCollection<string> as the hand-off point between each stage.

Visual Basic
Shared Sub ProcessFile(ByVal inputPath As String, ByVal outputPath As String) Dim inputLines = New BlockingCollection(Of String)() Dim processedLines = New BlockingCollection(Of String)() ' Stage #1 Dim readLines = Task.Factory.StartNew( Sub() Try For Each line In File.ReadLines(inputPath) inputLines.Add(line) Next line Finally inputLines.CompleteAdding() End Try End Sub) ' Stage #2 Dim processLines = Task.Factory.StartNew( Sub() Try For Each line In inputLines.GetConsumingEnumerable() .Select(Function(line) Regex.Replace(line, "\s+", ", ")) processedLines.Add(line) Next line Finally processedLines.CompleteAdding() End Try End Sub) ' Stage #3 Dim writeLines = Task.Factory.StartNew( Sub() File.WriteAllLines(outputPath, processedLines.GetConsumingEnumerable())) Task.WaitAll(readLines, processLines, writeLines) End Sub

With this basic structure coded up, we have a lot of flexibility and room for modification. For example, what if we discover from performance testing that were reading from the input file much faster than the processing and outputting can handle it? One option is to limit the speed at which the input file is read, which can be done by modifying how the inputLines collection is created:

Visual Basic
Dim inputLines = New BlockingCollection(Of String)(boundedCapacity:=20)

Patterns of Parallel Programming

Page 54

By adding the boundedCapacity parameter (shown here for clarity using named parameter functionality, which is now supported by both C# and Visual Basic in Visual Studio 2010), a producer attempting to add to the collection will block until there are less than 20 elements in the collection, thus slowing down the file reader. Alternatively, we could further parallelize the solution. For example, lets assume that through testing you found the real problem to be that the processLines Task was heavily compute bound. To address that, you could parallelize it using PLINQ in order to utilize more cores:

Visual Basic
For Each line In inputLines.GetConsumingEnumerable() .AsParallel().AsOrdered() .Select(Function(line) Regex.Replace(line, "\s+", ", "))

Note that by specifying .AsOrdered() after the .AsParallel(), were ensuring that PLINQ maintains the same ordering as in the sequential solution.

DECORATOR TO PIPELINE
The decorator pattern is one of the original Gang Of Four design patterns. A decorator is an object that has the same interface as another object it contains. In object-oriented terms, it is an object that has an "is-a" and a "hasa" relationship with a specific type. Consider the CryptoStream class in the System.Security.Cryptography namespace. CryptoStream derives from Stream (it "is-a" Stream), but it also accepts a Stream to its constructor and stores that Stream internally (it "has-a" stream); that underlying stream is where the encrypted data is stored. CryptoStream is a decorator. With decorators, we typically chain them together. For example, as alluded to in the introduction to this section on producer/consumer, a common need in software is to both compress and encrypt data. The .NET Framework contains two decorator stream types to make this feasible: the CryptoStream class already mentioned, and the GZipStream class. We can compress and encrypt an input file into an output file with code like the following:

Visual Basic
Shared Sub CompressAndEncrypt( ByVal inputFile As String, ByVal outputFile As String) Using input = File.OpenRead(inputFile) Using output = File.OpenWrite(outputFile) Using rijndael = New RijndaelManaged() Using transform = rijndael.CreateEncryptor() Using encryptor = New CryptoStream(output, transform, CryptoStreamMode.Write) Using compressor = New GZipStream(encryptor, CompressionMode.Compress, True) input.CopyTo(compressor) End Using End Using End Using End Using End Using End Using End Sub

The input file stream is copied to a GZipStream, which wraps a CryptoStream, which wraps the output stream. The data flows from one stream to the other, with its data modified along the way.

Patterns of Parallel Programming

Page 55

Both compression and encryption are computationally intense operations, and as such it can be beneficial to parallelize this operation. However, given the nature of the problem, its not just as simple as running both the compression and encryption in parallel on the input stream, since the encryption operates on the output of the compression. Instead, we can form a pipeline, with the output of the compression being fed as the input to the encryption, such that while the encryption is processing data block N, the compression routine can have already moved on to be processing N+1 or greater. To make this simple, well implement it with another decorator, a TransferStream. The idea behind this stream is that writes are offloaded to another thread, which sequentially writes to the underlying stream all of the writes to the transfer stream. That way, when code calls Write on the transfer stream, its not blocked waiting for the whole chain of decorators to complete their processi ng: Write returns immediately after queuing the work, and the caller can go on to do additional work. A simple implementation of TransferStream is shown below (relying on a custom Stream base type, which simply implements the abstract Stream class with default implementations of all abstract members, in order to keep the code shown here small), taking advantage of both Task and BlockingCollection:

Visual Basic
Public NotInheritable Class TransferStream Inherits AbstractStreamBase Private _writeableStream As Stream Private _chunks As BlockingCollection(Of Byte()) Private _processingTask As Task Public Sub New(ByVal writeableStream As Stream) ' ... Would validate arguments here _writeableStream = writeableStream _chunks = New BlockingCollection(Of Byte())() _processingTask = Task.Factory.StartNew(Sub() For Each chunk In _chunks.GetConsumingEnumerable() _writeableStream.Write(chunk, 0, chunk.Length) Next End Sub, TaskCreationOptions.LongRunning) End Sub Public Overrides ReadOnly Property CanWrite() As Boolean Get Return True End Get End Property Public Overrides Sub Write( ByVal buffer() As Byte, ByVal offset As Integer, ByVal count As Integer) ' ... Would validate arguments here Dim chunk = New Byte(count - 1) {} Buffer.BlockCopy(buffer, offset, chunk, 0, count) _chunks.Add(chunk) End Sub Public Overrides Sub Close() _chunks.CompleteAdding() Try _processingTask.Wait() Finally MyBase.Close()

Patterns of Parallel Programming

Page 56

End Try End Sub End Class

The constructor stores the underlying stream to be written to. It then sets up the necessary components of the parallel pipeline. First, it creates a BlockingCollection<byte[]> to store all of the data chunks to be written. Then, it launches a long-running Task that continually pulls from the collection and writes each chunk out to the underlying stream. The Write method copies the provided input data into a new array which it enqueues to the BlockingCollection; by default, BlockingCollection uses a queue data structure under the covers, maintaining firstin-first-out (FIFO) semantics, so the data will be written to the underlying stream in the same order its added to the collection, a property important for dealing with streams which have an implicit ordering. Finally, closing the stream marks the BlockingCollection as complete for adding, which will cause the consuming loop in the Task launched in the constructor to cease as soon as the collection is empty, and then waits for the Task to complete; this ensures that all data is written to the underlying stream before the underlying stream is closed, and it propagates any exceptions that may have occurred during processing. With our TransferStream in place, we can now use it to parallelize our compression/encryption snippet shown earlier:

Visual Basic
Shared Sub CompressAndEncrypt( ByVal inputFile As String, ByVal outputFile As String) Using input = File.OpenRead(inputFile) Using output = File.OpenWrite(outputFile) Using rijndael = New RijndaelManaged() Using transform = rijndael.CreateEncryptor() Using encryptor = New CryptoStream(output, transform, CryptoStreamMode.Write) Using threadTransfer = New TransferStream(encryptor) Using compressor = New GZipStream(threadTransfer, CompressionMode.Compress, True) input.CopyTo(compressor) End Using End Using End Using End Using End Using End Using End Using End Sub

With those simple changes, weve now modified the operation so that both the compression and the encryption may run in parallel. Of course, its important to note here that there are implicit limits on how much speedup I can achieve from this kind of parallelization. At best the code is doing only two elements of work concurrently, overlapping the compression with encryption, and thus even on a machine with more than two cores, the best speedup I can hope to achieve is 2x. Note, too, that I could use additional transfer streams in order to read concurrently with compressing and to write concurrently with encrypting, as such:

Visual Basic
Shared Sub CompressAndEncrypt( ByVal inputFile As String, ByVal outputFile As String) Using input = File.OpenRead(inputFile)

Patterns of Parallel Programming

Page 57

output = File.OpenWrite(outputFile) t2 = New TransferStream(output)) rijndael = New RijndaelManaged() transform = rijndael.CreateEncryptor() encryptor = New CryptoStream(t2, transform, CryptoStreamMode.Write) Using t1 = New TransferStream(encryptor)) Using threadTransfer = New TransferStream(encryptor) Using compressor = New GZipStream(t1, CompressionMode.Compress, True) Using t0 = New TransferStream(compressor)) input.CopyTo(t0) End Using End Using End Using End Using End Using End Using End Using End Using End Using End Using End Sub

Using Using Using Using Using

Benefits of doing this might manifest if I/O is a bottleneck.

IPRODUCERCONSUMERCOLLECTION<T>
As mentioned, BlockingCollection<T> defaults to using a queue as its storage mechanism, but arbitrary storage mechanisms are supported. This is done utilizing a new interface in the .NET Framework 4, passing an instance of an implementing type to the BlockingCollections constructor:

Visual Basic
Public Interface IProducerConsumerCollection(Of T) Inherits IEnumerable(Of T), ICollection, IEnumerable Function TryAdd(ByVal item As T) As Boolean Function TryTake(<Out()> ByRef item As T) As Boolean Function ToArray() As T() Sub CopyTo(ByVal array() As T, ByVal index As Integer) End Interface Public Class MyBlockingCollection(Of T) '... '... Public Sub New(ByVal collection As IProducerConsumerCollection(Of T)) End Sub Public Sub New(ByVal collection As IProducerConsumerCollection(Of T), ByVal boundedCapacity As Integer) End Sub '... End Class

Patterns of Parallel Programming

Page 58

Aptly named to contain the name of this pattern, IProducerConsumerCollection<T> represents a collection used in producer/consumer implementations, where data will be added to the collection by producers and taken from it by consumers. Hence, the primary two methods on the interface are TryAdd and TryTake, both of which must be implemented in a thread-safe and atomic manner. The .NET Framework 4 provides three concrete implementations of this interface: ConcurrentQueue<T>, ConcurrentStack<T>, and ConcurrentBag<T>. ConcurrentQueue<T> is the implementation of the interface used by default by BlockingCollection<T>, providing first-in-first-out (FIFO) semantics. ConcurrentStack<T> provides last-in-first-out (LIFO) behavior, and ConcurrentBag<T> eschews ordering guarantees in favor of improved performance in various use cases, in particular those in which the same thread will be acting as both a producer and a consumer. In addition to BlockingCollection<T>, other data structures may be built around IProducerConsumerCollection<T>. For example, an object pool is a simple data structure thats m eant to allow object reuse. We could build a concurrent object pool by tying it to a particular storage type, or we can implement one in terms of IProducerConsumerCollection<T>.

Visual Basic
Public NotInheritable Class ObjectPool(Of T) Private _generator As Func(Of T) Private _objects As IProducerConsumerCollection(Of T) Public Sub New(ByVal generator As Func(Of T)) Me.New(generator, New ConcurrentQueue(Of T)()) End Sub Public Sub New(ByVal generator As Func(Of T), ByVal storage As IProducerConsumerCollection(Of T)) If generator Is Nothing Then Throw New ArgumentNullException("generator") If storage Is Nothing Then Throw New ArgumentNullException("storage") _generator = generator _objects = storage End Sub Public Function [Get]() As T Dim item As T If Not _objects.TryTake(item) Then item = _generator() Return item End Function Public Sub Put(ByVal item As T) _objects.TryAdd(item) End Sub End Class

By parameterizing the storage in this manner, we can adapt our ObjectPool<T> based on use cases and the associated strengths of the collection implementation. For example, for doing a graphics-intensive UI application, we may want to render to buffers on background threads and then bitbl ip those buffers onto the UI on the UI thread. Given the likely size of these buffers, rather than continually allocating large objects and forcing the

Patterns of Parallel Programming

Page 59

garbage collector to clean up after me, we can pool them. In this case, a ConcurrentQueue<T> is a likely choice for the underlying storage. Conversely, if the pool were being used in a concurrent memory allocator to cache objects of varying sizes, I dont need the FIFO-ness of ConcurrentQueue<T>, and I would be better off with a data structure that minimizes synchronization between threads; for this purpose, ConcurrentBag<T> might be ideal. Under the covers, ConcurrentBag<T> utilizes a list of instances of T per thread. Each thread that accesses the bag is able to add and remove data in relative isolation from other threads accessing the bag. Only when a thread tries to take data out and its local list is empty will it go in search of items from other threads (the implementation makes the thread-local lists visible to other threads for only this purpose). This might sound familiar: ConcurrentBag<T> implements a pattern very similar to the workstealing algorithm employed by the the .NET Framework 4 ThreadPool. While accessing the local list is relatively inexpensive, stealing from another threads list is relatively quite expensive. As a result, ConcurrentBag<T> is best for situations where each thread only needs its own local list the majority of the time. In the object pool example, to assist with this it could be worthwhile for every thread to initially populate the pool with some objects, such that when it later gets and puts objects, it will be dealing predominantly with its own queue.

PRODUCER/CONSUMER EVERYWHERE
If youve written a Windows-based application, its extremely likely youve used the producer/consumer pattern, potentially without even realizing it. Producer/consumer has many prominent implementations.

THREAD POOLS
If youve used a thread pool, youve used a quintessential implementation of the producer/consumer pattern . A thread pool is typically engineered around a data structure containing work to be performed. Every thread in the pool monitors this data structure, waiting for work to arrive. When work does arrive, a thread retrieves the work, processes it, and goes back to wait for more. In this capacity, the work thats being produced is consumed by the threads in the pool and executed. Utilizing the BlockingCollection<T> type weve already seen, its straightforward to build a simple, no-frills thread pool:

Visual Basic
Public NotInheritable Class SimpleThreadPool Private Shared _work As New BlockingCollection(Of Action)() Shared Sub New() For i = 0 To Environment.ProcessorCount - 1 Dim tmp = New Thread( Sub() For Each action In _work.GetConsumingEnumerable() action() Next action End Sub)

Patterns of Parallel Programming

Page 60

tmp.IsBackground = True tmp.Start() Next i End Sub Public Shared Sub QueueWorkItem(ByVal workItem As Action) _work.Add(workItem) End Sub End Class

In concept, this is very similar to how the ThreadPool type in the .NET Framework 3.5 and earlier operated. In the .NET Framework 4, the data structure used to store the work to be executed is more distributed. Rather than maintaining a single global queue, as is done in the above example, the ThreadPool in .NET Framework 4 maintains not only a global queue but also a queue per thread. Work generated outside of the pool goes into the global queue as it always did, but threads in the pool can put their generated work into the thread-local queues rather than into the global queues. When threads go in search of work to be executed, they first examine their local queue, and only if they dont find anything there, they then check the global queue. If the global queue is found to be empty, the threads are then also able to check the queues of their peers, stealing work from other threads in order to stay busy. This work-stealing approach can provide significant benefits in the form of both minimized contention and synchronization between threads (in an ideal workload, threads can spend most of their time working on their own local queues) as well as cache utilization. (You can approximate this behavior with the SimpleThreadPool by instantiating the BlockingCollection<Action> with an underlying ConcurrentBag<Action> rather than utilizing the default ConcurrentQueue<Action>.) In the previous paragraph, we said that threads in the pool can put their generated work into the thread-local queues, not that they necessarily do. In fact, the ThreadPool.QueueUserWorkItem method is unable to take advantage of this workstealing support. The functionality is only available through Tasks, for which it is turned on by default. This behavior can be disabled on a per-Task basis using TaskCreationOptions.PreferFairness. By default, Tasks execute in the ThreadPool using these internal work-stealing queues. This functionality isnt hardwired into Tasks, however. Rather, the functionality is abstracted through the TaskScheduler type. Tasks execute on TaskSchedulers, and the .NET Framework 4 comes with a built-in TaskScheduler that targets this functionality in the ThreadPool; this implementation is whats returned from the TaskScheduler.Default property, and as this propertys name implies, this is the default scheduler used by Tasks. As with anything where someone talks about a default, theres usually a mechanism to override the default, and that does in fact exist for Task execution. Its possible to write custom TaskScheduler implementations to execute Tasks in whatever manner is needed by the application. TaskScheduler itself embodies the concept of producer/consumer. As an abstract class, it provides several abstract methods that must be overridden and a few virtual methods that may be. The primary abstract method is called QueueTask, and is used by the rest of the .NET Framework infrastructure, acting as the producer, to queue tasks into the scheduler. The scheduler implementation then acts as the consumer, executing those tasks in whatever manner it sees fit. We can build a very simple, no frills TaskScheduler, based on the previously shown SimpleThreadPool, simply by delegating from QueueTask to QueueWorkItem, using a delegate that executes the task:

Patterns of Parallel Programming

Page 61

Visual Basic
Public NotInheritable Class SimpleThreadPoolTaskScheduler Inherits TaskScheduler Protected Overrides Sub QueueTask(ByVal task As Task) SimpleThreadPool.QueueWorkItem(Function() MyBase.TryExecuteTask(task)) End Sub Protected Overrides Function TryExecuteTaskInline(ByVal task As Task, ByVal taskWasPreviouslyQueued As Boolean) As Boolean Return MyBase.TryExecuteTask(task) End Function Protected Overrides Function GetScheduledTasks() As IEnumerable(Of Task) Throw New NotSupportedException() End Function End Class

We can then produce tasks to be run on an instance of this scheduler:

Visual Basic
Dim myScheduler = New SimpleThreadPoolTaskScheduler() Dim t = New Task(Sub() Console.WriteLine("hello, world")) t.Start(myScheduler)

The TaskFactory class, a default instance of which is returned from the Shared Task.Factory property, may also be instantiated with a TaskScheduler instance. This then allows us to easily utilize all of the factory methods while targeting a custom scheduler:

Visual Basic
Dim factory = New TaskFactory(New SimpleThreadPoolTaskScheduler()) factory.StartNew(Sub() Console.WriteLine("hello, world"))

UI MARSHALING
If youve written a responsive Windows-based application, youve already taken advantage of the producer/consumer pattern. With both Windows Forms and Windows Presentation Foundation (WPF), UI controls must only be accessed from the same thread that created them, a form of thread affinity. This is problematic for several reasons, one of the most evident having to do with UI responsiveness. To write a response application, its typically necessary to offload work from the UI thread to a background thread, in order to allow that UI thread to continue processing Windows messages that cause the UI to repaint, to respond to mouse input, and so on. That processing occurs with code referred to as a Windows message loop. While the work is executing in the background, it may need to update visual progress indication in the UI, and when it completes, it may need to refresh the UI in some manner. Those interactions often require the manipulation of controls that were created on the UI thread, and as a result, the background thread must marshal calls to those controls to the UI thread. Both Windows Forms and WPF provide mechanisms for doing this. Windows Forms provides the instance Invoke method on the Control class. This method accepts a delegate, and marshals the execution of that delegate to the right thread for that Control, as demonstrated in the following Windows-based application that updates a label on the UI thread every second:

Patterns of Parallel Programming

Page 62

Visual Basic
Imports Imports Imports Imports System System.Drawing System.Threading System.Windows.Forms

NotInheritable Class Program <STAThread()> Shared Sub Main(ByVal args() As String) Dim form = New Form() Dim lbl = New Label() With { .Dock = DockStyle.Fill, .TextAlign = ContentAlignment.MiddleCenter} form.Controls.Add(lbl) Dim handle = form.Handle ThreadPool.QueueUserWorkItem( Sub(state) While (True) lbl.Invoke(Sub() lbl.Text = Date.Now.ToString()) Thread.Sleep(1000) End While End Sub) form.ShowDialog() End Sub End Class

The Invoke call is synchronous, in that it wont return until the delegate has completed execution . There is also a BeginInvoke method, which runs the delegate asynchronously. This mechanism is itself a producer/consumer implementation. Windows Forms maintains a queue of delegates to be processed by the UI thread. When Invoke or BeginInvoke is called, it puts the delegate into this queue, and sends a Windows message to the UI thread. The UI threads message loop eventually processes this message, which tells it to dequeue a delegate from the queue and execute it. In this manner, the thread calling Invoke or BeginInvoke is the producer, the UI thread is the consumer, and the data being produced and consumed is the delegate. The particular pattern of producer/consumer employed by Invoke has a special name, rendezvous, which is typically used to signify multiple threads that meet to exchange data bidirectionally. The caller of Invoke is providing a delegate and is potentially getting back the result of that delegates invocation . The UI thread is receiving a delegate and is potentially handing over the delegates result . Neither thread may progress past the rendezvous point until the data has been fully exchanged. This producer/consumer mechanism is available for WPF as well, through the Dispatcher class, which similarly provides Invoke and BeginInvoke methods. To abstract away this functionality and to make it easier to write components that need to marshal to the UI and that must be usable in multiple UI environments, the .NET Framework provides the SynchronizationContext class. SynchronizationContext provides Send and Post methods, which map to Invoke and BeginInvoke, respectively. Windows Forms provides an internal SynchronizationContext-derived type called WindowsFormsSynchronizationContext, which overrides Send to call Patterns of Parallel Programming Page 63

Control.Invoke and which overrides Post to call Control.BeginInvoke. WPF provides a similar type. With this in hand, a library can be written in terms of SynchronizationContext, and can then be supplied with the right SynchronziationContext at runtime to ensure its able to marshal appropriately to the UI in the current environment. SynchronizationContext may also be used for other purposes, and in fact there are other implementations of it provided in the .NET Framework for non-UI related purposes. For this discussion, however, well continue to refer to SynchronizationContext pertaining only to UI marshaling. To facilitate this, the Shared SynchronizationContext.Current property exists to help code grab a reference to a SynchronizationContext that may be used to marshal to the current thread. Both Windows Forms and WPF set this property on the UI thread to the relevant SynchronizationContext instance. Code may then get the value of this property and use it to marshal work back to the UI. As an example, I can rewrite the previous example by using SynchronizationContext.Send rather than explicitly using Control.Invoke:

Visual Basic
<STAThread()> Shared Sub Main(ByVal args() As String) Dim form = New Form() Dim lbl = New Label() With { .Dock = DockStyle.Fill, .TextAlign = ContentAlignment.MiddleCenter} form.Controls.Add(lbl) Dim handle = form.Handle Dim sc = SynchronizationContext.Current ThreadPool.QueueUserWorkItem( Sub() While (True) sc.Send(Sub() lbl.Text = Date.Now.ToString(), Nothing) Thread.Sleep(1000) End While End Sub) form.ShowDialog() End Sub

As mentioned in the previous section, custom TaskScheduler types may be implemented to supply custom consumer implementations for Tasks being produced. In addition to the default implementation of TaskScheduler that targets the .NET Framework ThreadPools internal work-stealing queues, the .NET Framework 4 also includes the TaskScheduler.FromCurrentSynchronizationContext method, which generates a TaskScheduler that targets the current synchronization context. We can then take advantage of that functionality to further abstract the previous example:

Patterns of Parallel Programming

Page 64

Dim ui = new TaskFactory( TaskScheduler.FromCurrentSynchronizationContext()) ThreadPool.QueueUserWorkItem( Sub() While (True) ui.StartNew(Sub() lbl.Text = Date.Now.ToString()) Thread.Sleep(1000) End While End Sub) form.ShowDialog() End Sub

This ability to execute Tasks in various contexts also integrates very nicely with continuations and dataflow, for example:

Visual Basic
Task.Factory.StartNew(Function() Run in the background a long computation which generates a result Return DoLongComputation() End Function).ContinueWith(Sub(t) Render the result on the UI RenderResult(t.Result) End Sub, TaskScheduler.FromCurrentSynchronizationContext())

SYSTEM EVENTS
The Microsoft.Win32.SystemEvents class exposes a plethora of Shared events for being notified about happenings in the system, for example:

Visual Basic
Public Public Public Public Public Public Public Public Public Public Public Public Public Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Shared Event Event Event Event Event Event Event Event Event Event Event Event Event DisplaySettingsChanged As EventHandler DisplaySettingsChanging As EventHandler EventsThreadShutdown As EventHandler InstalledFontsChanged As EventHandler PaletteChanged As EventHandler PowerModeChanged As PowerModeChangedEventHandler SessionEnded As SessionEndedEventHandler SessionEnding As SessionEndingEventHandler SessionSwitch As SessionSwitchEventHandler TimeChanged As EventHandler TimerElapsed As TimerElapsedEventHandler UserPreferenceChanged As UserPreferenceChangedEventHandler UserPreferenceChanging As UserPreferenceChangingEventHandler

The Windows operating system notifies applications of the conditions that lead to most of these events through Windows messages, as discussed in the previous section. To receive these messages, the application must make sure it has a window to which the relevant messages can be broadcast, and a message loop running to process them. Thus, if you subscribe to one of these events, even in an application without UI, SystemEvents ensures that a broadcast window has been created and that a thread has been created to run a message loop for it. That thread then waits for messages to arrive and consumes them by translating them into the proper .NET Framework objects and invoking the relevant event. When you register an event handler with an event on SystemEvents, in a strong sense youre then implementing the consumer side of this multithreaded, producer/consumer implementation. Patterns of Parallel Programming Page 65

AGGREGATIONS
Combining data in one way or another is very common in applications, and aggregation is an extremely common need in parallel applications. In parallel systems, work is divided up, processed in parallel, and the results of these intermediate computations are then combined in some manner to achieve a final output. In some cases, no special work is required for the last step. For example, if a parallel for loop iterates from 0 to N, and the ith result is stored into the resulting arrays ith slot, the aggregation of results into the output array can be done in parallel with no additional work: the locations in the output array may all be written to independently, and no two parallel iterations will attempt to store into the same index. In many cases, however, special work is required to ensure that the results are aggregated safely. There are several common patterns for achieving such aggregations.

OUTPUTTING A SET OF RESULTS

A common coding pattern in sequential code is of the following form, where some input data is processed, and the results are stored into an output collection:

Visual Basic
Dim output = New List(Of TOutput)() For Each item In input Dim result = Compute(item) output.Add(result) Next item

If the size of the input collection is known in advance, this can be converted into an instance of the aforementioned example, where the results are stored directly into the corresponding slots in the output:

Visual Basic
Dim output = New TOutput(input.Count - 1) {} For i = 0 To input.Count - 1 Dim result = Compute(Input(i)) output(i) = result Next i

This then makes parallelization straightforward, at least as it pertains to aggregation of the results:

Visual Basic
Dim output = New TOutput(Input.Count - 1) Parallel.For(0, input.Count, Sub(i) Dim result = Compute(Input(i)) output(i) = result End Sub)

However, this kind of transformation is not always possible. In cases where the input size is not known or where the input collection may not be indexed into, an output collection is needed that may be modified from multiple threads. This may be done using explicit synchronization to ensure the output collection is only modified by a single thread at a time:

Patterns of Parallel Programming

Page 66

Visual Basic
Dim output = New List(Of TOutput)() Parallel.ForEach(Input, Sub(item) Dim result = Compute(item) SyncLock output output.Add(result) End SyncLock End Sub)

If the amount of computation done per item is significant, the cost of this locking is likely to be negligible. However, as the amount of computation per item decreases, the overhead of taking and releasing a lock becomes more relevant, and contention on the lock increases as more threads are blocked waiting to acquire it concurrently. To decrease these overheads and to minimize contention, the new thread-safe collections in the .NET Framework 4 may be used. These collections reside in the System.Collections.Concurrent namespace, and are engineered to be scalable, minimizing the impact of contention. Some of these collections are implemented with lock-free techniques, while others are implemented using fine-grained locking. Amongst these new collections, theres no direct corollary to the List<T> type. However, there are several collections that address many of the most common usage patterns for List<T>. If you reexamine the previous code snippet, youll notice that the output ordering from the serial code is not necessarily maintained in the parallel version. This is because the order in which the data is stored into the output list is no longer based solely on the order of the data in the input, but also on the order in which the parallel loop chooses to process the elements, how partitioning occurs, and how long each element takes to process. Once weve accepted this issue and have coded the rest of the application to not rely on the output ordering, our choices expand for what collection to use to replace the list. Here Ill use the new ConcurrentBag<T> type:

Visual Basic
Dim output = New ConcurrentBag(Of TOutput)() Parallel.ForEach(Input, Sub(item) Dim result = Compute(item) output.Add(result) End Sub)

All of the synchronization necessary to ensure the consistency of the output data structure is handled internally by the ConcurrentBag.

OUTPUTTING A SINGLE RESULT

Many algorithms output a single result, rather than a single collection. For example, consider the following serial routine to estimate the value of Pi:

Visual Basic
Private Const NUM_STEPS As Integer = 100000000 Shared Function SerialPi() Dim sum = 0.0 Dim [step] = 1.0 / CDbl(NUM_STEPS) For i = 0 To NUM_STEPS - 1 Dim x = (i + 0.5) * [step]

Patterns of Parallel Programming

Page 67

Dim [partial] = 4.0 / (1.0 + x * x) sum += [partial] Next i Return [step] * sum End Function

The output of this operation is a single double value. This value is the sum of millions of independent operations, and thus should be parallelizable. Here is a nave parallelization:

Visual Basic
Shared Function NaiveParallelPi() As Double Dim sum As Double = 0.0 Dim [step] As Double = 1.0 / CDbl(NUM_STEPS) Dim obj As New Object() Parallel.For(0, NUM_STEPS, Sub(i) Dim x = (i + 0.5) * [step] Dim [partial] = 4.0 / (1.0 + x * x) SyncLock obj sum += [partial] End SyncLock End Sub) Return [step] * sum End Function

We say nave here, because while this solution is correct, it will also be extremely slow . Every iteration of the parallel loop does only a few real cycles worth of work, made up of a few additions, multiplications, and divisions, and then takes a lock to accumulate that iterations result into the overall result . The cost of that lock will dominate all of the other work happening in the parallel loop, largely serializing it, such that parallel version will likely run significantly slower than the sequential. To fix this, we need to minimize the amount of synchronization necessary. That can be achieved by maintaining local sums. We know that certain iterations will never be in conflict with each other, namely those running on the same underlying thread (since a thread can only do one thing at a time), and thus we can maintain a local sum per thread or task being used under the covers in Parallel.For. Given the prevalence of this pattern, Parallel.For actually bakes in support for it. In addition to passing to Parallel.For a delegate for the body, you can also pass in a delegate that represents an initialization routine to be run on each task used by the loop, and a delegate that represents a finalization routine that will be run at the end of the task when no more iterations will be executed in it.

Visual Basic
Public Shared Function [For](Of TLocal)( ByVal fromInclusive As Integer, ByVal toExclusive As Integer, ByVal localInit As Func(Of TLocal), ByVal body As Func(Of Integer, ParallelLoopState, TLocal, TLocal), ByVal localFinally As Action(Of TLocal)) As ParallelLoopResult

The result of the initialization routine is passed to the first iteration run by that task, the output of that iteration is passed to the next iteration, the output of that iteration is passed to the next, and so on, until finally the last iteration passes its result to the localFinally delegate.

Patterns of Parallel Programming

Page 68

Parallel.For Task 1 localInit Iteration A Iteration B ... Iteration N localFinally ... Task N localInit Iteration A Iteration B ... Iteration N localFinally

In this manner, a partial result can be built up on each task, and only combined with the partials from other tasks at the end. Our Pi example can thusly be implemented as follows:

Visual Basic
Private Shared Function ParallelPi() As Double Dim sum As Double = 0.0R Dim [step] As Double = 1.0R / CDbl(NUM_STEPS) Dim obj As New Object() Parallel.[For](0, NUM_STEPS, Function() 0.0R, Function(i, state, [partial]) Dim x = (i + 0.5) * [step] Return [partial] + 4.0R / (1.0R + x * x) End Function, Sub([partial]) SyncLock obj sum += [partial] End SyncLock End Sub) Return [step] * sum End Function

The localInit delegate returns an initialized value of 0.0. The body delegate calculates its iterations result, adds it to the partial result it was passed in (which either directly from the result of localInit or from the previous iteration on the same task), and returns the updated partial. The localFinally delegate takes the completed partial, and only then synchronizes with other threads to combine the partial sum into the total sum. Earlier in this document we saw the performance ramifications of having a very small delegate body. This Pi calculation is an example of that case, and thus we can likely achieve better performance using the batching pattern described previously.

Patterns of Parallel Programming

Page 69

Visual Basic
Private Shared Function ParallelPartitionerPi() As Double Dim sum As Double = 0.0R Dim [step] As Double = 1.0R / CDbl(NUM_STEPS) Dim obj As New Object() Parallel.ForEach(Partitioner.Create(0, NUM_STEPS), Function() 0.0R, Function(range, state, [partial]) For i = range.Item1 To range.Item2 - 1 Dim x = (i + 0.5) * [step] [partial] += 4.0R / (1.0R + x * x) Next Return [partial] End Function, Sub([partial]) SyncLock obj sum += [partial] End SyncLock End Sub) Return [step] * sum End Function

PLINQ AGGREGATIONS
Any time you find yourself needing to aggregate, think PLINQ. For many problems, aggregation is one of several areas in which PLINQ excels, with a plethora of aggregation support built-in.

TOARRAY / TOLIST / TODICTIONARY / TOLOOKUP

As does LINQ to Objects, PLINQ provides four To* methods that may be used to aggregate all of the output from a query into a single data structure. PLINQ internally handles all of the relevant synchronization. For example, here is the previous example of storing all results into a List<T>:

Visual Basic
Dim output = New List(Of TOutput)() For Each item In input Dim result = Compute(item) output.Add(result) Next item

This may be converted to a LINQ implementation as follows:

Visual Basic
Dim output = input .Select(Function(item) Compute(item)) .ToList()

And then it can be parallelized with PLINQ:

Visual Basic
Dim output = input.AsParallel() .Select(Function(item) Compute(item))

Patterns of Parallel Programming

Page 70

.ToList()

In fact, not only does PLINQ handle all of the synchronization necessary to do this aggregation safely, it can also be used to automatically regain the ordering we lost in our parallelized version when using Parallel.ForEach:

Visual Basic
Dim output = input.AsParallel().AsOrdered() .Select(Function(item) Compute(item)) .ToList()

SINGLE-VALUE AGGREGATIONS
Just as LINQ and PLINQ are useful for aggregating sets of output, they are also quite useful for aggregating down to a single value, with operators including but not limited to Average, Sum, Min, Max, and Aggregate. As an example, the same Pi calculation can be done using LINQ:

Visual Basic
Shared Function SerialLinqPi() As Double Dim [step] = 1.0 / CDbl(NUM_STEPS) Return Enumerable.Range(0, NUM_STEPS).Select( Function(i) Dim x = (i + 0.5) * [step] Return 4.0 / (1.0 + x * x) End Function).Sum() * [step] End Function

With a minimal modification, PLINQ can be used to parallelize this:

Visual Basic
Shared Function ParallelLinqPi() As Double Dim [step] = 1.0 / CDbl(NUM_STEPS) Return ParallelEnumerable.Range(0, NUM_STEPS).Select( Function(i) Dim x = (i + 0.5) * [step] Return 4.0 / (1.0 + x * x) End Function).Sum() * [step] End Function

This parallel implementation does scale nicely as compared to the serial LINQ version. However, if you test the serial LINQ version and compare its performance against the previously shown serial for loop version, youll find that the serial LINQ version is significantly more expensive; this is largely due to all of the extra delegate invocations involved in its execution. We can create a hybrid solution that utilizes PLINQ to creation partitions and sum partial results but creates the individual partial results on each partition using a for loop:

Visual Basic
Shared Function ParallelPartitionLinqPi() As Double Dim [step] = 1.0 / CDbl(NUM_STEPS) Return Partitioner.Create(0, NUM_STEPS).AsParallel().Select( Function(range) Dim [partial] = 0.0 For i = range.Item1 To range.Item2 - 1

Patterns of Parallel Programming

Page 71

Dim x = (i + 0.5) * [step] [partial] += 4.0 / (1.0 + x * x) Next i Return [partial] End Function).Sum() * [step] End Function

AGGREGATE
Both LINQ and PLINQ may be used for arbitrary aggregations using the Aggregate method. Aggregate has several overloads, including several unique to PLINQ that provide more support for parallelization. PLINQ assumes that the aggregation delegates are both associative and commutative; this limits the kinds of operations that may be performed, but also allows PLINQ to optimize its operation in ways that wouldnt otherwise be possible if it couldnt make these assumptions. The most advanced PLINQ overload of Aggregate is very similar in nature and purpose to the Parallel.ForEach overload that supports localInit and localFinally delegates:

Visual Basic
<Extension()> Public Function Aggregate(Of TSource, TAccumulate, TResult)( ByVal source As ParallelQuery(Of TSource), ByVal seedFactory As Func(Of TAccumulate), ByVal updateAccumulatorFunc As Func(Of TAccumulate, TSource, TAccumulate), ByVal combineAccumulatorsFunc As Func( Of TAccumulate, TAccumulate, TAccumulate), ByVal resultSelector As Func(Of TAccumulate, TResult)) As TResult

The seedFactory delegate is the logical equivalent of localInit, executed once per partition to provide a seed for the aggregation accumulator on that partition. The updateAccumulatorFunc is akin to the body delegate, provided with the current value of the accumulator and the current element, and returning the updated accumulator value based on incorporating the current element. The combineAccumulatorsFunc is logically equivalent to the localFinally delegate, combining the results from multiple partitions (unlike localFinally, which is given the current tasks final value and may do with it what it chooses, this delegate accepts two accumulator values and returns the aggregation of the two). And finally, the resultSelector takes the total accumulation and processes it into a result value. In many scenarios, TAccumulate will be TResult, and this resultSelector will simply return its input. As a concrete case for where this aggregation operator is useful, consider a common pattern: the need to take the best N elements output from a query. An example of this might be in a spell checker. Given an input word list, compare the input text against each word in the dictionary and compute a distance metric between the two. We then want to select out the best results to be displayed to the user as options. One approach to implementing this with PLINQ would be as follows:

Visual Basic
Dim bestResults = dictionaryWordList .Select(Function(word) New With { Key .Word = word, Key .Distance = GetDistance(word, Text)}) .TakeTop(Function(p) -p.Distance, NUM_RESULTS_TO_RETURN) .Select(Function(p) p.Word) .ToList()

Patterns of Parallel Programming

Page 72

In the previous example, TakeTop is implemented as:

Visual Basic
<Extension()> Public Function TakeTop(Of TSource, TKey)( ByVal source As ParallelQuery(Of TSource), ByVal keySelector As Func(Of TSource, TKey), ByVal count As Integer) As IEnumerable(Of TSource) Return source.OrderBy(keySelector).Take(count) End Function

The concept of take the top N here is implemented by first sorting all of the result using OrderBy and then taking the first N results. This may be overly expensive, however. For a large word list of several hundred thousand words, were forced to sort the entire result set, and sorting has relatively high computational complexity. If were only selecting out a handful of results, we can do better. For example, in a sequential implementation we could simply walk the result set, keeping track of the top N along the way. We can implement this in parallel by walking each partition in a similar manner, keeping track of the best N from each partition. An example implementation of this approach is included in the Parallel Extensions samples at http://code.msdn.microsoft.com/ParExtSamples, and the relevant portion is shown here:

Visual Basic
Return source.Aggregate( seedFactory Function() New SortedTopN(Of TKey, TSource)(count), updateAccumulatorFunc Function(accum, item) accum.Add(keySelector(item), item) Return accum End Function, combineAccumulatorFunc Function(accum1, accum2) For Each item In accum2 accum1.Add(item) Next Return accum1 End Function, resultSelector Function(accum) accum.Values())

The seedFactory delegate, called once for each partition, generates a new data structure to keep track of the top count items added to it. Up until count items, all items added to the collection get stored. Beyond that, every time a new item is added, its compared against the least item currently stored, and if its greater than it, the least item is bumped out and the new item is stored in its place. The updateAccumulatorFunc simply adds the current item to the data structure accumulator (according to the rules of only maintaining the top N). The combineAccumulatorsFunc combines two of these data structures by adding all of the elements from one to the other and then returning that end result. And the resultSelector simply returns the set of values from the ultimate resulting accumulator.

Patterns of Parallel Programming

Page 73

MAPREDUCE
The MapReduce pattern was introduced to handle large-scale computations across a cluster of servers, often involving massive amounts of data. The pattern is relevant even for a single multi-core machine, however. Here is a description of the patterns core algorithm: The computation takes a set of input key/value pairs, and produces a set of output key/ value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's Reduce function via an iterator. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008), 107-113. DOI= http://doi.acm.org/10.1145/1327452.1327492

IMPLEMENTING MAPREDUCE WITH PLINQ

The core MapReduce pattern (and many variations on it) is easily implemented with LINQ, and thus with PLINQ. To see how, well break apart the description of the problem as shown previously. The description of the Map function is that it takes a single input value and returns a set of mapped values: this is the purpose of LINQs SelectMany operator, which is defined as follows:

Visual Basic
<Extension()> Public Function SelectMany(Of TSource, TResult)( ByVal source As IEnumerable(Of TSource), ByVal selector As Func(Of TSource, IEnumerable(Of TResult))) As IEnumerable(Of TResult)

Moving on, the MapReduce problem description highlights that results are then grouped according to an intermediate key. That grouping operation is the purpose of the LINQ GroupBy operator:

Visual Basic
<Extension()> Public Function GroupBy(Of TSource, TKey)( ByVal source As IEnumerable(Of TSource), ByVal keySelector As Func(Of TSource, TKey)) As IEnumerable(Of IGrouping(Of TKey, TSource))

Finally, a reduction is performed by a function that takes each intermediate key and a set of values for that key, and produces any number of outputs per key. Again, thats the purpose of SelectMany. We can put all of this together to implement MapReduce in LINQ: Patterns of Parallel Programming Page 74

Visual Basic
<Extension()> Public Function MapReduce(Of TSource, TMapped, TKey, TResult)( ByVal source As IEnumerable(Of TSource), ByVal map As Func(Of TSource, IEnumerable(Of TMapped)), ByVal keySelector As Func(Of TMapped, TKey), ByVal reduce As Func(Of IGrouping(Of TKey, TMapped), IEnumerable(Of TResult))) As IEnumerable(Of TResult) Return source.SelectMany(map) .GroupBy(keySelector) .SelectMany(reduce) End Function

Parallelizing this new combined operator with PLINQ is as simply as changing the input and output types to work with PLINQs ParallelQuery<> type instead of with LINQs IEnumerable<>:

Visual Basic
<Extension()> Public Function MapReduce(Of TSource, TMapped, TKey, TResult)( ByVal source As ParallelQuery(Of TSource), ByVal map As Func(Of TSource, IEnumerable(Of TMapped)), ByVal keySelector As Func(Of TMapped, TKey), ByVal reduce As Func(Of IGrouping(Of TKey, TMapped), IEnumerable(Of TResult))) As ParallelQuery(Of TResult) Return source.SelectMany(map).GroupBy(keySelector).SelectMany(reduce) End Function

USING MAPREDUCE
The typical example used to demonstrate a MapReduce implementation is a word counting routine, where a bunch of documents are parsed, and the frequency of all of the words across all of the documents is summarized. For this example, the map function takes in an input document and outputs all of the words in that document. The grouping phase groups all of the identical words together, such that the reduce phase can then count the words in each group and output a word/count pair for each grouping:

Visual Basic
Dim files = Directory.EnumerateFiles(dirPath, "*.txt").AsParallel() Dim counts = files.MapReduce( Function(path) File.ReadLines(path).SelectMany( Function(line) line.Split(delimiters)), Function(word) word, Function(group) {New KeyValuePair(Of String, Integer)( group.Key, group.Count())})

The tokenization here is done in a nave fashion using the String.Split function, which accepts the list of characters to use as delimiters. For this example, that list was generated using another LINQ query that generates an array of all of the ASCII white space and punctuation characters:

Visual Basic
Shared delimiters() As Char = Enumerable.Range(0, 256).Select(Function(i) ChrW(i)) .Where(Function(c) Char.IsWhiteSpace(c) OrElse Char.IsPunctuation(c)) .ToArray()

Patterns of Parallel Programming

Page 75

DEPENDENCIES
A dependency is the Achilles heel of parallelism. A dependency between two operations implies that one operation cant run until the other operation has completed, inhibiting parallelism . Many real-world problems have implicit dependencies, and thus its important to be able to accommodate them and extract as much parallelism as is possible. With the producer/consumer pattern, weve already explored one key solution to specific kinds of dependencies. Here well examine others.

DIRECTED ACYCLIC GRAPHS

Its very common in real-world problems to see patterns where dependencies between components form a directed acyclic graph (DAG). As an example of this, consider compiling a solution of eight code projects. Some projects have references to other projects, and thus depend on those projects being built first. The dependencies are as follows: Components 1, 2, and 3 depend on nothing else in the solution. Component 4 depends on 1. Component 5 depends on 1, 2, and 3. Component 6 depends on 3 and 4. Component 7 depends on 5 and 6 and has no dependencies on it. Component 8 depends on 5 and has no dependencies on it.

This set of dependencies forms the following DAG (as rendered by the new Architecture tools in Visual Studio 2010):

If building each component is represented as a Task, we can take advantage of continuations to express as much parallelism as is possible:

Visual Basic
Dim f = Task.Factory; Dim build1 = f.StartNew(Sub() Build(project1)); Dim build2 = f.StartNew(Sub() Build(project2));

Patterns of Parallel Programming

Page 76

Dim build3 = f.StartNew(Sub() Build(project3)); Dim build4 = f.ContinueWhenAll({build1}, Sub(tasks) Build(project4)); Dim build5 = f.ContinueWhenAll({build1, build2, build3}, Sub(tasks) Build(project5)); Dim build6 = f.ContinueWhenAll({build3, build4}, Sub(tasks) Build(project6)); Dim build7 = f.ContinueWhenAll({build5, build6}, Sub(tasks) Build(project7)); Dim build8 = f.ContinueWhenAll({build5}, Sub(tasks) Build(project8)); Task.WaitAll(build1, build2, build3, build4, build5, build6, build7, build8);

With this code, we immediately queue up work items to build the first three projects. As those projects complete, projects with dependencies on them will be queued to build as soon as all of their dependencies are satisfied.

ITERATING IN LOCK STEP

A common pattern in many algorithms is to have a series of operations that need to be done, from 0 to N, where step i+1 cant realistically be processed until step i has completed. This often occurs in image processing algorithms, where processing one scan line of the image depends on the previous scan line having already been processed. This also frequently occurs in analysis of a system over time, where each iteration represents another step forward in time, and the world at iteration i+1 depends on the state of the world after iteration i. An example of the latter is in simple modeling of the dissipation of heat across a metal plate, exemplified by the following sequential code:

Visual Basic
Private Function SequentialSimulation( ByVal plateSize As Integer, ByVal timeSteps As Integer) As Single(,) ' Initial plates for previous and current time steps, with ' heat starting on one side. Dim prevIter = New Single(plateSize - 1, plateSize - 1) {} Dim currIter = New Single(plateSize - 1, plateSize - 1) {} For y = 0 To plateSize - 1 prevIter(y, 0) = 255.0F Next y ' Run simulation For [step] = 0 To timeSteps - 1 For y = 1 To plateSize - 2 For x = 1 To plateSize - 2 currIter(y, x) = ((prevIter(y, x - 1) + prevIter(y, x + 1) + prevIter(y - 1, x) + prevIter(y + 1, x)) * 0.25F) Next x Next y Swap(prevIter, currIter) Next [step] ' Return results Return prevIter End Function Private Shared Sub Swap(Of T)(ByRef one As T, ByRef two As T) Dim tmp As T = one

Patterns of Parallel Programming

Page 77

one = two two = tmp End Sub

On close examination, youll see that this can actually be expressed as a DAG, since the cell [y,x] for time step i+1 can be computed as soon as the cells [y,x-1], [y,x+1], [y-1,x], and [y+1,x] from time step i are completed. However, attempting this kind of parallelization can lead to significant complications. For one, the amount of computation required per cell is very small, just a few array accesses, additions, and multiplications; creating a new Task for such an operation is respectively a lot of overhead. Another significant complication is around memory management. In the serial scheme shown, we only need to maintain two plate arrays, one storing the previous iteration and one storing the current. Once we start expressing the problem as a DAG, we run into issues of potentially needing plates (or at least portions of plates) for many generations. An easier solution is simply to parallelize one or more of the inner loops, but not the outer loop. In effect, we can parallelize each step of the simulation, just not all time steps of the simulation concurrently:

Visual Basic
' Run simulation For [step] = 0 To timeSteps - 1 Parallel.[For](1, plateSize - 1, Sub(y) For x = 1 To plateSize - 2 currIter(y, x) = ((prevIter(y, x - 1) + prevIter(y, x + 1) + prevIter(y - 1, x) + prevIter(y + 1, x)) * 0.25F) Next End Sub) Swap(prevIter, currIter) Next

Typically, this approach will be sufficient. For some kinds of problems, however, it can be more efficient (largely for reasons of cache locality) to ensure that the same thread processes the same sections of iteration space on each time step. We can accomplish that by using Tasks directly, rather than by using Parallel.For. For this heated plate example, we spin up one Task per processor and assign each a portion of the plates size; each Task is responsible for processing that portion at each time step. Now, we need some way of ensuring that each Task does not go on to process its portion of the plate at iteration i+1 until all tasks have completed processing iteration i. For that purpose, we can use the System.Threading.Barrier class thats new to the .NET Framework 4:

Visual Basic
' Run simulation Dim numTasks = Environment.ProcessorCount Dim tasks = New Task(numTasks - 1) {} Dim stepBarrier = New Barrier(numTasks, Sub(b) Swap(prevIter, currIter)) Dim chunkSize = (plateSize - 2) / numTasks For i = 0 To numTasks - 1 Dim yStart = 1 + (chunkSize * i) Dim yEnd = If((i = numTasks - 1), plateSize - 1, yStart + chunkSize) tasks(i) = Task.Factory.StartNew( Sub() For [step] = 0 To timeSteps - 1 For y = yStart To yEnd - 1 For x = 1 To plateSize - 2 currIter(y, x) = ((prevIter(y, x - 1) + prevIter(y, x + 1) + prevIter(y - 1, x) + prevIter(y + 1, x)) * 0.25F)

Patterns of Parallel Programming

Page 78

Next Next stepBarrier.SignalAndWait() Next End Sub) Next Task.WaitAll(tasks)

Each Task calls the Barriers SignalAndWait method at the end of each time step, and the Barrier ensures that no tasks progress beyond this point in a given iteration until all tasks have reached this point for that iteration. Further, because we need to swap the previous and current plates at the end of every time step, we register that swap code with the Barrier as a post-phase action delegate; the Barrier will run that code on one thread once all Tasks have reached the Barrier in a given iteration and before it releases any Tasks to the next iteration.

DYNAMIC PROGRAMMING
Not to be confused with dynamic languages or with Visual Basics and C#s for dynamic invocation, dynamic programming in computer science is a classification for optimization algorithms that break down problems recursively into smaller problems, caching (or memoizing) the results of those subproblems for future use, rather than recomputing them every time theyre needed . Common dynamic programming problems include longest common subsequence, matrix-chain multiplication, string edit distance, and sequence alignment. Dynamic programming problems are ripe with dependencies, but these dependencies can be bested and typically dont prevent parallelization. To demonstrate parallelization of a dynamic programming program, consider a simple implementation of the Levenshtein edit distance algorithm:

Visual Basic
Shared Function EditDistance(ByVal s1 As String, ByVal s2 As String) As Integer Dim dist(s1.Length, s2.Length) For i = 0 To s1.Length dist(i, 0) = i Next i For j = 0 To s2.Length dist(0, j) = j Next j For i = 1 To s1.Length For j = 1 To s2.Length dist(i, j) = If((s1.Chars(i - 1) = s2.Chars(j - 1)), dist(i - 1, j - 1), 1 + Math.Min(dist(i - 1, j), Math.Min(dist(i, j - 1), dist(i - 1, j - 1)))) Next j Next i Return dist(s1.Length, s2.Length) End Function

This algorithm builds up a distance matrix, where the [i,j] entry represents the number of operations it would take to transform the first i characters of string s1 into the first j characters of s2; an operation is defined as a single character substitution, insertion, or deletion. To see how this works in action, consider computing the distance

Patterns of Parallel Programming

Page 79

between two strings, going from PARALLEL" to STEPHEN. We start by initializing the first row to the values 0 through 8 these represent deletions (going from P to requires 1 deletion, going from PA to requires 2 deletions, going from PAR to requires 3 deletions, and so on). We also initialize the first column to the values 0 through 7 these represent additions (going from to STEP requires 4 additions, going from to STEPHEN requires 7 additions, and so on). P 0 S T E P H E N 1 2 3 4 5 6 7 1 A 2 R 3 A 4 L 5 L 6 E 7 L 8

Now starting from cell [1,1] we walk down each column, calculating each cells value in order. Lets call the two strings s1 and s2. A cells value is based on two potential options: 1. The two characters corresponding with that cell are the same. The value for this cell is the same as the value for the diagonally previous cell, which represents comparing each of the two strings without the current letter (for example, if we already know the value for comparing STEPH and PARALL, the value for STEPHE and PARALLE is the same, as we added the same letter to the end of both strings, and thus the distance doesnt change). The two characters corresponding with that cell are different. The value for this cell is the minimum of three potential operations: a deletion, a substitution, or an insertion. These are represented by adding 1 to the value retrieved from the cells immediately above, diagonally to the upper-left, and to the left.

As an exercise, try filling in the table. The completed table for PARALLEL and STEPHEN is as follows:

Patterns of Parallel Programming

Page 80

P 0 S T E P H E N 1 2 3 4 5 6 7 1 1 2 3 3 4 5 6

A 2 2 2 3 4 4 5 6

R 3 3 3 3 4 5 5 6

A 4 4 4 4 4 5 6 6

L 5 5 5 5 5 5 6 7

L 6 6 6 6 6 6 6 7

E 7 7 7 6 7 7 6 7

L 8 8 8 7 7 8 7 7

As you filled it in, you should have noticed that the numbers were filled in almost as if a wavefront were moving through the table, since a cell [i,j] can be filled in as soon as the three cells [i-1,j-1], [i-1,j], and [i,j-1] are completed (and in fact, the completion of the cell above and to the left implies that the diagonal cell was also completed). From a parallel perspective, this should sound familiar, harkening back to our discussion of DAGs. We could, in fact, parallelize this problem using one Task per cell and multi-task continuations, but as with previous examples on dependencies, theres very little work being done per cell, and the overhead of creating a task for each cell would significantly outweigh the value of doing so. Youll notice, however, that there are macro versions of these micro problems: take any rectangular subset of the cells in the grid, and that rectangular subset can be completed when the rectangular block above it and to its left have completed. This presents a solution: we can block the entire matrix up into rectangular regions, run the algorithm over each block, and use continuations for dependencies between blocks. This amortizes the cost of the parallelization with tasks across all of the cells in each block, making a Task worthwhile as long as the block is big enough. Since the macro problem is the same as the micro, we can write one routine to work with this general pattern, dubbed the wavefront pattern; we can then write a small routine on top of it to deal with b locking as needed. Heres an implementation based on Tasks and continuations:

Visual Basic
Shared Sub Wavefront( ByVal numRows As Integer, ByVal numColumns As Integer, ByVal processRowColumnCell As Action(Of Integer, Integer)) ' ... Would validate arguments here ' Store the previous row of tasks as well as the previous task ' in the current row. Dim prevTaskRow(numColumns - 1) As Task Dim prevTaskInCurrentRow As Task = Nothing Dim dependencies = New Task(1) {} ' Create a task for each cell. For row = 0 To numRows - 1

Patterns of Parallel Programming

Page 81

prevTaskInCurrentRow = Nothing For column = 0 To numColumns - 1 ' In-scope locals for being captured in the task closures. Dim j = row, i = column ' Create a task with the appropriate dependencies. Dim curTask As Task If row = 0 AndAlso column = 0 Then ' Upper-left task kicks everything off, ' having no dependencies. curTask = Task.Factory.StartNew(Sub() processRowColumnCell(j, i)) ElseIf row = 0 OrElse column = 0 Then ' Tasks in the left-most column depend only on the task ' above them, and tasks in the top row depend only on ' the task to their left. Dim antecedent = If(column = 0, prevTaskRow(0), prevTaskInCurrentRow) curTask = antecedent.ContinueWith( Sub(p) ' Necessary only to propagate exceptions. p.Wait() processRowColumnCell(j, i) End Sub) Else ' row > 0 && column > 0 ' All other tasks depend on both the tasks above ' and to the left. dependencies(0) = prevTaskInCurrentRow dependencies(1) = prevTaskRow(column) curTask = Task.Factory.ContinueWhenAll(dependencies, Sub(ps) ' Necessary to propagate exceptions Task.WaitAll(ps) processRowColumnCell(j, i) End Sub) End If ' Keep track of the task just created for future iterations. prevTaskInCurrentRow = curTask prevTaskRow(column) = prevTaskInCurrentRow Next column Next row ' Wait for the last task to be done. prevTaskInCurrentRow.Wait() End Sub

While a non-trivial amount of code, its actually quite straightforward. We maintain an array of Tasks represented the previous row, and a Task represented the previous Task in the current row. We start by launching a Task to process the initial Task in the [0,0] slot, since it has no dependencies. We then walk each cell in each row, creating a continuation Task for each cell. In the first row or the first column, there is just one dependency, the previous cell in that row or the previous cell in that column, respectively. For all other cells, the continuation is based on the previous cell in both the current row and the current column. At the end, we just wait for the last Task to complete. With that code in place, we now need to support blocks, and we can layer another Wavefront function on top to support that:

Patterns of Parallel Programming

Page 82

Visual Basic
Shared Sub Wavefront( ByVal numRows As Integer, ByVal numColumns As Integer, ByVal numBlocksPerRow As Integer, ByVal numBlocksPerColumn As Integer, ByVal processBlock As Action(Of Integer, Integer, Integer, Integer)) ' ... Would validate arguments here ' Compute the size of each block. Dim rowBlockSize = numRows \ numBlocksPerRow Dim columnBlockSize = numColumns \ numBlocksPerColumn Wavefront(numBlocksPerRow, numBlocksPerColumn, Sub(row, column) Dim start_i = row * rowBlockSize Dim end_i = If(row < numBlocksPerRow - 1, start_i + rowBlockSize, numRows) Dim start_j = column * columnBlockSize Dim end_j = If(column < numBlocksPerColumn - 1, start_j + columnBlockSize, numColumns) processBlock(start_i, end_i, start_j, end_j) End Sub) End Sub

This code is much simpler. The function accepts the number of rows and number of columns, but also the number of blocks to use. The delegate now accepts four values, the starting and ending position of the block for both row and column. The function validates parameters, and then computes the size of each block. From there, it delegates to the Wavefront overload we previously implemented. Inside the delegate, it uses the provided row and column number along with the block size to compute the starting and ending row and column positions, and then passes those values down to the user-supplied delegate. With this Wavefront pattern implementation in place, we can now parallelize our EditDistance function with very little additional code:

Visual Basic
Shared Function ParallelEditDistance( ByVal s1 As String, ByVal s2 As String) As Integer Dim dist(s1.Length, s2.Length) For i = 0 To s1.Length dist(i, 0) = i Next i For j = 0 To s2.Length dist(0, j) = j Next j Dim numBlocks = Environment.ProcessorCount * 2 Wavefront(s1.Length, s2.Length, numBlocks, numBlocks, Sub(start_i, end_i, start_j, end_j) For i = start_i + 1 To end_i For j = start_j + 1 To end_j dist(i, j) = If((s1.Chars(i - 1) = s2.Chars(j - 1)), dist(i - 1, j - 1), 1 + Math.Min(dist(i - 1, j), Math.Min(dist(i, j - 1), dist(i - 1, j - 1)))) Next j

Patterns of Parallel Programming

Page 83

Next i End Sub) Return dist(s1.Length, s2.Length) End Function

For small strings, the parallelization overheads will outweigh any benefits. But for large strings, this parallelization approach can yield significant benefits.

FOLD AND SCAN

Sometimes a dependency is so significant, there is seemingly no way around it. One such example of this is a fold operation. A fold is typically of the following form:

Visual Basic
b(0) = a(0) For i = 1 To N - 1 b(i) = f(b(i - 1), a(i)) Next i

As an example, if the function f is addition and the input array is 1,2,3,4,5, the result of the fold will be 1,3,6,10,15. Each iteration of the fold operation is entirely dependent on the previous iteration, leaving little room for parallelism. However, as with aggregations, we can make an accommodation: if we guarantee that the f function is associative, that enables enough wiggle room to introduce some parallelism (many operations are associative, including the addition operation used as an example). With this restriction on the operation, its typically called a scan, or sometimes prefix scan. There are several ways a scan may be parallelized. An approach well show here is based on blocking. Consider wanting to parallel scan the input sequence of the numbers 1 through 20 using the addition operator on a quadcore machine. We can split the input into four blocks, and then in parallel, scan each block individually. Once that step has completed, we can pick out the top element from each block, and do a sequential, exclusive scan on just those four entries; in an exclusive scan, element b[i] is what element b[i-1] would have been in a regular (inclusive) scan, with b[0] initialized to 0. The result of this exclusive scan is that, for each block, we now have the accumulated value for the entry just before the block, and thus we can fold that value in to each element in the block. For that latter fold, again each block may be processed in parallel.

Patterns of Parallel Programming

Page 84

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Logically partition into blocks

11 12 13 14 15

16 17 18 19 20

Scan

10 15

13 21 30 40

11 23 36 50 65

16 33 51 70 90

Logically Gather Upper Entries

15 40 65 90 Exclusive Scan 0 15 55 120

Inclusive Scan Entry into Block

10 15

28 36 45 55

66 78 91 105 120

136 153 171 190 210

Here is an implementation of this algorithm. As with the heated plate example shown previously, were using one Task per block with a Barrier to synchronize all tasks across the three stages: 1. 2. 3. Scan each block in parallel. Do the exclusive scan of the upper value from each block serially. Scan the exclusive scan results into the blocks in parallel. One important thing to note about this parallelization is that it incurs significant overhead. In the sequential scan implementation, were executing the combiner function f approximately N times, where N is the number of entries. In the parallel implementation, were executing f approximately 2N times. As a result, while the operation may be parallelized, at least two cores are necessary just to break even. While there are several ways to enforce the serial nature of the second step, here were utilizing the Barriers postphase action delegate (a complete implementation is available at http://code.msdn.microsoft.com/ParExtSamples):

Patterns of Parallel Programming

Page 85

Visual Basic
Public Sub InclusiveScanInPlaceParallel(Of T)( ByVal arr() As T, ByVal [function] As Func(Of T, T, T)) Dim procCount = Environment.ProcessorCount Dim intermediatePartials(procCount - 1) As T Using phaseBarrier = New Barrier(procCount, Function() ExclusiveScanInPlaceSerial( intermediatePartials, [function], 0, intermediatePartials.Length)) ' Compute the size of each range. Dim rangeSize = arr.Length \ procCount Dim nextRangeStart = 0 ' Create, store, and wait on all of the tasks. Dim tasks = New Task(procCount - 1) {} Dim i = 0 Do While i < procCount ' Get the range for each task, then start it. Dim rangeNum = i Dim lowerRangeInclusive = nextRangeStart Dim upperRangeExclusive = If(i < procCount - 1, nextRangeStart + rangeSize, arr.Length) tasks(rangeNum) = Task.Factory.StartNew( Sub() ' Phase 1: Prefix scan assigned range. InclusiveScanInPlaceSerial(arr, [function], lowerRangeInclusive, upperRangeExclusive, 1) intermediatePartials(rangeNum) = arr(upperRangeExclusive - 1) ' Phase 2: One thread should prefix scan intermediaries. phaseBarrier.SignalAndWait() ' Phase 3: Incorporate partials. If rangeNum <> 0 Then For j = lowerRangeInclusive To upperRangeExclusive - 1 arr(j) = [function]( intermediatePartials(rangeNum), arr(j)) Next j End If End Sub) i += 1 nextRangeStart += rangeSize Loop Task.WaitAll(tasks) End Using End Sub

This demonstrates that parallelization may be achieved where dependences would otherwise appear to be an obstacle that cant be mitigated.

Patterns of Parallel Programming

Page 86

DATA SETS OF UNKNOWN SIZE

Most of the examples described in this document thus far center around data sets of known sizes: input arrays, input lists, and so forth. In many real-world problems, however, the size of the data set to be processed isnt known in advance. This may be because the data is coming in from an external source and hasnt all arrived yet , or it may because the data structure storing the data doesnt keep track of the size or doesnt store the data in a manner amenable to the size being relevant. Regardless of the reason, its important to be able to parallelize such problems.

STREAMING DATA
Data feeds are becoming more and more important in all areas of computing. Whether its a feed of ticker data from a stock exchange, a sequence of network packets arriving at a machine, or a series of mouse clicks being entered by a user, such data can be an important input to parallel implementations. Parallel.ForEach and PLINQ are the two constructs discussed thus far that work on data streams, in the form of enumerables. Enumerables, however, are based on a pull-model, such that both Parallel.ForEach and PLINQ are handed an enumerable from which they continually move next to get the next element . This is seemingly contrary to the nature of streaming data, where it hasnt all arrived yet, and comes in more of a push fashion rather than pull. However, if we think of this pattern as a producer/consumer pattern, where the streaming data is the producer and the Parallel.ForEach or PLINQ query is the consumer, a solution from the .NET Framework 4 becomes clear: we can use BlockingCollection. BlockingCollections GetConsumingEnumerable method provides an enumerable that can be supplied to either Parallel.ForEach or PLINQ. ForEach and PLINQ will both pull data from this enumerable, which will block the consumers until data is available to be processed. Conversely, as streaming data arrives in, that data may be added to the collection so that it may be picked up by the consumers.

Visual Basic
Private _streamingData As New BlockingCollection(Of T)() ' Parallel.ForEach Parallel.ForEach(_streamingData.GetConsumingEnumerablem, Sub(item) Process(item)) ' PLINQ Dim q = From item In _streamingData.GetConsumingEnumerable().AsParallel() ... Select item

There are several caveats to be aware of here, both for Parallel.ForEach and for PLINQ. Parallel.ForEach and PLINQ work on slightly different threading models in the .NET Framework 4. PLINQ uses a fixed number of threads to execute a query; by default, it uses the number of logical cores in the machine, or it uses the value passed to WithDegreeOfParallelism if one was specified. Conversely, Parallel.ForEach may use a variable number of threads, based on the ThreadPools support for injecting and retiring threads over time to best accommodate current workloads. For Parallel.ForEach, this means that its continually monitoring for new threads to be available to it, taking advantage of them when they arrive, and the ThreadPool is continually trying out injecting new threads into the pool and retiring threads from the pool to see whether more or fewer threads is beneficial. However, when passing the result of calling GetConsumingEnumerable as the data source to Parallel.ForEach, the threads used by the loop have the potential to block when the collection becomes empty. And a blocked thread may not be

Patterns of Parallel Programming

Page 87

released by Parallel.ForEach back to the ThreadPool for retirement or other uses. As such, with the code as shown above, if there are any periods of time where the collection is empty, the thread count in the process may steadily grow; this can lead to problematic memory usage and other negative performance implications. To address this, when using Parallel.ForEach in a streaming scenario, its best to place an explicit limit on the number of threads the loop may utilize: this can be done using the ParallelOptions type, and specifically its MaxDegreeOfParallelism field:

Visual Basic
Dim options = New ParallelOptions With {.MaxDegreeOfParallelism = 4} Parallel.ForEach(_streamingData.GetConsumingEnumerable(), options, Sub(item) Process(item))

By adding the bolded code above, the loop is now limited to at most four threads, avoiding the potential for significant thread consumption. Even if the collection is empty for a long period of time, the loop can block only four threads at most. PLINQ has a different set of caveats. It already uses a fixed number of threads, so thread injection isnt a concern . Rather, in the .NET Framework 4, PLINQ has an internally hardwired limit on the number of data elements in an 31 input data source that are supported: 2 , or 2,147,483,648. This means that PLINQ should only be used for streaming scenarios where fewer than this number of elements will be processed. In most scenarios, this limit should not be problematic. Consider a scenario where each element takes one millisecond to process. It would take at least 24 days at that rate of processing to exhaust this element space. If this limit does prove troublesome, 31 however, in many cases there is a valid mitigation. The limit of 2 elements is per execution of a query, so a potential solution is to simply restart the query after a certain number of items has been fed into the query. Consider a query of the form:

Visual Basic
_streamingData.GetConsumingEnumerable().AsParallel() .OtherOperators() .ForAll(Sub(x) Process(x))

We need two things, a loop around the query so that when one query ends, we start it over again, and an operator 31 that only yields the first N elements from the source, where N is chosen to be less than the 2 limit. LINQ already provides us with the latter, in the form of the Take operator. Thus, a workaround would be to rewrite the query as follows:

Visual Basic
While True _streamingData.GetConsumingEnumerable().Take(2000000000).AsParallel() .OtherOperators() .ForAll(x => Process(x)) End While

An additional caveat for PLINQ is that not all operators may be used in a streaming query, due to how those operators behave. For example, OrderBy performs a sort and releases items in sorted order. OrderBy has no way of knowing whether the items it has yet to consume from the source are less than the smallest item seem thus far, and thus it cant release any elements until its seen all elements from the source . With an infinite source, as is the case with a streaming input, that will never happen.

Patterns of Parallel Programming

Page 88

PARALLELWHILENOTEMPTY
Theres a fairly common pattern that emerges when processing some data structures: the processing of an element yields additional work to be processed. We can see this with the tree-walk example shown earlier in this document: processing one node of the tree may yield additional work to be processed in the form of that nodes children. Similarly in processing a graph data structure, processing a node may yield additional work to be processed in the form of that nodes neighbors. Several parallel frameworks include a construct focused on processing these kinds of workloads. No such construct is included in the .NET Framework 4, however its straightforward to build one. There are a variety of ways such a solution may be coded. Heres one:

Visual Basic
Public Shared Sub ParallelWhileNotEmpty(Of T)( ByVal initialValues As IEnumerable(Of T), ByVal body As Action(Of T, Action(Of T))) Dim From = New ConcurrentQueue(Of T)(initialValues) Dim [to] = New ConcurrentQueue(Of T)() Do While Not From.IsEmpty Dim addMethod As Action(Of T) = Sub(v) [to].Enqueue(v) Parallel.ForEach(From, Sub(v) body(v, addMethod)) From = [to] [to] = New ConcurrentQueue(Of T)() Loop End Sub

This solution is based on maintaining two lists of data: the data currently being processed (the from queue), and the data generated by the processing of the current data (the to queue). The initial values to be processed are stored into the first list. All those values are processed, and any new values they create are added to the second list. Then the second list is processed, and any new values that are produced go into a new list (or alternatively the first list cleared out). Then that list is processed, and... so on. This continues until the next list to be processed has no values available. With this in place, we can rewrite our tree walk implementation shown previously:

Visual Basic
Shared Sub Walk(Of T)(ByVal root As Tree(Of T), ByVal action As Action(Of T)) If root Is Nothing Then Return ParallelWhileNotEmpty({root}, Sub(item, adder) If item.Left IsNot Nothing Then adder(item.Left) If item.Right IsNot Nothing Then adder(item.Right) action(item.Data) End Sub) End Sub

Patterns of Parallel Programming

Page 89

ANTI-PATTERNS BLOCKING DEPENDENCIES BETWEEN PARTITIONED CHUNKS

As mentioned, there are several ways ParallelWhileNotEmpty could be implemented. Another approach to implementing ParallelWhileNotEmpty combines two parallel patterns weve previously seen in this document: counting up and down, and streaming. Simplistically, we can use a Parallel.ForEach over a BlockingCollections GetConsumingEnumerable, and allow the body of the ForEach to add more items into the BlockingCollection. The only thing missing, then, is the ability to mark the collection as complete for adding, which we only want to do after the last element has been processed (since the last element may result in more elements being added). To accomplish that, we keep track of the number of elements remaining to complete processing; every time the adder operation is invoked, we increase this count, and every time we complete the processing of an item we decrease it. If the act of decreasing it causes it to reach 0, were done, and we can mark the collection as complete for adding so that all threads involved in the ForEach will wake up.

Visual Basic
Shared Sub ParallelWhileNotEmpty(Of T)( ByVal source As IEnumerable(Of T), ByVal body As Action(Of T, Action(Of T))) Dim queue = New ConcurrentQueue(Of T)(source) If queue.IsEmpty Then Return Dim remaining = New CountdownEvent(queue.Count) Dim bc = New BlockingCollection(Of T)(queue) Dim adder As Action(Of T) = Sub(item) remaining.AddCount() bc.Add(item) End Sub Dim options = New ParallelOptions With { .MaxDegreeOfParallelism = Environment.ProcessorCount} Parallel.ForEach(bc.GetConsumingEnumerable(), options, Sub(item) Try body(item, adder) Finally If remaining.Signal() Then bc.CompleteAdding() End Try End Sub) End Sub

Unfortunately, this implementation has a devious bug in it, one that will likely result in deadlock close to the end of its execution such that ParallelWhileNotEmpty will never return. The issue has to do with partitioning. Parallel.ForEach uses multiple threads to process the supplied data source (in this case, the result of calling bc.GetConsumingEnumerable), and as such the data from that source needs to be dispensed to those threads. By default, Parallel.ForEach does this by having its threads take a lock, pull some number of elements from the source, release the lock, and then process those items. This is a performance optimization for the general case, where the number of trips back to the data source and the number of times the lock must be acquired and released is minimized. However, its then also very important that the processing of elements not have dependencies between them. Consider a very simple example: Patterns of Parallel Programming Page 90

Visual Basic
Dim mres = New ManualResetEventSlim() Parallel.ForEach(Enumerable.Range(0, 10), Sub(i) If i = 7 Then mres.[Set]() Else mres.Wait() End If End Sub)

Theoretically, this code could deadlock. All iterations have a dependency on iteration #7 executing, and yet the same thread that executed one of the other iterations may be the one destined to execute #7. To see this, consider a potential partitioning of the input data [0,10), where every thread grabs two elements at a time:

Here, the same thread grabbed both elements 6 and 7. It then processes 6, which immediately blocks waiting for an event that will only be set when 7 is processed, but 7 wont ever be processed, because the thread that would process it is blocked processing 6. Back to our ParallelWhileNotEmpty example, a similar issue exists there but is less obvious. The last element to be processed marks the BlockingCollection as complete for adding, which will cause any threads waiting on the empty collection to wake up, aware that no more data will be coming. However, threads are pulling multiple data elements from the source on each go around, and are not processing the elements from that chunk until the chunk contains a certain number of elements. Thus, a thread may grab what turns out to be the last element, but then continues to wait for more elements to arrive before processing it; however, only when that last element is processed will the collection signal to all waiting threads that there wont be any more data, and we have a deadlock. We can fix this by modifying the partitioning such that every thread only goes for one element at a time. That has the downside of resulting in more overhead per element, since each element will result in a lock being taken, but it has the serious upside of not resulting in deadlock. To control that, we can supply a custom partitioner that provides this functionality. The parallel programming samples for the .NET Framework 4, available for download at http://code.msdn.microsoft.com/ParExtSamples includes a ChunkPartitioner capable of yielding a single element at a time. Taking advantage of that, we get the following fixed solution:

Patterns of Parallel Programming

Page 91

Dim adder As Action(Of T) = Sub(t) remaining.AddCount() bc.Add(t) End Sub Dim options = New ParallelOptions With { .MaxDegreeOfParallelism = Environment.ProcessorCount} Parallel.ForEach(ChunkPartitioner.Create( bc.GetConsumingEnumerable(), 1), options, Sub(item) Try body(item, adder) Finally If remaining.Signal() Then bc.CompleteAdding() End Try End Sub) End Sub

Patterns of Parallel Programming

Page 92

SPECULATIVE PROCESSING
Speculation is the pattern of doing something that may not be needed in case it actually is needed. This is increasing relevant to parallel computing, where we can take advantage of multiple cores to do more things in advance of their actually being needed. Speculation trades off throughput for reduced latency, by utilizing resources to do more work in case that extra work could pay dividends.

THERE CAN BE ONLY ONE

There are many scenarios where multiple mechanisms may be used to compute a result, but how long each mechanism will take cant be predicted in advance. With serial computing, youre forced to pick one and hope that its the fastest. With parallel computing, we can theoretically run them all in parallel: once we have a winner, we can stop running the rest of the operations. We can encapsulate this functionality into a SpeculativeInvoke operation. SpeculativeInvoke will take a set of functions to be executed, and will start executing them in parallel until at least one returns.

Visual Basic
Public Shared Function SpeculativeInvoke(Of T)( ByVal ParamArray [functions]() As Func(Of T)) As T End Function

As mentioned earlier in the section on parallel loops, its possible to implement Invoke in terms of ForEach we can do the same here for SpeculativeInvoke:

Visual Basic
Public Shared Function SpeculativeInvoke(Of T)( ByVal ParamArray [functions]() As Func(Of T)) As T Return SpeculativeForEach([functions], Function([function]) [function]()); End Function

Now all we need is a SpeculativeForEach.

SPECULATIVEFOREACH USING PARALLEL.FOREACH

With ForEach, the goal is to process every item. With SpeculativeForEach, the goal is to get just one result, executing as many items as we can in parallel in order to get just one to return.

Visual Basic
Public Shared Function SpeculativeForEach(Of TSource, TResult)( ByVal source As IEnumerable(Of TSource), ByVal body As Func(Of TSource, TResult)) As TResult Dim result As Object = Nothing Parallel.ForEach(source, Sub(item, loopState) result = body(item) loopState.Stop() End Sub) Return CType(result, TResult) End Function

Patterns of Parallel Programming

Page 93

We take advantage of Parallel.ForEachs support for breaking out of a loop early, using ParallelLoopState.Stop. This tells the loop to try not to start any additional iterations. When we get a result from an iteration, we store it, request that the loop stop as soon as possible, and when the loop is over, return the result. A SpeculativeParallelFor could be implemented in a very similar manner. Note that we store the result as an object, rather than as a TResult. This is to accommodate value types. With multiple iterations executing in parallel, its possible that multiple iterations may try to write out a result concurrently. With reference types, this isnt a problem, as the CLR ensures that all of the data in a reference is written atomically. But with value types, we could potentially experience torn writes, where portions of the results from multiple iterations get written, resulting in an incorrect result. As noted, when an iteration completes it does not terminate other currently running iterations, it only works to prevent additional iterations from starting. If we want to update the implementation to also make it possible to cancel currently running iterations, we can take advantage of the .NET Framework 4 CancellationToken type. The idea is that well pass a CancellationToken into all functions, and the functions themselves may monitor for cancellation, breaking out early if cancellation was experienced.

Visual Basic
Public Function SpeculativeForEach(Of TSource, TResult)( ByVal source As IEnumerable(Of TSource), ByVal body As Func(Of TSource, CancellationToken, TResult)) As TResult Dim cts = New CancellationTokenSource() Dim result As Object = Nothing Parallel.ForEach(source, Sub(item, loopState) Try result = body(item, cts.Token) loopState.Stop() cts.Cancel() Catch ex As Exception End Try End Sub) Return CType(result, TResult) End Function

SPECULATIVEFOREACH USING PLINQ

We can also achieve this kind of speculative processing utilizing PLINQ. The goal of SpeculativeForEach is to select the result of the first function to complete, an operation which maps very nicely to PLINQs Select and First operators. We can thus re-implement SpeculativeForEach with very little PLINQ-based code:

Visual Basic
Public Shared Function SpeculativeForEach(Of TSource, TResult)( ByVal source As IEnumerable(Of TSource), ByVal body As Func(Of TSource, CancellationToken, TResult)) As TResult Dim cts = New CancellationTokenSource() Dim result As Object = Nothing

Patterns of Parallel Programming

Page 94

Parallel.ForEach(source, Sub(item, loopState) Try result = body(item, cts.Token) loopState.Stop() cts.Cancel() Catch ignored As OperationCanceledException End Try End Sub) Return CType(result, TResult) End Function

FOR THE FUTURE

The other large classification of speculative processing is around anticipation: an application can anticipate a need, and do some computation based on that guess. Prefetching, common in hardware and operating systems, is an example of this. Based on past experience and heuristics, the system anticipates that the program is going to need a particular resource and thus preloads that resource so that its available by the time its needed. If the system guessed correctly, the end result is improved perceived performance. Task<TResult> in the .NET Framework 4 makes it very straightforward to implement this kind of logic. When the system anticipates a particular computations result may be needed, it launches a Task<TResult> to compute the result.

Visual Basic
Dim cts = New CancellationTokenSource() Dim dataForThefuture As Task(Of Integer) = Task.Factory.StartNew( Function() ComputeSomeResult(), cts.Token)

If it turns out that result is not needed, the task may be canceled.

Visual Basic
Cancel it and make sure we are made aware of any exceptions that occurred. cts.Cancel() dataForTheFuture.ContinueWith(Function(t) LogException(dataForTheFuture), TaskContinuationOptions.OnlyOnFaulted)

If it turns out it is needed, its Result may be retrieved.

Visual Basic
This will return the value immediately if the Task has already completed, or will wait for the result to be available if its not yet completed. Dim result As Integer = dataForTheFuture.Result

Patterns of Parallel Programming

Page 95

LAZINESS
Programming is one of few professional areas where laziness is heralded. As we write software, we look for ways to improve performance, or at least to improve perceived performance, and laziness helps in both of these regards. Lazy evaluation is all about delaying a computation such that its not evaluated until its needed . In doing so, we may actually get away with never evaluating it at all, since it may never be needed. And other times, we can make the cost of evaluating lots of computations pay-for-play by only doing those computations when theyre needed and not before. (In a sense, this is the opposite of speculative computing, where we may start computations asynchronously as soon as we think they may be needed, in order to ensure the results are available if theyre needed.) Lazy evaluation is not something at all specific to parallel computing. LINQ is heavily based on a lazy evaluation model, where queries arent executed until MoveNext is called on an enumerator for the query. Many types lazilyload data, or lazily initialize properties. Where parallelization comes into play is in making it possible for multiple threads to access lazily-evaluated data in a thread-safe manner.

ENCAPSULATING LAZINESS
Consider the extremely common pattern for lazily-initializing some property on a type:

Visual Basic
Public Class MyLazy(Of T As Class) Private _value As T Public ReadOnly Property Value() As T Get If _value Is Nothing Then _value = Compute() Return _value End Get End Property Private Shared Function Compute() As T '... End Function End Class

Here, the _value field needs to be initialized to the result of some function Compute. _value could have been initialized in the constructor of MyLazy<T>, but that would have forced the user to incur the cost of computing _value, even if the Value property is never accessed. Instead, the Value propertys get accessor checks to see whether _value has been initialized, and if it hasnt, initializes it before returning _value. The initialization check happens by comparing _value to null, hence the class restriction on T, since a struct may never be null. Unfortunately, this pattern breaks down if the Value property may be accessed from multiple threads concurrently. There are several common patterns for dealing with this predicament. The first is through locking:

Visual Basic
Public Class MyLazy(Of T As Class) Private _syncObj As New Object()

Patterns of Parallel Programming

Page 96

Private _value As T Public ReadOnly Property Value() As T Get SyncLock _syncObj If _value Is Nothing Then _value = Compute() Return _value End SyncLock End Get End Property Private Shared Function Compute() As T '... End Function End Class

Now, the Value property is thread-safe, such that only one thread at a time will execute the body of the get accessor. Unfortunately, we also now force every caller of Value to accept the cost of taking a lock, even if Value has already previously been initialized. To work around that, theres the classic double -checked locking pattern:

Visual Basic
Public Class MyLazy(Of T As Class) Private _syncObj As New Object() Private _value As T Public Property Value() As T Get If _value Is Nothing Then SyncLock _syncObj If _value Is Nothing Then _value = Compute() End SyncLock End If Return _value End Get End Property Private Shared Function Compute() As T '... End Function End Class

This is starting to get complicated, with much more code having to be written than was necessary for the initial non-thread-safe version. Moreover, we havent factored in the complications of exception handling, supporting value types in addition to reference types (and having to deal with potential torn reads and torn writes) , cases where null is a valid value, and more. To simplify this, all aspects of the pattern, including the synchronization to ensure thread-safety, have been codified into the new .NET Framework 4 System.Lazy<T> type. We can re-write the code using Lazy<T> as follows:

Visual Basic
Public Class MyLazy(Of T) Private _value As New Lazy(Of T)(AddressOf Compute) Public ReadOnly Property Value() As T Get Return _value.Value

Patterns of Parallel Programming

Page 97

End Get End Property Private Shared Function Compute() As T '... End Function End Class

Lazy<T> supports the most common form of thread-safe initialization through a simple-to-use interface. If more control is needed, the Shared methods on System.Threading.LazyInitializer may be employed. The double-checked locking pattern supported by Lazy<T> is also supported by LazyInitializer, but through a single Shared method:

Visual Basic
Public Shared Function EnsureInitialized(Of T)( ByRef target As T, ByRef initialized As Boolean, ByRef [syncLock] As Object, ByVal valueFactory As Func(Of T)) As T

This overload allows the developer to specify the target reference to be initialized as well as a Boolean value that signifies whether initialization has been completed. It also allows the developer to explicitly specify the monitor object to be used for synchronization. Being able to explicitly specify the synchronization object allows multiple initialization routines and fields to be protected by the same lock. We can use this method to re-implement our previous examples as follows:

Visual Basic
Public Class MyLazy(Of T As Class) Private _syncObj As New Object() Private _initialized As Boolean Private _value As T Public ReadOnly Property Value() As T Get Return LazyInitializer.EnsureInitialized( _value, _initialized, _syncObj, AddressOf Compute) End Get End Property Private Shared Function Compute() As T '... End Function End Class

This is not the only pattern supported by LazyInitializer, however. Another less-common thread-safe initialization pattern is based on the principle that the initialization function is itself thread-safe, and thus its okay for it to be executed concurrently with itself. Given that property, we no longer need to use a lock to ensure that only one thread at a time executes the initialization function. However, we still need to maintain the invariant that the value being initialized is only initialized once. As such, while the initialization function may be run multiple times

Patterns of Parallel Programming

Page 98

concurrently in the case of multiple threads racing to initialize the value, one and only one of the resulting values must be published for all threads to see. If we were writing such code manually, it might look as follows:

Visual Basic
Public Class MyLazy(Of T As Class) Private _value As T Public ReadOnly Property Value() As T Get If _value Is Nothing Then Dim temp As T = Compute() Interlocked.CompareExchange(_value, temp, Nothing) End If Return _value End Get End Property Private Shared Function Compute() As T '... End Function End Class

LazyInitializer provides an overload to support this pattern as well:

Visual Basic
Public Function EnsureInitialized(Of T As Class)( ByRef target As T, ByVal valueFactory As Func(Of T)) As T

With this method, we can re-implement the same example as follows:

Visual Basic
Public Class MyLazy(Of T As Class) Private _value As T Public ReadOnly Property Value() As T Get Return LazyInitializer.EnsureInitialized(_value, AddressOf Compute) End Get End Property Private Shared Function Compute() As T '... End Function End Class

Its worth noting that in these cases, if the Compute function returns null, _value will be set to null, which is indistinguishable from Compute never having been run, and as a result the next time Values get accessor is invoked, Compute will be executed again.

Patterns of Parallel Programming

Page 99

ASYNCHRONOUS LAZINESS
Another common pattern centers around a need to lazily-initialize data asynchronously and to receive notification when the initialization has completed. This can be accomplished by marrying two types weve already seen: Lazy<T> and Task<TResult>. Consider an initialization routine:

Visual Basic
Function Compute() As T End Function

We can create a Lazy<T> to provide the result of this function:

Visual Basic
Dim data As New Lazy(Of T)(AddressOf Compute)

However, now when we access data.Value, were blocked waiting for the Compute operation to complete. Instead, for asynchronous lazy initialization, wed like to delay the computation until we know well need it, but once we do we also dont want to block waiting for it to complete. That latter portion should hint at using a Task<TResult>:

Visual Basic
Dim data As Task(Of T) = Task(Of T).Factory.StartNew(AddressOf Compute)

Combining the two, we can use a Lazy<Task<T>> to get both the delayed behavior and the asynchronous behavior:

Visual Basic
Dim data = New Lazy(Of Task(Of T))( Function() Task(Of T).Factory.StartNew(AddressOf Compute))

Now when we access data.Value, we get back a Task<T> that represents the running of Compute. No matter how many times we access data.Value, well always get back the same Task, even if accessed from multiple threads concurrently, thanks to support for the thread-safety patterns built into Lazy<T>. This means that only one Task<T> will be launched for Compute. Moreover, we can now use this result as we would any other Task<T>, including registering continuations with it (using ContinueWith) in order to be notified when the computation is complete:

Visual Basic
data.Value.ContinueWith(Function(t) Dim result As T= t.Result UseResult(result) End Function)

This approach can also be combined with multi-task continuations to lazily-initialize multiple items, and to only do work with those items when theyve all completed initialization:

Visual Basic
Private _data1 As New Lazy(Of Task(Of T))( Function() Task(Of T).Factory.StartNew(AddressOf Compute1))

Patterns of Parallel Programming

Page 100

Private _data2 As New Lazy(Of Task(Of T))( Function() Task(Of T).Factory.StartNew(AddressOf Compute2)) Private _data3 As New Lazy(Of Task(Of T))( Function() Task(Of T).Factory.StartNew(AddressOf Compute3)) '... Task.Factory.ContinueWhenAll( {_data1.Value, _data2.Value, _data3.Value}, Sub(tasks) UseResults( _data1.Value.Result, _data2.Value.Result, _data3.Value.Result))

Such laziness is also useful for certain patterns of caching, where we want to maintain a cache of these lazilyinitialized values. Consider a non-thread-safe cache like the following:

Visual Basic
Public Class Cache(Of TKey, TValue) Private ReadOnly _valueFactory As Func(Of TKey, TValue) Private ReadOnly _map As Dictionary(Of TKey, TValue) Public Sub New(ByVal valueFactory As Func(Of TKey, TValue)) If valueFactory Is Nothing Then Throw New ArgumentNullException("loader") _valueFactory = valueFactory _map = New Dictionary(Of TKey, TValue)() End Sub Public Function GetValue(ByVal key As TKey) As TValue If key Is Nothing Then Throw New ArgumentNullException("key") Dim val As TValue If Not _map.TryGetValue(key, val) Then val = _valueFactory(key) _map.Add(key, val) End If Return val End Function End Class

The cache is initialized with a function that produces a value based on a key supplied to it. Whenever the value of a key is requested from the cache, the cache returns the cached value for the key if one is available in the internal dictionary, or it generates a new value using the caches _valueFactory function, stores that value for later, and returns it. We now want an asynchronous version of this cache. Just like with our asynchronous laziness functionality, we can represent this as a Task<TValue> rather than simply as a TValue. Multiple threads will be accessing the cache concurrently, so we want to use a ConcurrentDictionary<TKey,TValue> instead of a Dictionary<TKey,TValue> (ConcurrentDictionary<> is a new map type available in the .NET Framework 4, supporting multiple readers and writers concurrently without corrupting the data structure).

Visual Basic
Public Class AsyncCache(Of TKey, TValue) Private ReadOnly _valueFactory As Func(Of TKey, Task(Of TValue)) Private ReadOnly _map As ConcurrentDictionary( Of TKey, Lazy(Of Task(Of TValue))) Public Sub New(ByVal valueFactory As Func(Of TKey, Task(Of TValue)))

Patterns of Parallel Programming

Page 101

If valueFactory Is Nothing Then Throw New ArgumentNullException("loader") _valueFactory = valueFactory _map = New ConcurrentDictionary(Of TKey, Lazy(Of Task(Of TValue)))() End Sub Public Function GetValue(ByVal key As TKey) As Task(Of TValue) If key Is Nothing Then Throw New ArgumentNullException("key") Return _map.GetOrAdd(key, Function(k) Return New Lazy(Of Task(Of TValue))(Function() _valueFactory(k)) End Function).Value End Function End Class

The function now returns a Task<TValue> instead of just TValue, and the dictionary stores Lazy<Task<TValue>> rather than just TValue. The latter is done so that if multiple threads request the value for the same key concurrently, only one task for that value will be generated. Note the GetOrAdd method on ConcurrentDictionary. This method was added in recognition of a very common coding pattern with dictionaries, exemplified in the earlier synchronous cache example. Its quite common to want to check a dictionary for a value, returning that value if it could be found, otherwise creating a new value, adding it, and returning it, as exemplified in the following example:

Visual Basic
<Extension()> Public Function GetOrAdd(Of TKey, TValue)( ByVal dictionary As Dictionary(Of TKey, TValue), ByVal key As TKey, ByVal valueFactory As Func(Of TKey, TValue)) As TValue Dim value As TValue If Not dictionary.TryGetValue(key, value) Then value = valueFactory(key) dictionary.Add(key, value) End If Return value End Function

This pattern has been codified into ConcurrentDictionary in a thread-safe manner in the form of the GetOrAdd method. Similarly, another coding pattern thats quite common with dictionaries is around checking for an existing value in the dictionary, updating that value if it could be found or adding a new one if it couldnt .

Visual Basic
<Extension()> Public Function AddOrUpdate(Of TKey, TValue)( ByVal dictionary As Dictionary(Of TKey, TValue), ByVal key As TKey, ByVal addValueFactory As Func(Of TKey, TValue), ByVal updateValueFactory As Func(Of TKey, TValue, TValue)) As TValue Dim value As TValue value = If(dictionary.TryGetValue(key, value), updateValueFactory(key, value), addValueFactory(key)) dictionary(key) = value Return value End Function

Patterns of Parallel Programming

Page 102

This pattern has been codified into ConcurrentDictionary in a thread-safe manner in the form of the AddOrUpdate method.

Patterns of Parallel Programming

Page 103

SHARED STATE
Dealing with shared state is arguably the most difficult aspect of building parallel applications and is one of the main sources of both correctness and performance problems. There are several ways of dealing with shared state, including synchronization, immutability, and isolation. With synchronization, the shared state is protected by mechanisms of mutual exclusion to ensure that the data remains consistent in the face of multiple threads accessing and modifying it. With immutability, shared data is read-only, and without being modified, theres no danger in sharing it. With isolation, sharing is avoided, with threads utilizing their own isolated state thats not available to other threads.

ISOLATION & THREAD-LOCAL STATE

Thread-local state is a very common mechanism for supporting isolation, and there are several reasons why you might want to use thread-local state. One is to pass information out-of-band between stack frames. For example, the System.Transactions.TransactionScope class is used to register some ambient information for the current thread, such that operations (for example, commands against a database) can automatically enlist in the ambient transaction. Another use of thread-local state is to maintain a cache of data per thread rather than having to synchronize on a shared data source. For example, if multiple threads need random numbers, each thread can maintain its own Random instance, accessing it freely and without concern for another thread accessing it concurrently; an alternative would be to share a single Random instance, locking on access to it. Thread-local state is exposed in the .NET Framework 4 in three different ways. The first way, and the most efficient, is through the ThreadStaticAttribute. By applying [ThreadStatic] to a Shared field of a type, that field becomes a thread-local static, meaning that rather than having one field of storage per AppDomain (as you would with a traditional static/Shared), theres one field of storage per thread per AppDomain . Hearkening back to our randomness example, you could imagine trying to initialize a ThreadStatic Random field as follows:

Visual Basic
<ThreadStatic()> Private Shared _rand As New Random() ' WARNING: buggy Shared Function GetRandomNumber() As Integer Return _rand.Next() End Function

Unfortunately, this wont work as expected. The C# and Visual Basic compilers extract initialization for static/Shared members into a Shared constructor for the containing type, and a Shared constructor is only run once. As such, this initialization code will only be executed for one thread in the system, leaving the rest of the threads with _rand initialized to null. To account for this, we need to check prior to accessing _rand to ensure its been initialized, invoking the initialization code on each access if it hasnt been:

Visual Basic
<ThreadStatic()> Private Shared _rand As Random Shared Function GetRandomNumber() As Integer If _rand Is Nothing Then _rand = New Random()

Patterns of Parallel Programming

Page 104

Return _rand.Next() End Function

Any thread may now call GetRandomNumber, and any number of threads may do so concurrently; each will end up utilizing its own instance of Random. Another issue with this approach is that, unfortunately, [ThreadStatic] may only be used with statics/Shareds. Applying this attribute to an instance member is a no-op, leaving us in search of another mechanism for supporting per-thread, per-instance state. Since the original release of the .NET Framework, thread-local storage has been supported in a more general form through the Thread.GetData and Thread.SetData Shared methods. The Thread.AllocateDataSlot and Thread.AllocateNamedDataSlot Shared methods may be used to create a new LocalDataStoreSlot, representing a single object of storage. The GetData and SetData methods can then be used to get and set that object for the current thread. Re-implementing our previous Random example could be done as follows:

Visual Basic
Private Shared _randSlot As LocalDataStoreSlot = Thread.AllocateDataSlot() Shared Function GetRandomNumber() As Integer Dim rand = CType(Thread.GetData(_randSlot), Random) If rand Is Nothing Then rand = New Random() Thread.SetData(_randSlot, rand) End If Return rand.Next() End Function

However, since our thread-local storage is now represented as an object (LocalDataStoreSlot) rather than as a Shared field, we can use this mechanism to achieve the desired per-thread, per-instance data:

Visual Basic
Public Class MyType Private _rand As LocalDataStoreSlot = Thread.AllocateDataSlot() Public Function GetRandomNumber() As Integer Dim r = CType(Thread.GetData(_rand), Random) If r Is Nothing Then r = New Random() Thread.SetData(_rand, r) End If Return r.Next() End Function End Class

While flexible, this approach also has downsides. First, Thread.GetData and Thread.SetData work with type Object rather than with a generic type parameter. In the best case, the data being stored is a reference type, and we only need to cast to retrieve data from a slot, knowing in advance what kind of data is stored in that slot. In the worst case, the data being stored is a value type, forcing an object allocation every time the data is modified, as the value type gets boxed when passed into the Thread.SetData method. Another issue is around performance. The ThreadStaticAttribute approach has always been significantly faster than the Thread.GetData/SetData approach, and while both mechanisms have been improved for the .NET Framework 4, the ThreadStaticAttribute approach is still an order of magnitude faster. Finally, with Thread.GetData/SetData, the reference to the storage and the

Patterns of Parallel Programming

Page 105

capability for accessing that storage are separated out into individual APIs, rather than being exposed in a convenient manner that combines them in an object-oriented manner. To address these shortcomings, the .NET Framework 4 introduces a third thread-local storage mechanism: ThreadLocal<T>. ThreadLocal<T> addresses the shortcomings outlined: ThreadLocal<T> is generic. Its Value property is typed as T and the data is stored in a generic manner. This eliminates the need to cast when accessing the value, and it eliminates the boxing that would otherwise occur if T were a value type. The constructor for ThreadLocal<T> optionally accepts a Func<T> delegate. This delegate can be used to initialize the thread-local value on every accessing thread. This alleviates the need to explicitly check on every access to ThreadLocal<T>.Value whether its been initialized yet. ThreadLocal<T> encapsulates both the data storage and the mechanism for accessing that storage. This simplifies the pattern of accessing the storage, as all thats required is to utilize the Value property. ThreadLocal<T>.Value is fast. ThreadLocal<T> has a sophisticated implementation based on ThreadStaticAttribute that makes the Value property more efficient than Thread.GetData/SetData.

ThreadLocal<T> is still not as fast as ThreadStaticAttribute, so if ThreadStaticAttribute fits your needs well and if access to thread-local storage is a bottleneck on your fast path, it should still be your first choice. Additionally, a single instance of ThreadLocal<T> consumes a few hundred bytes, so you need to consider how many of these you want active at any one time. Regardless of what mechanism for thread-local storage you use, if you need thread-local storage for several successive operations, its best to work on a local copy so as to avoid accessing thread -local storage as much as possible. For example, consider adding two vectors stored in thread-local storage:

Visual Basic
Private Const VECTOR_LENGTH As Integer = 1000000 Private _vector1 As New ThreadLocal(Of Integer())( Function() New Integer(VECTOR_LENGTH - 1) {}) Private _vector2 As New ThreadLocal(Of Integer())( Function() New Integer(VECTOR_LENGTH - 1) {}) ' ... Private Sub DoWork() For i = 0 To VECTOR_LENGTH - 1 _vector2.Value(i) += _vector1.Value(i) Next i End Sub

While the cost of accessing ThreadLocal<T>.Value has been minimized as best as possible in the implementation, it still has a non-negligible cost (the same is true for accessing ThreadStaticAttribute). As such, its much better to rewrite this code as follows:

Visual Basic
Private Dim Dim For Sub DoWork() vector1 = _vector1.Value vector2 = _vector2.Value i = 0 To VECTOR_LENGTH - 1 vector2(i) += vector1(i) Next i

Patterns of Parallel Programming

Page 106

_vector2.Value = vector2 End Sub

Returning now to our previous example of using a thread-local Random, we can take advantage of ThreadLocal<T> to implement this support in a much more concise manner:

Visual Basic
Public Class MyType Private _rand As New ThreadLocal(Of Random)(Function() New Random()) Public Function GetRandomNumber() As Integer Return _rand.Value.Next() End Function End Class

Earlier in this document, it was mentioned that the ConcurrentBag<T> data structure maintains a list of instances of T per thread. This is achieved internally using ThreadLocal<>.

SYNCHRONIZATION
In most explicitly-threaded parallel applications, no matter how much we try, we end up with some amount of shared state. Accessing shared state from multiple threads concurrently requires that either that shared state be immutable or that the application utilize synchronization to ensure the consistency of the data.

RELIABLE LOCK ACQUISITION

By far, the most prevalent pattern for synchronization in the .NET Framework is in usage of the lock keyword in C# and the SyncLock keyword in Visual Basic. Compiling down to usage of Monitor under the covers, this pattern manifests as follows:

Visual Basic
SyncLock someObject ' ... critical region of code End SyncLock

This code ensures that the work inside the critical region is executed by at most one thread at a time. In C# 3.0 and earlier and Visual Basic 9.0 and earlier, the above code was compiled down to approximately the equivalent of the following:

Visual Basic
Dim lockObj = someObject Monitor.Enter(lockObj) Try ' ... critical region of code Finally Monitor.[Exit](lockObj)

Patterns of Parallel Programming

Page 107

End Try

This code ensures that even in the case of exception, the lock is released when the critical region is done. Or at least its meant to. A problem emerges due to asynchronous exceptions: external influences may cause exceptions to occur on a block of code even if that exception is not explicitly stated in the code. In the extreme case, a thread abort may be injected into a thread between any two instructions, though not within a finally block except in extreme conditions. If such an abort occurred after the call to Monitor.Enter but prior to entering the try block, the monitor would never be exited, and the lock would be leaked. To help prevent against this, the just-in-time (JIT) compiler ensures that, as long as the call to Monitor.Enter is the instruction immediately before the try block, no asynchronous exception will be able to sneak in between the two. Unfortunately, its not always the case that these instructions are immediate neighbors. For example, in debug builds, the compiler uses nop instructions to support setting breakpoints in places that breakpoints would not otherwise be feasible. Worse, its often the case that developers want to enter a lock conditionally, such as with a timeout, and in such cases there are typically branching instructions between the call and entering the try block:

Visual Basic
If Monitor.TryEnter(someObject, 1000) Then Try ' ... critical region of code Finally Monitor.Exit(someObject) End Try Else '... End If

To address this, in the .NET Framework 4 new overloads of Monitor.Enter (and Monitor.TryEnter) have been added, supporting a new pattern of reliable lock acquisition and release:

Visual Basic
Public Shared Sub Enter(ByVal obj As Object, ByRef lockTaken As Boolean)

This overload guarantees that the lockTaken parameter is initialized by the time Enter returns, even in the face of asynchronous exceptions. This leads to the following new, reliable pattern for entering a lock:

Visual Basic
Dim lockTaken = False Try Monitor.Enter(someObject, lockTaken) ' ... critical region of code Finally If lockTaken Then Monitor.Exit(someObject) End Try

In fact, code similar to this is what the C# and Visual Basic compilers output in the .NET Framework 4 for the lock and SyncLock construct. This pattern applies equally to TryEnter, with only a slight modification:

Visual Basic
Dim lockTaken = False Try Monitor.TryEnter(someObject, 1000, lockTaken) If lockTaken Then

Patterns of Parallel Programming

Page 108

' ... critical region of code Else '... End If Finally If lockTaken Then Monitor.Exit(someObject) End Try

Note that the new System.Threading.SpinLock type also follows this new pattern, and in fact provides only the reliable overloads:

Visual Basic
Public Structure SpinLock Public Sub Enter(ByRef lockTaken As Boolean) Public Sub TryEnter(ByRef lockTaken As Boolean) Public Sub TryEnter(ByVal timeout As TimeSpan, ByRef lockTaken As Boolean) Public Sub TryEnter(ByVal millisecondsTimeout As Integer, ByRef lockTaken As Boolean) ' ... End Structure

With these methods, SpinLock is then typically used as follows:

Visual Basic
Private _lock As New SpinLock(enableThreadOwnerTracking:=False) ' ... Dim lockTaken As Boolean = False Try _lock.Enter(lockTaken) ' ... very small critical region here Finally If lockTaken Then _lock.Exit(useMemoryBarrier:=False) End Try

Alternatively, SpinLock may be used with TryEnter as follows:

Visual Basic
Dim lockTaken As Boolean = False Try _lock.TryEnter(lockTaken) If lockTaken Then ' ... very small critical region here Else ' ... End If Finally If lockTaken Then _lock.Exit(useMemoryBarrier:=False) End Try

The concept of a spin lock is that rather than blocking, it continually iterates through a loop (spinning), until the lock is available. This can lead to benefits in some cases, where contention on the lock is very infrequent, and where if there is contention, the lock will be available in very short order. This then allows the application to avoid costly kernel transitions and context switches, instead iterating through a loop a few times. When used at incorrect times, however, spin locks can lead to significant performance degradation in an application.

Patterns of Parallel Programming

Page 109

The constructor to SpinLock accepts an enableThreadOwnerTracking parameter, which default to true. This causes the SpinLock to keep track of which thread currently owns the lock, and can be useful for debugging purposes. This does, however, have an effect on the locks behavior when the lock is misused . SpinLock is not reentrant, meaning that a thread may only acquire the lock once. If thread holding the lock tries to enter it again, and if enableThreadOwnerTracking is true, the call to Enter will throw an exception. If enableThreadOwnerTracking is false, however, the call will deadlock, spinning forever. In general, if you need a lock, start with Monitor. Only if after performance testing do you find that Monitor isnt fitting the bill should SpinLock be considered. If you do end up using a SpinLock, inside the protected region you should avoid blocking or calling anything that may block, trying to acquire another lock, calling into unknown code (including calling virtual methods, interface methods, or delegates), and allocating memory. You should be able to count the number of instructions executed under a spin lock on two hands, with the total amount of CPU utilization in the protected region amounting to only tens of cycles.

MIXING EXCEPTIONS WITH LOCKS

As described, a lot of work has gone into ensuring that locks are properly released, even if exceptions occur within the protected region. This, however, isnt always the best behavior. Locks are used to make non-atomic sets of actions appear atomic, and thats often needed due to multiple statements making discrete changes to shared state. If an exception occurs inside of a critical region, that exception may leave shared data in an inconsistent state. All of the work weve done to ensure reliable lock release in the face of exceptions now leads to a problem: another thread may acquire the lock and expect state to be consistent, but find that its not. In these cases, we have a decision to make: is it better to allow threads to access potentially inconsistent state, or is it better to deadlock (which would be achievable by not releasing the lock, but by leaking it instead)? The answer really depends on the case in question. If you decide that leaking a lock is the best solution, instead of using the aforementioned patterns the following may be employed:

Visual Basic
Monitor.Enter(someObject) ' ... critical region Monitor.Exit(someObject)

Now if an exception occurs in the critical region, the lock will not be exited, and any other threads that attempt to acquire this lock will deadlock. Of course, due to the reentrancy supported by Monitor in the .NET Framework, if this same thread later attempts to enter the lock, it will succeed in doing so.

Patterns of Parallel Programming

Page 110

AVOIDING DEADLOCKS
Of all of the problems that may result from incorrect synchronization, deadlocks are one of the most well-known. There are four conditions required for a deadlock to be possible: 1. 2. 3. 4. Mutual exclusion. Only a limited number of threads may utilize a resource concurrently. Hold and wait. A thread holding a resource may request access to other resources and wait until it gets them. No preemption. Resources are released only voluntarily by the thread holding the resource. Circular wait. There is a set of {T1, , TN} threads, where T1 is waiting for a resource held by T2, T2 is waiting for a resource held by T3, and so forth, up through TN waiting for a resource held by T1.

If any one of these conditions doesnt hold, deadlock isnt possible . Thus, in order to avoid deadlock, we need to ensure that we avoid at least one of these. The most common and actionable condition to avoid in real-world code is #4, circular waits, and we can attack this condition in a variety of ways. One approach involves detecting that a cycle is about to occur. We can maintain a store of what threads hold what locks, and if a thread makes an attempt to acquire a lock that would lead to a cycle, we can prevent it from doing so; an example of this graph analysis is codified in the .NET Matters: Deadlock Monitor article at http://msdn.microsoft.com/enus/magazine/cc163352.aspx. There is another example in the article No More Hangs: Advanced Techniques To Avoid And Detect Deadlocks In .NET Apps by Joe Duffy at http://msdn.microsoft.com/enus/magazine/cc163618.aspx. That same article by Joe Duffy also includes an example of another approach: lock leveling. In lock leveling, locks are assigned numerical values, and the system tracks the smallest value lock held by a thread, only allowing the thread to acquire locks with smaller values than the smallest value it already holds; this prevents the potential for a cycle. In some cases, we can avoid cycles simply by sorting the locks utilized in some consistent way, and ensuring that if multiple locks need to be taken, theyre taken in sorted order (this is, in effect, a lock leveling scheme). We can see a simple example of this in an implementation of the classic dining philosophers problem. The dining philosophers problem was posited by Tony Hoare, based on previous examples from Edsger Dijkstra in the 1960s. The basic idea is that five philosophers sit around a table. Every philosopher has a plate of pasta, and between every pair of philosophers is a fork. To eat the pasta, a philosopher must pick up and use the forks on both sides of him; thus, if a philosophers neighbor is eating, the philosopher cant . Philosophers alternate between thinking and eating, typically for random periods of time.

Patterns of Parallel Programming

Page 111

We can represent each fork as a lock, and a philosopher must acquire both locks in order to eat. This would result in a solution like the following:

Visual Basic
' WARNING: THIS METHOD HAS A BUG Const NUM_PHILOSOPHERS = 5 Dim forks(NUM_PHILOSOPHERS - 1) As Object Dim philosophers = New Task(NUM_PHILOSOPHERS - 1) {} For i = 0 To NUM_PHILOSOPHERS - 1 Dim id = i philosophers(i) = Task.Factory.StartNew( Sub() Dim rand = New Random(id) Do ' Think Thread.Sleep(rand.Next(100, 1000)) ' Get forks Dim leftFork = forks(id) Dim rightFork = forks((id + 1) Mod NUM_PHILOSOPHERS) Monitor.Enter(leftFork) Monitor.Enter(rightFork) ' Eat Thread.Sleep(rand.Next(100, 1000)) ' Put down forks Monitor.Exit(rightFork)

Patterns of Parallel Programming

Page 112

Monitor.Exit(leftFork) Loop End Sub, TaskCreationOptions.LongRunning) Next i Task.WaitAll(philosophers)

Unfortunately, this implementation is problematic. If every philosopher were to pick up his left fork at the same time, all of the forks would be off the table. Each philosopher would then attempt to pick up the right fork and would need to wait indefinitely. This is a classic deadlock, following the exact circular wait condition previously described. To fix this, we can eliminate the cycle by ensuring that a philosopher first picks up the lower numbered fork and then the higher numbered fork, even if that means picking up the right fork first:

Visual Basic
Do ' Think Thread.Sleep(rand.Next(100, 1000)) ' Get forks in sorted order to avoid deadlock Dim firstForkId As Integer = id Dim secondForkId As Integer = (id + 1) Mod NUM_PHILOSOPHERS If secondForkId < firstForkId Then Swap(firstForkId, secondForkId) Dim firstFork = forks(firstForkId) Dim secondFork = forks(secondForkId) Monitor.Enter(firstFork) Monitor.Enter(secondFork) ' Eat Thread.Sleep(rand.Next(100, 1000)) ' Put down forks Monitor.Exit(secondFork) Monitor.Exit(firstFork) Loop

Another solution is to circumvent the second deadlock requirement, hold and wait, by utilizing the operating system kernels ability to acquire multiple locks atomically. To accomplish that, we need to forego usage of Monitor, and instead utilize one of the .NET Framework synchronization primitives derived from WaitHandle, such as Mutex. When we want to acquire both forks, we can then utilize WaitHandle.WaitAll to acquire both forks atomically. Using WaitAll, we block until weve acquired both locks, and no other thread will see us holding one lock but not the other.

Visual Basic
Const NUM_PHILOSOPHERS = 5 Dim forks() As Mutex = Enumerable.Range(0, NUM_PHILOSOPHERS) .Select(Function(i) New Mutex()) .ToArray() Dim philosophers = New Task(NUM_PHILOSOPHERS - 1) {} For i = 0 To NUM_PHILOSOPHERS - 1 Dim id = i philosophers(i) = Task.Factory.StartNew( Sub()

Patterns of Parallel Programming

Page 113

Dim rand = New Random(id) Do ' Think Thread.Sleep(rand.Next(100, 1000)) ' Get forks together atomically Dim leftFork = forks(id) Dim rightFork = forks((id + 1) Mod NUM_PHILOSOPHERS) WaitHandle.WaitAll({leftFork, rightFork}) ' Eat Thread.Sleep(rand.Next(100, 1000)) ' Put down forks; order of release doesnt matter leftFork.ReleaseMutex() rightFork.ReleaseMutex() Loop End Sub, TaskCreationOptions.LongRunning) Next i Task.WaitAll(philosophers)

The .NET Framework 4 parallel programming samples at http://code.msdn.microsoft.com/ParExtSamples contain several example implementations of the dining philosophers problem.

ANTI-PATTERNS LOCK(THIS) AND LOCK(TYPEOF(SOMETYPE))

Especially in code written early in the .NET Frameworks lifetime, it was common to see synchronization done in instance members with code such as:

Visual Basic
Sub SomeMethod() SyncLock Me ' ... critical region here End SyncLock End Sub

It was also common to see synchronization done in Shared members with code such as:

Visual Basic
Shared Sub SomeMethod() SyncLock GetType(Testing) ' ... critical region here End SyncLock End Sub

In general, this pattern should be avoided. Good object-oriented design results in implementation details remaining private through non-public state, and yet here, the locks used to protect that state are exposed. With these lock objects then public, it becomes possible for an external entity to accidentally or maliciously interfere with the internal workings of the implementation, as well as make common multithreading problems such as deadlocks more likely. (Additionally, Type instances can be domain agile, and a lock on a type in one AppDomain

Patterns of Parallel Programming

Page 114

may seep into another AppDomain, even if the state being protected is isolated within the AppDomain.) Instead and in general, non-public (and non-AppDomain-agile) objects should be used for locking purposes. The same guidance applies to MethodImplAttribute. The MethodImplAttribute accepts a MethodImplOptions enumeration value, one of which is Synchronized. When applied to a method, this ensures that only one thread at a time may access the attributed member:

Visual Basic
<MethodImpl(MethodImplOptions.Synchronized)> Sub SomeMethod() ' ... critical region here End Sub

However, it does so using the equivalent of the explicit locking code shown previously, with a lock on the instance for instance members and with a lock on the type for Shared members. As such, this option should be avoided.

READONLY SPINLOCK FIELDS

The readonly keyword informs the compiler that a field should only be updated by the constructor; any attempts to modify the field from elsewhere results in a compiler error. As such, you might be tempted to write code like the following:

Visual Basic
Private ReadOnly _lock As SpinLock ' WARNING!

Dont do this. Due to the nature of structs and how they interact with the readonly keyword, every access to this _lock field will return a copy of the SpinLock, rather than the original. As a result, every call to _lock.Enter will succeed in acquiring the lock, even if another thread thinks it owns the lock. For the same reason, dont pass try to pass SpinLocks around. In most cases, when you do so, youll be making a copy of the SpinLock. As an example, consider the desire to write an extension method for SpinLock that executes a user-provided delegate while holding the lock:

Visual Basic
' WARNING! DONT DO THIS. <Extension()> Public Sub Execute(ByVal sl As SpinLock, ByVal runWhileHoldingLock As Action) Dim lockWasTaken As Boolean = False Try sl.Enter(lockWasTaken) runWhileHoldingLock() Finally If lockWasTaken Then sl.Exit() End Try End Sub

Theoretically, this code should allow you to write code like:

Visual Basic
_lock.Execute( Sub()

Patterns of Parallel Programming

Page 115

' will be run while holding the lock End Sub)

However, the code is very problematic. The SpinLock being targeted by the method will be passed by value, such that the method will execute on a copy of the SpinLock rather than the original. To write such a method correctly, youd need to pass the SpinLock into the Execute method by reference, and C# doesnt permit an extension method to target a value passed by reference. Fortunately, Visual Basic does, and we could write this extension method correctly as follows:

Visual Basic
<Extension()> Public Sub Execute(ByRef sl As SpinLock, ByVal runWhileHoldingLock As Action) Dim lockWasTaken As Boolean Try sl.Enter(lockWasTaken) runWhileHoldingLock() Finally If lockWasTaken Then sl.Exit() End Try End Sub

See the blog post at http://blogs.msdn.com/pfxteam/archive/2009/05/07/9592359.aspx for more information about this dangerous phenomenon.

Patterns of Parallel Programming

Page 116

CONCLUSION
Understanding design and coding patterns as they relate to parallelism will help you to find more areas of your application that may be parallelized and will help you to do so efficiently. Knowing and understanding patterns of parallelization will also help you to significantly reduce the number of bugs that manifest in your code. Finally, using the new parallelization support in the .NET Framework 4 which encapsulate these patterns will not only help to reduce the bug count further, but it should help you to dramatically decrease the amount of time and code it takes to get up and running quickly and efficiently. Now, go forth and parallelize. Enjoy!

ACKNOWLEDGEMENTS
The author would like to thank the following people for their feedback on drafts of this paper: Donny Amalo, John Bristowe, Tina Burden, David Callahan, Chris Dern, Joe Duffy, Ed Essey, Lisa Feigenbaum, Boby George, Scott Hanselman, Jerry Higgins, Joe Hoag, Luke Hoban, Mike Liddell, Daniela Cristina Manu, Ade Miller, Pooja Nagpal, Jason Olson, Emad Omara, Igor Ostrovsky, Josh Phillips, Danny Shih, Cindy Song, Herb Sutter, Don Syme, Roy Tan, Ling Wo, and Huseyin Yildiz.

ABOUT THE AUTHOR

Stephen Toub is a Program Manager Lead on the Parallel Computing Platform team at Microsoft, where he spends his days focusing on the next generation of programming models and runtimes for concurrency. Stephen is also a Contributing Editor for MSDN Magazine, for which he writes the .NET Matters column, and is an avid speaker at conferences such as PDC, TechEd, and DevConnections. Prior to working on the Parallel Computing Platform, Stephen designed and built enterprise applications for companies such as GE, JetBlue, and BankOne. He was a developer for Microsoft Outlook as well as for the Microsoft Office Solution Accelerators. Stephen holds degrees in computer science from Harvard University and New York University.

This material is provided for informational purposes only. Microsoft makes no warranties, express or implied. 2010 Microsoft Corporation.

Patterns of Parallel Programming

Page 117

Learning Salesforce Development with Apex: Learn to Code, Run and Deploy Apex Programs for Complex Business Process and Critical Business Logic - 2nd Edition
From Everand
Learning Salesforce Development with Apex: Learn to Code, Run and Deploy Apex Programs for Complex Business Process and Critical Business Logic - 2nd Edition
Paul Battisson
No ratings yet
Learn Multithreading with Modern C++
From Everand
Learn Multithreading with Modern C++
James Raynard
No ratings yet
KNIME Essentials
From Everand
KNIME Essentials
Gábor Bakos
No ratings yet
NNG Reference Manual, Second Edition
From Everand
NNG Reference Manual, Second Edition
Garrett D'Amore
No ratings yet
Security Lighting Guide
No ratings yet
Security Lighting Guide
73 pages
Fuel Injection Equipment (Diesel) (9166A056AFH02043)
No ratings yet
Fuel Injection Equipment (Diesel) (9166A056AFH02043)
4 pages
Patterns of Parallel Programming in C#
No ratings yet
Patterns of Parallel Programming in C#
118 pages
Preliminary
No ratings yet
Preliminary
169 pages
Dataflow and Reactive Programming Systems
From Everand
Dataflow and Reactive Programming Systems
Matt Carkci
No ratings yet
Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell pdf download
100% (1)
Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell pdf download
49 pages
Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell - The newest ebook version is ready, download now to explore
100% (1)
Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell - The newest ebook version is ready, download now to explore
49 pages
Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell 2024 Scribd Download
100% (4)
Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell 2024 Scribd Download
82 pages
Instant Download (Ebook) Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures (Patterns & Practices) by Colin Campbell, Ralph Johnson, Ade Miller, Stephen Toub ISBN 9780735651593, 0735651590 PDF All Chapters
No ratings yet
Instant Download (Ebook) Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures (Patterns & Practices) by Colin Campbell, Ralph Johnson, Ade Miller, Stephen Toub ISBN 9780735651593, 0735651590 PDF All Chapters
82 pages
Download Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell ebook All Chapters PDF
100% (6)
Download Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell ebook All Chapters PDF
81 pages
Instant Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell ebook download PDF full chapters
100% (2)
Instant Parallel Programming with Microsoft NET Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns Practices 1st Edition Colin Campbell ebook download PDF full chapters
82 pages
Getting Started with Terraform
From Everand
Getting Started with Terraform
Kirill Shirinkin
5/5 (1)
Asynchronous Programming in Rust: Learn asynchronous programming by building working examples of futures, green threads, and runtimes
From Everand
Asynchronous Programming in Rust: Learn asynchronous programming by building working examples of futures, green threads, and runtimes
Carl Fredrik Samson
No ratings yet
Learn Rust Programming: Safe Code, Supports Low Level and Embedded Systems Programming with a Strong Ecosystem (English Edition)
From Everand
Learn Rust Programming: Safe Code, Supports Low Level and Embedded Systems Programming with a Strong Ecosystem (English Edition)
Claus Matzinger
No ratings yet
4976592
No ratings yet
4976592
81 pages
Edge Cloud Operations: A Systems Approach
From Everand
Edge Cloud Operations: A Systems Approach
Larry L Peterson
No ratings yet
Parallel Programming with Microsoft Visual C Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns and Practices 1st Edition Colin Campbell - Download the full set of chapters carefully compiled
100% (1)
Parallel Programming with Microsoft Visual C Design Patterns for Decomposition and Coordination on Multicore Architectures Patterns and Practices 1st Edition Colin Campbell - Download the full set of chapters carefully compiled
56 pages
ASP.NET For Beginners: The Simple Guide to Learning ASP.NET Web Programming Fast!
From Everand
ASP.NET For Beginners: The Simple Guide to Learning ASP.NET Web Programming Fast!
Tim Warren
No ratings yet
Programming Concepts in C++
From Everand
Programming Concepts in C++
Robert Burns
No ratings yet
Massing Ill 4
No ratings yet
Massing Ill 4
29 pages
Implementing C# 11 and .NET 7.0: Learn how to build cross-platform apps with .NET Core (English Edition)
From Everand
Implementing C# 11 and .NET 7.0: Learn how to build cross-platform apps with .NET Core (English Edition)
Fiodar Sazanavets
No ratings yet
Infrastructure as Code with OpenTofu: A perfect Terraform alternative to manage compute, storage, networking and other infrastructure resources
From Everand
Infrastructure as Code with OpenTofu: A perfect Terraform alternative to manage compute, storage, networking and other infrastructure resources
Tyran Vosk
No ratings yet
Infrastructure as Code with OpenTofu
From Everand
Infrastructure as Code with OpenTofu
Tyran Vosk
No ratings yet
Learning Ansible
From Everand
Learning Ansible
Wayne Taylor
No ratings yet
.Net Framework and Programming in ASP.NET
From Everand
.Net Framework and Programming in ASP.NET
Priyanka Agarwal
No ratings yet
From Zero to Market with Flutter
From Everand
From Zero to Market with Flutter
Viachaslau Lyskouski
No ratings yet
Algorithms Made Simple: Understanding the Building Blocks of Software
From Everand
Algorithms Made Simple: Understanding the Building Blocks of Software
William E. Clark
No ratings yet
From Zero to Market with Flutter: Desktop, Mobile, and Web Distribution
From Everand
From Zero to Market with Flutter: Desktop, Mobile, and Web Distribution
Viachaslau Lyskouski
No ratings yet
Terraform for Developers, Second Edition: Essentials of Infrastructure Automation and Provisioning
From Everand
Terraform for Developers, Second Edition: Essentials of Infrastructure Automation and Provisioning
Kimiko Lee
No ratings yet
Terraform for Developers, Second Edition
From Everand
Terraform for Developers, Second Edition
Kimiko Lee
No ratings yet
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet
Mastering Concurrent Programming with Go
From Everand
Mastering Concurrent Programming with Go
Brett Neutreon
No ratings yet
Programming APIs with C# and .NET: Develop high-performance APIs that ensure seamless application communication and enhanced security
From Everand
Programming APIs with C# and .NET: Develop high-performance APIs that ensure seamless application communication and enhanced security
Jesse Liberty
No ratings yet
Concurrency and Multithreading in C: POSIX Threads and Synchronization
From Everand
Concurrency and Multithreading in C: POSIX Threads and Synchronization
Larry Jones
No ratings yet
.NET Mastery: The .NET Interview Questions and Answers
From Everand
.NET Mastery: The .NET Interview Questions and Answers
Chetan Singh
No ratings yet
Programming in Star
From Everand
Programming in Star
Francis McCabe
No ratings yet
Machine Learning in Production: Master the art of delivering robust Machine Learning solutions with MLOps (English Edition)
From Everand
Machine Learning in Production: Master the art of delivering robust Machine Learning solutions with MLOps (English Edition)
Suhas Pote
No ratings yet
Implementing Design Patterns in C# 11 and .NET 7 - 2nd Edition: Learn how to design and develop robust and scalable applications using design patterns (English Edition)
From Everand
Implementing Design Patterns in C# 11 and .NET 7 - 2nd Edition: Learn how to design and develop robust and scalable applications using design patterns (English Edition)
Alexandre F. Malavasi Cardoso
No ratings yet
Inter-Service Communication with Go: Mastering protocols, queues, and event-driven architectures in Go (English Edition)
From Everand
Inter-Service Communication with Go: Mastering protocols, queues, and event-driven architectures in Go (English Edition)
Dušan Stojanović
No ratings yet
Objective-C Programming Nuts and bolts
From Everand
Objective-C Programming Nuts and bolts
Keith Lee
No ratings yet
Learning .NET High-performance Programming
From Everand
Learning .NET High-performance Programming
Antonio Esposito
No ratings yet
Mastering C# Concurrency
From Everand
Mastering C# Concurrency
Agafonov Eugene
2/5 (2)
Learning Concurrent Programming in Scala
From Everand
Learning Concurrent Programming in Scala
Aleksandar Prokopec
No ratings yet
Linux, Apache, MySQL, PHP Performance End to End
From Everand
Linux, Apache, MySQL, PHP Performance End to End
Colin McKinnon
5/5 (1)
Ansible Linux Filesystem By Examples: 30+ Automation Examples on Linux File and Directory Operation for Modern IT Infrastructure
From Everand
Ansible Linux Filesystem By Examples: 30+ Automation Examples on Linux File and Directory Operation for Modern IT Infrastructure
Luca Berton
No ratings yet
SQLite Database Programming for Xamarin: Cross-platform C# database development for iOS and Android using SQLite.XM
From Everand
SQLite Database Programming for Xamarin: Cross-platform C# database development for iOS and Android using SQLite.XM
Anthony Serpico
No ratings yet
Rust Crash Course: Build High-Performance, Efficient and Productive Software with the Power of Next-Generation Programming Skills (English Edition)
From Everand
Rust Crash Course: Build High-Performance, Efficient and Productive Software with the Power of Next-Generation Programming Skills (English Edition)
Abhishek Kumar
No ratings yet
Machine Learning in Python: Hands on Machine Learning with Python Tools, Concepts and Techniques
From Everand
Machine Learning in Python: Hands on Machine Learning with Python Tools, Concepts and Techniques
Bob Mather
5/5 (1)
Machine Learning in Python: Hands on Machine Learning with Python Tools, Concepts and Techniques
From Everand
Machine Learning in Python: Hands on Machine Learning with Python Tools, Concepts and Techniques
Abiprod Pty Ltd
5/5 (10)
Data Structures and Algorithms with Go: Create efficient solutions and optimize your Go coding skills (English Edition)
From Everand
Data Structures and Algorithms with Go: Create efficient solutions and optimize your Go coding skills (English Edition)
Dušan Stojanović
No ratings yet
Mastering Python Concurrency: Essential Techniques
From Everand
Mastering Python Concurrency: Essential Techniques
Ed Norex
No ratings yet
Concurrency in C++: Writing High-Performance Multithreaded Code
From Everand
Concurrency in C++: Writing High-Performance Multithreaded Code
Robert Johnson
No ratings yet
Swift Programming Nuts and bolts
From Everand
Swift Programming Nuts and bolts
Keith Lee
No ratings yet
iOS Programming Nuts and bolts
From Everand
iOS Programming Nuts and bolts
Keith Lee
4/5 (1)
Distributed Computing with Python
From Everand
Distributed Computing with Python
Francesco Pierfederici
No ratings yet
Parallel Framework, The Need
No ratings yet
Parallel Framework, The Need
6 pages
Ansible For Containers and Kubernetes By Examples
From Everand
Ansible For Containers and Kubernetes By Examples
Berton
No ratings yet
Protocol Buffers Handbook: Getting deeper into Protobuf internals and its usage
From Everand
Protocol Buffers Handbook: Getting deeper into Protobuf internals and its usage
Clément Jean
No ratings yet
Dynamical Evolution of Multiple Four-Wave-Mixing Processes in An Optical Fiber
No ratings yet
Dynamical Evolution of Multiple Four-Wave-Mixing Processes in An Optical Fiber
18 pages
Practical Experiment No. 5: Brewster Angle Determination Physics Lab
100% (2)
Practical Experiment No. 5: Brewster Angle Determination Physics Lab
3 pages
Prefunctional Test Checklist - Exhaust - Airport
No ratings yet
Prefunctional Test Checklist - Exhaust - Airport
7 pages
WDC Assignment-1
No ratings yet
WDC Assignment-1
3 pages
SECTION 11 12 00 Parking Control Equipment
No ratings yet
SECTION 11 12 00 Parking Control Equipment
10 pages
What Are The Advantages and Benefits of Chirp Signals
100% (1)
What Are The Advantages and Benefits of Chirp Signals
2 pages
Spare Parts Manual (HB 4700 - HB 4700DP)
No ratings yet
Spare Parts Manual (HB 4700 - HB 4700DP)
36 pages
Ads QT 24 0233-R2
No ratings yet
Ads QT 24 0233-R2
4 pages
Problems: and Loward
No ratings yet
Problems: and Loward
5 pages
At Tiny 2313 LCD and Keypad Controller
100% (1)
At Tiny 2313 LCD and Keypad Controller
7 pages
Gusset Plate Checks
No ratings yet
Gusset Plate Checks
7 pages
Formula For Aerodynamic Heating
No ratings yet
Formula For Aerodynamic Heating
3 pages
We I Bull Distribution
No ratings yet
We I Bull Distribution
3 pages
Data Archiving in Sales and Distribution (SD)
No ratings yet
Data Archiving in Sales and Distribution (SD)
48 pages
Friction Stir Welding
No ratings yet
Friction Stir Welding
12 pages
Electrical Testing Commissioning
No ratings yet
Electrical Testing Commissioning
5 pages
Applications: Bare Copper Conductor. Solid and Stranded
No ratings yet
Applications: Bare Copper Conductor. Solid and Stranded
3 pages
Tugas 2 Hidrodinamika 1 Dosen Bapak Sujantoko ST, MT
No ratings yet
Tugas 2 Hidrodinamika 1 Dosen Bapak Sujantoko ST, MT
17 pages
11 Mathematics Ncert Ch07 Permutations and Combinations 7.1
No ratings yet
11 Mathematics Ncert Ch07 Permutations and Combinations 7.1
3 pages
English For AB Seamen
No ratings yet
English For AB Seamen
96 pages
S.1 Short Title of Commencement: Kolkata Port Trust Scale of Rates Gazette No.65 Dated 26.02.2014 General
No ratings yet
S.1 Short Title of Commencement: Kolkata Port Trust Scale of Rates Gazette No.65 Dated 26.02.2014 General
42 pages
Measuring Osnr in WDM Systems: - Effects of Resolution Bandwidth and Optical Rejection Ratio
No ratings yet
Measuring Osnr in WDM Systems: - Effects of Resolution Bandwidth and Optical Rejection Ratio
16 pages
Merchant Navy 1 SB
100% (4)
Merchant Navy 1 SB
32 pages
Installation, Operation and Maintenance Instructions: Vibratory Feeders
No ratings yet
Installation, Operation and Maintenance Instructions: Vibratory Feeders
12 pages
Manual Multímetro PDF
No ratings yet
Manual Multímetro PDF
200 pages
Sapal 2in1
No ratings yet
Sapal 2in1
1 page
Averna Dp-360 Brochure
No ratings yet
Averna Dp-360 Brochure
4 pages
Gas Agency Management
40% (5)
Gas Agency Management
18 pages