Introduction to Parallel Programming - Student Workbook with Instructor's Notes
The information contained in this document is provided for informational purposes only and represents the current view of Intel Corporation ("Intel") and
its contributors ("Contributors") as of the date of publication. Intel and the Contributors make no commitment to update the information contained
in this document, and Intel reserves the right to make changes at any time, without notice.
Legal Lines and Disclaimers
DISCLAIMER. THIS DOCUMENT IS PROVIDED "AS IS." NEITHER INTEL NOR THE CONTRIBUTORS MAKE ANY REPRESENTATIONS OF ANY KIND WITH
RESPECT TO PRODUCTS REFERENCED HEREIN, WHETHER SUCH PRODUCTS ARE THOSE OF INTEL, THE CONTRIBUTORS, OR THIRD PARTIES. INTEL,
AND ITS CONTRIBUTORS EXPRESSLY DISCLAIM ANY AND ALL WARRANTIES, IMPLIED OR EXPRESS, INCLUDING WITHOUT LIMITATION, ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR ANY PARTICULAR PURPOSE, NON-INFRINGEMENT, AND ANY WARRANTY ARISING OUT OF THE
INFORMATION CONTAINED HEREIN, INCLUDING WITHOUT LIMITATION, ANY PRODUCTS, SPECIFICATIONS, OR OTHER MATERIALS REFERENCED
HEREIN. INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT THIS DOCUMENT IS FREE FROM ERRORS, OR THAT ANY PRODUCTS OR OTHER
TECHNOLOGY DEVELOPED IN CONFORMANCE WITH THIS DOCUMENT WILL PERFORM IN THE INTENDED MANNER, OR WILL BE FREE FROM
INFRINGEMENT OF THIRD PARTY PROPRIETARY RIGHTS, AND INTEL, AND ITS CONTRIBUTORS DISCLAIM ALL LIABILITY THEREFOR.
INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT ANY PRODUCT REFERENCED HEREIN OR ANY PRODUCT OR TECHNOLOGY DEVELOPED IN
RELIANCE UPON THIS DOCUMENT, IN WHOLE OR IN PART, WILL BE SUFFICIENT, ACCURATE, RELIABLE, COMPLETE, FREE FROM DEFECTS OR SAFE FOR
ITS INTENDED PURPOSE, AND HEREBY DISCLAIM ALL LIABILITIES THEREFOR. ANY PERSON MAKING, USING OR SELLING SUCH PRODUCT OR
TECHNOLOGY DOES SO AT HIS OR HER OWN RISK.
Licenses may be required. Intel, its contributors and others may have patents or pending patent applications, trademarks, copyrights or other
intellectual property rights covering subject matter contained or described in this document. No license, express or implied, by estoppel or otherwise,
to any intellectual property rights of Intel or any other party is granted herein. It is your responsibility to seek licenses for such intellectual property
rights from Intel and others where appropriate.
Limited License Grant. Intel hereby grants you a limited copyright license to copy this document for your use and internal distribution only. You may not
distribute this document externally, in whole or in part, to any other person or entity.
LIMITED LIABILITY. IN NO EVENT SHALL INTEL, OR ITS CONTRIBUTORS HAVE ANY LIABILITY TO YOU OR TO ANY OTHER THIRD PARTY, FOR ANY LOST
PROFITS, LOST DATA, LOSS OF USE OR COSTS OF PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, OR FOR ANY DIRECT, INDIRECT, SPECIAL OR
CONSEQUENTIAL DAMAGES ARISING OUT OF YOUR USE OF THIS DOCUMENT OR RELIANCE UPON THE INFORMATION CONTAINED HEREIN, UNDER ANY
CAUSE OF ACTION OR THEORY OF LIABILITY, AND IRRESPECTIVE OF WHETHER INTEL, OR ANY CONTRIBUTOR HAS ADVANCE NOTICE OF THE
POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS SHALL APPLY NOTWITHSTANDING THE FAILURE OF THE ESSENTIAL PURPOSE OF ANY LIMITED
REMEDY.
Intel and Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright 2007, Intel Corporation. All Rights Reserved.
Contents
Lab 1: Identifying Parallelism
Lab 2: Introducing Threads
Lab 3: Domain Decomposition with OpenMP
Lab 4: Critical Sections and Reductions with OpenMP
Lab 5: Implementing Task Decompositions
Lab 6: Analyzing Parallel Performance
Lab 7: Improving Parallel Performance
Lab 8: Choosing the Appropriate Thread Model
Instructor's Notes and Solutions
Lab 1: Identifying Parallelism

Time: Thirty minutes
Part A
For each of the following code segments, draw a dependence graph and determine whether the
computation is suitable for parallelization. If the computation is suitable for parallelization, decide
how it should be divided among three CPUs. You may assume that all functions are free of side
effects.
Example 1:
for (i = 0; i < 4; i++) {
a[i] = 0.25 * i;
b[i] = 4.0 / (a[i] * a[i]);
}
Example 2:
if (a < b) c = f(-1);
else if (a == b) c = f(0);
else c = f(1);
Example 3:
for (i = 0; i < 4; i++)
for (j = 0; j < 3; j++)
a[i][j] = f(a[i][j] * b[j]);
Example 4:
prime = 2;
do {
first = prime * prime;
for (i = first; i < N; i += prime)
marked[i] = 1;
while (marked[++prime]);
} while (prime * prime < N);
Example 5:
switch (i) {
case 0:
a = f(x);
b = g(y);
break;
case 1:
a = g(x);
b = f(y);
break;
case -1:
a = f(y);
b = f(x);
break;
}
Example 6:
sum = 0.0;
for (i = 0; i < 9; i++)
sum = sum + b[i];
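For reference once OpenMP is introduced in Lab 3: a running sum like Example 6 is the textbook reduction pattern. A minimal hedged sketch (the array values below are made up for illustration; compiled without OpenMP support, the pragma is simply ignored and the loop runs sequentially):

#include <stdio.h>

int main (void)
{
   double b[9] = {0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5};
   double sum = 0.0;
   int i;

   /* Each thread accumulates a private partial sum over its share of
      the iterations; OpenMP combines the partial sums into 'sum'. */
   #pragma omp parallel for reduction(+:sum)
   for (i = 0; i < 9; i++)
      sum = sum + b[i];

   printf ("sum = %f\n", sum);
   return 0;
}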
Part B
Describe how parallelism could be used to reduce the time needed to perform each of the following
tasks.
Example 7:
A relational database table contains (among other things) student ID numbers and their cumulative
GPAs. Find out the percentage of students with a cumulative GPA greater than 3.5.
Example 8:
A ray-tracing program renders a realistic image by tracing one or more rays for each pixel of the
display window.
Example 9:
An operating system utility searches a disk and identifies every text file containing a particular
phrase specified by the user.
Example 10:
We want to improve a game similar to Civilization IV by reducing the amount of time the human
player must wait for the virtual world to be set up.
Lab 2: Introducing Threads

Time: Thirty minutes

For each of the following code segments:
1. identify the computations that can be performed in parallel;
2. decide whether the best thread model is the fork/join model or the general threads model;
3. determine fork/join points (in the case of the fork/join model) or thread creation points (in the case of the general threads model); and
4. decide which variables should be shared and which variables should be private.
Example 1:
/* Matrix multiplication */
int i, j, k;
double **a, **b, **c, tmp;
...
for (i = 0; i < m; i++)
for (j = 0; j < n; j++) {
tmp = 0.0;
for (k = 0; k < p; k++)
tmp += a[i][k] * b[k][j];
c[i][j] = tmp;
}
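One plausible fork/join answer for this example, shown here only as a point of reference and not the sole solution: split the outer loop across threads, with the matrices and dimensions shared and j, k, and tmp private (i becomes private automatically as the parallel loop index):

#pragma omp parallel for private(j, k, tmp)
for (i = 0; i < m; i++)
for (j = 0; j < n; j++) {
tmp = 0.0; /* per-thread running total for c[i][j] */
for (k = 0; k < p; k++)
tmp += a[i][k] * b[k][j];
c[i][j] = tmp; /* threads write disjoint rows of c */
}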
Example 2:
/* This program implements an Internet-based service that
responds to number-theoretic queries */
int main() {
request r;
...
while(1) {
next_request(&r);
acknowledge_request (r);
switch (r.type) {
case PRIME:
primality_test (r);
break;
case PERFECT:
perfect_test (r);
break;
case WARING:
find_waring_integer (r);
break;
}
}
...
}
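This loop runs indefinitely and each request can be serviced independently, which points toward the general threads model rather than fork/join. A hedged Pthreads sketch of the dispatch loop; handle_request is a hypothetical wrapper around the switch statement above, and the request type and helper functions are assumed from the fragment:

#include <pthread.h>
#include <stdlib.h>

void *handle_request (void *arg) /* hypothetical wrapper */
{
request *r = (request *) arg;
/* ... dispatch on r->type exactly as in the switch above ... */
free (r);
return NULL;
}

int main ()
{
pthread_t tid;
request *r;
while (1) {
r = (request *) malloc (sizeof(request));
next_request (r);
acknowledge_request (*r);
/* One detached thread per request; main loops back at once */
pthread_create (&tid, NULL, handle_request, r);
pthread_detach (tid);
}
}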
Example 3:
double inner_product (double *x, double *y, int n)
{
int i;
double result;
result = 0.0;
for (i = 0; i < n; i++)
result += x[i] * y[i];
return result;
}
int main (int argc, char *argv[])
{
double *d, *g, *t, w, x, y, z;
int i, n;
...
for (i = 0; i < n; i++)
d[i] = -g[i] + (w/x) * d[i];
y = inner_product (d, g, n);
z = inner_product (d, t, n);
...
}
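Worth noting when you analyze this example: the two inner_product calls read d, g, and t but write only y and z, so they are independent of each other. A hedged OpenMP sketch of running them as two tasks:

/* The two dot products share no written data, so each can be
   assigned to a different thread. */
#pragma omp parallel sections
{
#pragma omp section
y = inner_product (d, g, n);
#pragma omp section
z = inner_product (d, t, n);
}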
Example 4:
/* Finite difference method to solve string vibration
problem (from Michael J. Quinn, Parallel Programming
in C with MPI and OpenMP, p. 325) */
#include <stdio.h>
#include <math.h>
#define F(x) sin(3.14159*(x))
#define G(x) 0.0
#define a 1.0
#define c 2.0
#define m 2000
#define n 1000
#define T 1.0
int main ()
{
float h, k, L;
int i, j;
float u[m+1][n+1];
h = a / n;
k = T / m;
L = (k*c/h)*(k*c/h);
for (j = 0; j <= m; j++) u[j][0] = u[j][n] = 0.0;
for (i = 1; i < n; i++) u[0][i] = F(i*h);
for (i = 1; i < n; i++)
u[1][i] = (L/2.0)*(u[0][i+1] + u[0][i-1]) +
(1.0 - L) * u[0][i] + k * G(i*h);
for (j = 1; j < m; j++)
for (i = 1; i < n; i++)
u[j+1][i] = 2.0*(1.0 - L) * u[j][i] +
L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
for (j = 0; j <= m; j++) {
for (i = 0; i <= n; i++) printf ("%6.3f ", u[j][i]);
putchar ('\n');
}
return 0;
}
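A hint for the analysis: each time step j reads the two preceding time steps, so the outer loop is inherently sequential; only the sweep over i within one time step can be parallelized. A hedged sketch:

for (j = 1; j < m; j++) {
/* Within one time step, u[j+1][i] reads only rows j and j-1,
   so the iterations over i are independent. */
#pragma omp parallel for
for (i = 1; i < n; i++)
u[j+1][i] = 2.0*(1.0 - L) * u[j][i] +
L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
}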
Lab 3: Domain Decomposition with OpenMP

Time: Fifty minutes
Note: You will need to generate matrices for the matrix multiplication exercise; a utility
program gen.c is included in the lab folder for this purpose. Compile this code, and run it
to create files matrix_a and matrix_b; explicit usage is outlined in the code itself. Be sure
to generate a workload sufficiently large (e.g., matrix dimensions 1000 x 1000) to be
meaningful.
Program 1:
/*
 * Matrix multiplication
 */
#include <stdio.h>
/*
 * Function 'read_matrix' reads a matrix stored in a file.
 */
void read_matrix (
char *s, /* File name */
float ***subs, /* 2D submatrix indices */
int *m, /* Number of rows in matrix */
int *n) /* Number of columns in matrix */
{
char error_msg[80];
FILE *fptr;
Program 2:
/*
 * Polynomial Interpolation
 *
 * This program demonstrates a function that performs polynomial
 * interpolation. The function is taken from "Numerical Recipes
 * in C", Second Edition, by William H. Press, Saul A. Teukolsky,
 * William T. Vetterling, and Brian P. Flannery.
 */
#include <math.h>
#define N 20
#define X 14.5
}
free_vector (d, 1, n);
free_vector (c, 1, n);
}
/* Functions 'sign' and 'init' are used to initialize the
x and y vectors holding known values of the function.
*/
int sign (int j)
{
if (j % 2 == 0) return 1;
else return -1;
}
void init (int i, double *x, double *y)
{
int j;
*x = (double) i;
*y = sin(i);
}
/* Function 'main' demonstrates the polynomial interpolation function
by generating some test points and then calling 'polint' with a
value of x between two of the test points. */
int main (int argc, char *argv[])
{
double x, y, dy;
double *xa, *ya;
int i;
xa = vector (1, N);
ya = vector (1, N);
/* Initialize xa's and ya's */
for (i = 1; i <= N; i++) {
init (i, &xa[i], &ya[i]);
printf ("f(%4.2f) = %13.11f\n", xa[i], ya[i]);
}
/* Interpolate polynomial at X */
polint (xa, ya, N, X, &y, &dy);
printf ("\nf(%6.3f) = %13.11f with error bound %13.11f\n", X, y,
fabs(dy));
free_vector (xa, 1, N);
free_vector (ya, 1, N);
return 0;
}
Lab 4: Critical Sections and Reductions with OpenMP

Time: Twenty minutes
Exercise 1
Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the
program, execute it on 1 and 2 threads, and make sure the program output is the same as the
sequential program. Finally, compare the execution times of the sequential, single-threaded, and
double-threaded programs.
/*
 *
 */
Exercise 2
Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the
program, execute it on 1 and 2 threads, and make sure the program output is the same as the
sequential program. Finally, compare the execution times of the sequential, single-threaded, and
double-threaded programs.
/*
 * Sieve of Eratosthenes
 *
 * This program counts the number of primes less than or equal to n.
 */
#include <stdio.h>
#include <stdlib.h>
#define MIN(a,b) ((a)<(b)?(a):(b))
int main (int argc, char *argv[])
{
int count; /* Prime count */
int first; /* Index of first multiple */
int i;
int index; /* Index of current prime */
char *marked; /* Marks for 2,...,'n' */
int n; /* Sieving from 2, ..., 'n' */
int prime; /* Current prime */
if (argc != 2) {
printf ("Command line: %s <m>\n", argv[0]);
exit (1);
}
n = atoi(argv[1]);
marked = (char *) malloc (n-1);
if (marked == NULL) {
printf ("Cannot allocate enough memory\n");
exit (1);
}
for (i = 0; i < n-1; i++) marked[i] = 1;
index = 0;
prime = 2;
do {
first = prime * prime - 2;
for (i = first; i < n-1; i += prime) marked[i] = 0;
while (!marked[++index]);
prime = index + 2;
} while (prime * prime <= n);
count = 0;
for (i = 0; i < n-1; i++)
count += marked[i];
printf ("There are %d primes less than or equal to %d\n", count, n);
return 0;
}
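One hedged starting point, not the whole answer: the final counting loop is a straightforward reduction, while the marking loop needs more care because successive primes depend on earlier marking.

/* Partial counts are accumulated per thread and combined at the end */
count = 0;
#pragma omp parallel for reduction(+:count)
for (i = 0; i < n-1; i++)
count += marked[i];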
Exercise 3
The Monte Carlo method refers to the use of statistical sampling to solve a problem. Some experts
say that more than half of all supercomputing cycles are devoted to Monte Carlo computations. A
Monte Carlo program can benefit from parallel processing in two ways: it can find a solution of a given
accuracy in less time, or it can find a more accurate solution in the same amount of time. This
assignment focuses on the first: reducing the time needed to reach a given accuracy. The following C program uses the
Monte Carlo method to come up with an approximation to pi. Add OpenMP directives to make the
program suitable for execution on multiple threads. Divide the number of points to be generated
evenly among the threads. Compare the execution times of the sequential, single-threaded, and
double-threaded programs.
/*
 * Monte Carlo computation of pi
 */
#include <stdio.h>
int main (int argc, char *argv[])
{
int count; /* Points inside unit circle */
int i;
int samples; /* Number of points to generate */
unsigned short xi[3]; /* Random number seed */
double x, y; /* Coordinates of point */
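A hedged sketch of how the sampling loop could look once parallelized. It assumes the usual erand48 sampling of the unit square, and uses omp_get_thread_num (from <omp.h>) to give each thread a distinct seed, which matters because erand48 updates the seed array it is handed:

count = 0;
#pragma omp parallel private(x, y) firstprivate(xi) reduction(+:count)
{
xi[2] = omp_get_thread_num(); /* distinct stream per thread (assumed scheme) */
#pragma omp for
for (i = 0; i < samples; i++) {
x = erand48(xi); /* point in the unit square */
y = erand48(xi);
if (x*x + y*y <= 1.0) count++; /* inside the quarter circle */
}
}
printf ("Estimate of pi: %8.6f\n", 4.0 * (double) count / samples);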
Lab 5: Implementing Task Decompositions

Time: Sixty minutes
Exercise 1
Make this quicksort program parallel by adding the appropriate OpenMP pragmas and clauses.
Compile the program, execute it on 1 and 2 threads, and make sure the program is still correctly
sorting the elements of array A. Finally, compare the execution times of the sequential, single-threaded, and double-threaded programs.
/*
 * Stack-based Quicksort
 *
 * The quicksort algorithm works by repeatedly dividing unsorted
 * sub-arrays into two pieces: one piece containing the smaller
 * elements and the other piece containing the larger elements.
 * The splitter element, used to subdivide the unsorted sub-array,
 * ends up in its sorted location. By repeating this process on
 * smaller and smaller sub-arrays, the entire array gets sorted.
 *
 * The typical implementation of quicksort uses recursion. This
 * implementation replaces recursion with iteration. It manages its
 * own stack of unsorted sub-arrays. When the stack of unsorted
 * sub-arrays is empty, the array is sorted.
 */
#include <stdio.h>
#include <stdlib.h>
#define MAX_UNFINISHED 1000
int unfinished_index; /* Index of top of stack */
float *A; /* Array of elements to sort */
int n; /* Number of elements in A */
struct {
int first; /* Low index of unsorted sub-array */
int last; /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED]; /* Stack of unsorted sub-arrays */
/* Function 'swap' exchanges the values of two floats. */
void swap (float *x, float *y)
{
float tmp;
tmp = *x;
*x = *y;
*y = tmp;
}
/* Function 'partition' actually does the sorting by dividing an
unsorted sub-array into two parts: those less than or equal to the
splitter, and those greater than the splitter. The splitter is the
last element in the unsorted sub-array. The splitter ends up in its
final, sorted location. The function returns the final location of
the splitter (its index). This code is an implementation of the
algorithm appearing in Introduction to Algorithms, Second Edition,
by Cormen, Leiserson, Rivest, and Stein (The MIT Press, 2001). */
int partition (int first, int last)
{
int i, j;
float x;
x = A[last];
i = first - 1;
for (j = first; j < last; j++)
if (A[j] <= x) {
i++;
swap (&A[i], &A[j]);
}
swap (&A[i+1], &A[last]);
return (i+1);
}
/* Function 'quicksort' repeatedly retrieves the indices of unsorted
sub-arrays from the stack and calls 'partition' to divide these
sub-arrays into two pieces. It keeps one of the pieces and puts the
other piece on the stack of unsorted sub-arrays. Eventually it ends
up with a piece that doesn't need to be sorted. At this point it
gets the indices of another unsorted sub-array from the stack. The
function continues until the stack is empty. */
void quicksort (void)
{
int first;
int last;
int my_index;
int q;
/* Split point in array */
while (unfinished_index >= 0) {
my_index = unfinished_index;
unfinished_index--;
first = unfinished[my_index].first;
last = unfinished[my_index].last;
while (first < last) {
q = partition (first, last);
/* Put one piece on the stack of unsorted sub-arrays;
   keep sorting the other piece */
unfinished_index++;
unfinished[unfinished_index].first = q+1;
unfinished[unfinished_index].last = last;
last = q - 1;
}
}
}
int main (int argc, char *argv[])
{
int i;
int seed; /* Seed for random number generator */
unsigned short xi[3]; /* State for erand48 */
if (argc != 3) {
printf ("Command-line syntax: %s <n> <seed>\n", argv[0]);
exit (-1);
}
seed = atoi (argv[2]);
xi[0] = xi[1] = xi[2] = seed;
n = atoi (argv[1]);
A = (float *) malloc (n * sizeof(float));
for (i = 0; i < n; i++)
A[i] = erand48(xi);
/*
print_float_array (A, n);
*/
unfinished[0].first = 0;
unfinished[0].last = n-1;
unfinished_index = 0;
quicksort ();
/*
print_float_array (A, n);
*/
if (verify_sorted (A, n)) printf ("Elements are sorted\n");
else printf ("ERROR: Elements are NOT sorted\n");
return 0;
}
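A hedged hint for the parallelization: unfinished, unfinished_index, and the elements of A are shared, so every push and pop of the stack must be made atomic. One minimal way to guard the pop is shown below; the push needs the same treatment, and termination detection for threads that find the stack empty is deliberately left to you:

#pragma omp critical
{
my_index = unfinished_index;
unfinished_index--;
}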
Lab 6: Analyzing Parallel Performance

Time: Thirty-five minutes
Exercise 1
You are responsible for maintaining a library of core functions used by a wide variety of programs in
an application suite. Your supervisor has noted the availability of multi-core processors and wants to
know whether rewriting the library of functions using threads would significantly improve the
performance of the programs in the application suite. What do you need to do to provide a
meaningful answer?
Exercise 2
Somebody wrote an OpenMP program to solve the problem posed in Lab 5 and benchmarked its
performance sorting 25 million keys. Here are the run times of the program, as reported by the
command-line utility time:
Threads
1
2
3
4
What is the efficiency of the multithreaded program for 2, 3, and 4 threads? What can you conclude
about the design of the parallel program? Can you offer any suggestions for improving the
performance of the program?
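For reference, a hedged reminder of the definitions (the numbers below are made up for illustration): speedup on p threads is S(p) = T(1) / T(p), and efficiency is E(p) = S(p) / p. For example, a run that takes 10.0 s on one thread and 4.0 s on four threads has speedup 10.0/4.0 = 2.5 and efficiency 2.5/4 = 0.625.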
Exercise 3
A co-worker has been working on converting a sequential program into a multithreaded program. At
this point, only some of the functions of the program have been made parallel. On a key data set,
the multithreaded program exhibits these execution times:
Processors    Time (sec)
1             5.34
2             3.74
3             3.31
4             3.10
Is your co-worker on the right track? Would you advise your co-worker to continue the
parallelization effort?
Exercise 4
You've worked hard to convert a key application to multithreaded execution, and you've
benchmarked it on a quad-core processor. Here are the results:
Threads    Time (sec)
1          24.3
2          14.6
3          11.7
4          10.6
Exercise 5
You have benchmarked your multithreaded application on a system with CPU A, and it exhibits this
performance:
Threads    Time (sec)
1          14.20
2          7.81
3          5.87
4          4.72
Next you benchmark the same application on an otherwise identical system that has been upgraded
with a newer processor, CPU B, and it exhibits this performance:
Threads    Time (sec)
1          11.83
2          7.01
3          5.42
4          4.59
CPU B is clearly faster than CPU A: the execution times are lower in every case. Yet while the
single-thread run is 20% faster on CPU B, the four-thread run is only about 3% faster. Explain how
this can happen.
Exercise 6
Hard disk drives continue to improve in speed at a slower rate than microprocessors. What are the
implications of this trend for developers of multithreaded applications? What can be done about it?
Lab 7: Improving Parallel Performance

Time: Forty-five minutes
Exercise 1
Recall that the parallel quicksort program developed in Lab 5 exhibited poor performance because of
excessive contention among the tasks for access to the shared stack containing the indices of
unsorted sub-arrays. You can dramatically improve the performance by reducing the frequency at
which threads access the shared stack.
One way to reduce accesses to the shared stack is to switch to sequential quicksort for sub-arrays
smaller than a threshold size. In other words, when a thread encounters a sub-array smaller than
the threshold size and partitions it into two pieces, it does not put one piece on the stack and work
on the remaining piece. Instead, it sorts both pieces itself by recursively calling the sequential
quicksort function.
Use this strategy, and the sequential quicksort function given below, to improve the performance of
the parallel quicksort program you developed in Lab 5. Run some experiments to determine the best
threshold size for switching to sequential quicksort.
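A hedged sketch of the resulting dispatch logic inside the worker loop. THRESHOLD is a tuning constant you should vary experimentally, seq_quicksort is the sequential function referred to above, and push is a hypothetical helper that locks the shared stack before adding a sub-array:

#define THRESHOLD 1000 /* candidate value; tune experimentally */

if (last - first + 1 < THRESHOLD) {
/* Small sub-array: sort it locally, generating no stack traffic */
seq_quicksort (first, last);
first = last; /* nothing left to split off */
} else {
q = partition (first, last);
push (q+1, last); /* hypothetical helper guarding the shared stack */
last = q - 1;
}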
Exercise 2
The following C program counts the number of primes less than n. Use OpenMP pragmas and clauses
to enable it to run on a multiprocessor. Make as many changes as you can in the time allowed to
improve the performance of the program on the maximum available number of processors.
/*
 * Count the number of primes less than n
 */
#include <stdio.h>
#include <math.h>
#include <omp.h>
/*