Introduction to Parallel Programming - Student Workbook with Instructor's Notes
The information contained in this document is provided for informational purposes only and represents the current view of Intel Corporation ("Intel") and
its contributors ("Contributors") as of the date of publication. Intel and the Contributors make no commitment to update the information contained
in this document, and Intel reserves the right to make changes at any time, without notice.
Legal Lines and Disclaimers
DISCLAIMER. THIS DOCUMENT IS PROVIDED "AS IS." NEITHER INTEL NOR THE CONTRIBUTORS MAKE ANY REPRESENTATIONS OF ANY KIND WITH
RESPECT TO PRODUCTS REFERENCED HEREIN, WHETHER SUCH PRODUCTS ARE THOSE OF INTEL, THE CONTRIBUTORS, OR THIRD PARTIES. INTEL,
AND ITS CONTRIBUTORS EXPRESSLY DISCLAIM ANY AND ALL WARRANTIES, IMPLIED OR EXPRESS, INCLUDING WITHOUT LIMITATION, ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR ANY PARTICULAR PURPOSE, NON-INFRINGEMENT, AND ANY WARRANTY ARISING OUT OF THE
INFORMATION CONTAINED HEREIN, INCLUDING WITHOUT LIMITATION, ANY PRODUCTS, SPECIFICATIONS, OR OTHER MATERIALS REFERENCED
HEREIN. INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT THIS DOCUMENT IS FREE FROM ERRORS, OR THAT ANY PRODUCTS OR OTHER
TECHNOLOGY DEVELOPED IN CONFORMANCE WITH THIS DOCUMENT WILL PERFORM IN THE INTENDED MANNER, OR WILL BE FREE FROM
INFRINGEMENT OF THIRD PARTY PROPRIETARY RIGHTS, AND INTEL, AND ITS CONTRIBUTORS DISCLAIM ALL LIABILITY THEREFOR.
INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT ANY PRODUCT REFERENCED HEREIN OR ANY PRODUCT OR TECHNOLOGY DEVELOPED IN
RELIANCE UPON THIS DOCUMENT, IN WHOLE OR IN PART, WILL BE SUFFICIENT, ACCURATE, RELIABLE, COMPLETE, FREE FROM DEFECTS OR SAFE FOR
ITS INTENDED PURPOSE, AND HEREBY DISCLAIM ALL LIABILITIES THEREFOR. ANY PERSON MAKING, USING OR SELLING SUCH PRODUCT OR
TECHNOLOGY DOES SO AT HIS OR HER OWN RISK.
Licenses may be required. Intel, its contributors and others may have patents or pending patent applications, trademarks, copyrights or other
intellectual property rights covering subject matter contained or described in this document. No license, express or implied, by estoppel or otherwise,
to any intellectual property rights of Intel or any other party is granted herein. It is your responsibility to seek licenses for such intellectual property
rights from Intel and others where appropriate.
Limited License Grant. Intel hereby grants you a limited copyright license to copy this document for your use and internal distribution only. You may not
distribute this document externally, in whole or in part, to any other person or entity.
LIMITED LIABILITY. IN NO EVENT SHALL INTEL, OR ITS CONTRIBUTORS HAVE ANY LIABILITY TO YOU OR TO ANY OTHER THIRD PARTY, FOR ANY LOST
PROFITS, LOST DATA, LOSS OF USE OR COSTS OF PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, OR FOR ANY DIRECT, INDIRECT, SPECIAL OR
CONSEQUENTIAL DAMAGES ARISING OUT OF YOUR USE OF THIS DOCUMENT OR RELIANCE UPON THE INFORMATION CONTAINED HEREIN, UNDER ANY
CAUSE OF ACTION OR THEORY OF LIABILITY, AND IRRESPECTIVE OF WHETHER INTEL, OR ANY CONTRIBUTOR HAS ADVANCE NOTICE OF THE
POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS SHALL APPLY NOTWITHSTANDING THE FAILURE OF THE ESSENTIAL PURPOSE OF ANY LIMITED
REMEDY.
Intel and Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright 2007, Intel Corporation. All Rights Reserved.
Contents
Lab 1: Identifying Parallelism
Lab 2: Introducing Threads
Lab 3: Domain Decomposition with OpenMP
Lab 4: Critical Sections and Reductions with OpenMP
Lab 5: Implementing Task Decompositions
Lab 6: Analyzing Parallel Performance
Lab 7: Improving Parallel Performance
Lab 8: Choosing the Appropriate Thread Model
Instructor's Notes and Solutions
Lab 1: Identifying Parallelism

Time: Thirty minutes
Part A
For each of the following code segments, draw a dependence graph and determine whether the
computation is suitable for parallelization. If the computation is suitable for parallelization, decide
how it should be divided among three CPUs. You may assume that all functions are free of side
effects.
Example 1:
for (i = 0; i < 4; i++) {
a[i] = 0.25 * i;
b[i] = 4.0 / (a[i] * a[i]);
}
Example 2:
if (a < b) c = f(-1);
else if (a == b) c = f(0);
else c = f(1);
Example 3:
for (i = 0; i < 4; i++)
for (j = 0; j < 3; j++)
a[i][j] = f(a[i][j] * b[j]);
Example 4:
prime = 2;
do {
first = prime * prime;
for (i = first; i < N; i += prime)
marked[i] = 1;
while (marked[++prime]);
} while (prime * prime < N);
Example 5:
switch (i) {
case 0:
a = f(x);
b = g(y);
break;
case 1:
a = g(x);
b = f(y);
break;
case -1:
a = f(y);
b = f(x);
break;
}
Example 6:
sum = 0.0;
for (i = 0; i < 9; i++)
sum = sum + b[i];
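For reference once OpenMP is introduced in Lab 3: a running sum like Example 6 is the textbook reduction pattern. A minimal hedged sketch (the array values below are made up for illustration; compiled without OpenMP support, the pragma is simply ignored and the loop runs sequentially):

#include <stdio.h>

int main (void)
{
   double b[9] = {0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5};
   double sum = 0.0;
   int i;

   /* Each thread accumulates a private partial sum over its share of
      the iterations; OpenMP combines the partial sums into 'sum'. */
   #pragma omp parallel for reduction(+:sum)
   for (i = 0; i < 9; i++)
      sum = sum + b[i];

   printf ("sum = %f\n", sum);
   return 0;
}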
Part B
Describe how parallelism could be used to reduce the time needed to perform each of the following
tasks.
Example 7:
A relational database table contains (among other things) student ID numbers and their cumulative
GPAs. Find out the percentage of students with a cumulative GPA greater than 3.5.
Example 8:
A ray-tracing program renders a realistic image by tracing one or more rays for each pixel of the
display window.
Example 9:
An operating system utility searches a disk and identifies every text file containing a particular
phrase specified by the user.
Example 10:
We want to improve a game similar to Civilization IV by reducing the amount of time the human
player must wait for the virtual world to be set up.
Lab 2: Introducing Threads

Time: Thirty minutes

For each of the following code segments:
1. identify the computations that can be performed in parallel;
2. decide whether the best thread model is the fork/join model or the general threads model;
3. determine fork/join points (in the case of the fork/join model) or thread creation points (in the case of the general threads model); and
4. decide which variables should be shared and which variables should be private.
Example 1:
/* Matrix multiplication */
int i, j, k;
double **a, **b, **c, tmp;
...
for (i = 0; i < m; i++)
for (j = 0; j < n; j++) {
tmp = 0.0;
for (k = 0; k < p; k++)
tmp += a[i][k] * b[k][j];
c[i][j] = tmp;
}
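One plausible fork/join answer for this example, shown here only as a point of reference and not the sole solution: split the outer loop across threads, with the matrices and dimensions shared and j, k, and tmp private (i becomes private automatically as the parallel loop index):

#pragma omp parallel for private(j, k, tmp)
for (i = 0; i < m; i++)
for (j = 0; j < n; j++) {
tmp = 0.0; /* per-thread running total for c[i][j] */
for (k = 0; k < p; k++)
tmp += a[i][k] * b[k][j];
c[i][j] = tmp; /* threads write disjoint rows of c */
}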
Example 2:
/* This program implements an Internet-based service that
responds to number-theoretic queries */
int main() {
request r;
...
while(1) {
next_request(&r);
acknowledge_request (r);
switch (r.type) {
case PRIME:
primality_test (r);
break;
case PERFECT:
perfect_test (r);
break;
case WARING:
find_waring_integer (r);
break;
}
}
...
}
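This loop runs indefinitely and each request can be serviced independently, which points toward the general threads model rather than fork/join. A hedged Pthreads sketch of the dispatch loop; handle_request is a hypothetical wrapper around the switch statement above, and the request type and helper functions are assumed from the fragment:

#include <pthread.h>
#include <stdlib.h>

void *handle_request (void *arg) /* hypothetical wrapper */
{
request *r = (request *) arg;
/* ... dispatch on r->type exactly as in the switch above ... */
free (r);
return NULL;
}

int main ()
{
pthread_t tid;
request *r;
while (1) {
r = (request *) malloc (sizeof(request));
next_request (r);
acknowledge_request (*r);
/* One detached thread per request; main loops back at once */
pthread_create (&tid, NULL, handle_request, r);
pthread_detach (tid);
}
}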
Example 3:
double inner_product (double *x, double *y, int n)
{
int i;
double result;
result = 0.0;
for (i = 0; i < n; i++)
result += x[i] * y[i];
return result;
}
int main (int argc, char *argv[])
{
double *d, *g, *t, w, x, y, z;
int i, n;
...
for (i = 0; i < n; i++)
d[i] = -g[i] + (w/x) * d[i];
y = inner_product (d, g, n);
z = inner_product (d, t, n);
...
}
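Worth noting when you analyze this example: the two inner_product calls read d, g, and t but write only y and z, so they are independent of each other. A hedged OpenMP sketch of running them as two tasks:

/* The two dot products share no written data, so each can be
   assigned to a different thread. */
#pragma omp parallel sections
{
#pragma omp section
y = inner_product (d, g, n);
#pragma omp section
z = inner_product (d, t, n);
}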
Example 4:
/* Finite difference method to solve string vibration
problem (from Michael J. Quinn, Parallel Programming
in C with MPI and OpenMP, p. 325) */
#include <stdio.h>
#include <math.h>
#define F(x) sin(3.14159*(x))
#define G(x) 0.0
#define a 1.0
#define c 2.0
#define m 2000
#define n 1000
#define T 1.0
int main ()
{
float h, k, L;
int i, j;
float u[m+1][n+1];
h = a / n;
k = T / m;
L = (k*c/h)*(k*c/h);
for (j = 0; j <= m; j++) u[j][0] = u[j][n] = 0.0;
for (i = 1; i < n; i++) u[0][i] = F(i*h);
for (i = 1; i < n; i++)
u[1][i] = (L/2.0)*(u[0][i+1] + u[0][i-1]) +
(1.0 - L) * u[0][i] + k * G(i*h);
for (j = 1; j < m; j++)
for (i = 1; i < n; i++)
u[j+1][i] = 2.0*(1.0 - L) * u[j][i] +
L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
for (j = 0; j <= m; j++) {
for (i = 0; i <= n; i++) printf ("%6.3f ", u[j][i]);
putchar ('\n');
}
return 0;
}
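A hint for the analysis: each time step j reads the two preceding time steps, so the outer loop is inherently sequential; only the sweep over i within one time step can be parallelized. A hedged sketch:

for (j = 1; j < m; j++) {
/* Within one time step, u[j+1][i] reads only rows j and j-1,
   so the iterations over i are independent. */
#pragma omp parallel for
for (i = 1; i < n; i++)
u[j+1][i] = 2.0*(1.0 - L) * u[j][i] +
L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
}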
Lab 3: Domain Decomposition with OpenMP

Time: Fifty minutes
Note: You will need to generate matrices for the matrix multiplication exercise; a utility
program gen.c is included in the lab folder for this purpose. Compile this code, and run it
to create files matrix_a and matrix_b; explicit usage is outlined in the code itself. Be sure
to generate a workload sufficiently large (e.g., matrix dimensions 1000 x 1000) to be
meaningful.
Program 1:
/*
 * Matrix multiplication
 */
#include <stdio.h>
/*
 * Function 'read_matrix' reads a matrix stored in a file.
 */
void read_matrix (
char *s, /* File name */
float ***subs, /* 2D submatrix indices */
int *m, /* Number of rows in matrix */
int *n) /* Number of columns in matrix */
{
char error_msg[80];
FILE *fptr;
Program 2:
/*
 * Polynomial Interpolation
 *
 * This program demonstrates a function that performs polynomial
 * interpolation. The function is taken from "Numerical Recipes
 * in C", Second Edition, by William H. Press, Saul A. Teukolsky,
 * William T. Vetterling, and Brian P. Flannery.
 */
#include <math.h>
#define N 20
#define X 14.5
}
free_vector (d, 1, n);
free_vector (c, 1, n);
}
/* Functions 'sign' and 'init' are used to initialize the
x and y vectors holding known values of the function.
*/
int sign (int j)
{
if (j % 2 == 0) return 1;
else return -1;
}
void init (int i, double *x, double *y)
{
int j;
*x = (double) i;
*y = sin(i);
}
/* Function 'main' demonstrates the polynomial interpolation function
by generating some test points and then calling 'polint' with a
value of x between two of the test points. */
int main (int argc, char *argv[])
{
double x, y, dy;
double *xa, *ya;
int i;
xa = vector (1, N);
ya = vector (1, N);
/* Initialize xa's and ya's */
for (i = 1; i <= N; i++) {
init (i, &xa[i], &ya[i]);
printf ("f(%4.2f) = %13.11f\n", xa[i], ya[i]);
}
/* Interpolate polynomial at X */
polint (xa, ya, N, X, &y, &dy);
printf ("\nf(%6.3f) = %13.11f with error bound %13.11f\n", X, y,
fabs(dy));
free_vector (xa, 1, N);
free_vector (ya, 1, N);
return 0;
}
Lab 4: Critical Sections and Reductions with OpenMP

Time: Twenty minutes
Exercise 1
Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the
program, execute it on 1 and 2 threads, and make sure the program output is the same as the
sequential program. Finally, compare the execution times of the sequential, single-threaded, and
double-threaded programs.
/*
 *
 */
Exercise 2
Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the
program, execute it on 1 and 2 threads, and make sure the program output is the same as the
sequential program. Finally, compare the execution times of the sequential, single-threaded, and
double-threaded programs.
/*
 * Sieve of Eratosthenes
 *
 * This program counts the number of primes less than or equal to n.
 */
#include <stdio.h>
#include <stdlib.h>
#define MIN(a,b) ((a)<(b)?(a):(b))
int main (int argc, char *argv[])
{
int count; /* Prime count */
int first; /* Index of first multiple */
int i;
int index; /* Index of current prime */
char *marked; /* Marks for 2,...,'n' */
int n; /* Sieving from 2, ..., 'n' */
int prime; /* Current prime */
if (argc != 2) {
printf ("Command line: %s <m>\n", argv[0]);
exit (1);
}
n = atoi(argv[1]);
marked = (char *) malloc (n-1);
if (marked == NULL) {
printf ("Cannot allocate enough memory\n");
exit (1);
}
for (i = 0; i < n-1; i++) marked[i] = 1;
index = 0;
prime = 2;
do {
first = prime * prime - 2;
for (i = first; i < n-1; i += prime) marked[i] = 0;
while (!marked[++index]);
prime = index + 2;
} while (prime * prime <= n);
count = 0;
for (i = 0; i < n-1; i++)
count += marked[i];
printf ("There are %d primes less than or equal to %d\n", count, n);
return 0;
}
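One hedged starting point, not the whole answer: the final counting loop is a straightforward reduction, while the marking loop needs more care because successive primes depend on earlier marking.

/* Partial counts are accumulated per thread and combined at the end */
count = 0;
#pragma omp parallel for reduction(+:count)
for (i = 0; i < n-1; i++)
count += marked[i];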
Exercise 3
The Monte Carlo method refers to the use of statistical sampling to solve a problem. Some experts
say that more than half of all supercomputing cycles are devoted to Monte Carlo computations. A
Monte Carlo program can benefit from parallel processing in two ways: it can find a solution of a given
accuracy in less time, or it can find a more accurate solution in the same amount of time. This
assignment focuses on the first: reducing the time needed to reach a given accuracy. The following C program uses the
Monte Carlo method to come up with an approximation to pi. Add OpenMP directives to make the
program suitable for execution on multiple threads. Divide the number of points to be generated
evenly among the threads. Compare the execution times of the sequential, single-threaded, and
double-threaded programs.
/*
 * Monte Carlo computation of pi
 */
#include <stdio.h>
int main (int argc, char *argv[])
{
int count; /* Points inside unit circle */
int i;
int samples; /* Number of points to generate */
unsigned short xi[3]; /* Random number seed */
double x, y; /* Coordinates of point */
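A hedged sketch of how the sampling loop could look once parallelized. It assumes the usual erand48 sampling of the unit square, and uses omp_get_thread_num (from <omp.h>) to give each thread a distinct seed, which matters because erand48 updates the seed array it is handed:

count = 0;
#pragma omp parallel private(x, y) firstprivate(xi) reduction(+:count)
{
xi[2] = omp_get_thread_num(); /* distinct stream per thread (assumed scheme) */
#pragma omp for
for (i = 0; i < samples; i++) {
x = erand48(xi); /* point in the unit square */
y = erand48(xi);
if (x*x + y*y <= 1.0) count++; /* inside the quarter circle */
}
}
printf ("Estimate of pi: %8.6f\n", 4.0 * (double) count / samples);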
Lab 5: Implementing Task Decompositions

Time: Sixty minutes
Exercise 1
Make this quicksort program parallel by adding the appropriate OpenMP pragmas and clauses.
Compile the program, execute it on 1 and 2 threads, and make sure the program is still correctly
sorting the elements of array A. Finally, compare the execution times of the sequential, single-threaded, and double-threaded programs.
/*
 * Stack-based Quicksort
 *
 * The quicksort algorithm works by repeatedly dividing unsorted
 * sub-arrays into two pieces: one piece containing the smaller
 * elements and the other piece containing the larger elements.
 * The splitter element, used to subdivide the unsorted sub-array,
 * ends up in its sorted location. By repeating this process on
 * smaller and smaller sub-arrays, the entire array gets sorted.
 *
 * The typical implementation of quicksort uses recursion. This
 * implementation replaces recursion with iteration. It manages its
 * own stack of unsorted sub-arrays. When the stack of unsorted
 * sub-arrays is empty, the array is sorted.
 */
#include <stdio.h>
#include <stdlib.h>
#define MAX_UNFINISHED 1000
int unfinished_index; /* Index of top of stack */
float *A; /* Array of elements to sort */
int n; /* Number of elements in A */
struct {
int first; /* Low index of unsorted sub-array */
int last; /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED]; /* Stack of unsorted sub-arrays */
/* Function 'swap' exchanges the values of two floats. */
void swap (float *x, float *y)
{
float tmp;
tmp = *x;
*x = *y;
*y = tmp;
}
/* Function 'partition' actually does the sorting by dividing an
unsorted sub-array into two parts: those less than or equal to the
splitter, and those greater than the splitter. The splitter is the
last element in the unsorted sub-array. The splitter ends up in its
final, sorted location. The function returns the final location of
the splitter (its index). This code is an implementation of the
algorithm appearing in Introduction to Algorithms, Second Edition,
by Cormen, Leiserson, Rivest, and Stein (The MIT Press, 2001). */
int partition (int first, int last)
{
int i, j;
float x;
x = A[last];
i = first - 1;
for (j = first; j < last; j++)
if (A[j] <= x) {
i++;
swap (&A[i], &A[j]);
}
swap (&A[i+1], &A[last]);
return (i+1);
}
/* Function 'quicksort' repeatedly retrieves the indices of unsorted
sub-arrays from the stack and calls 'partition' to divide these
sub-arrays into two pieces. It keeps one of the pieces and puts the
other piece on the stack of unsorted sub-arrays. Eventually it ends
up with a piece that doesn't need to be sorted. At this point it
gets the indices of another unsorted sub-array from the stack. The
function continues until the stack is empty. */
void quicksort (void)
{
int first;
int last;
int my_index;
int q;
/* Split point in array */
while (unfinished_index >= 0) {
my_index = unfinished_index;
unfinished_index--;
first = unfinished[my_index].first;
last = unfinished[my_index].last;
while (first < last) {
q = partition (first, last);
/* Put one piece on the stack of unsorted sub-arrays;
   keep sorting the other piece */
unfinished_index++;
unfinished[unfinished_index].first = q+1;
unfinished[unfinished_index].last = last;
last = q - 1;
}
}
}
int main (int argc, char *argv[])
{
int i;
int seed; /* Seed for random number generator */
unsigned short xi[3]; /* State for erand48 */
if (argc != 3) {
printf ("Command-line syntax: %s <n> <seed>\n", argv[0]);
exit (-1);
}
seed = atoi (argv[2]);
xi[0] = xi[1] = xi[2] = seed;
n = atoi (argv[1]);
A = (float *) malloc (n * sizeof(float));
for (i = 0; i < n; i++)
A[i] = erand48(xi);
/*
print_float_array (A, n);
*/
unfinished[0].first = 0;
unfinished[0].last = n-1;
unfinished_index = 0;
quicksort ();
/*
print_float_array (A, n);
*/
if (verify_sorted (A, n)) printf ("Elements are sorted\n");
else printf ("ERROR: Elements are NOT sorted\n");
return 0;
}
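A hedged hint for the parallelization: unfinished, unfinished_index, and the elements of A are shared, so every push and pop of the stack must be made atomic. One minimal way to guard the pop is shown below; the push needs the same treatment, and termination detection for threads that find the stack empty is deliberately left to you:

#pragma omp critical
{
my_index = unfinished_index;
unfinished_index--;
}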
Lab 6: Analyzing Parallel Performance

Time: Thirty-five minutes
Exercise 1
You are responsible for maintaining a library of core functions used by a wide variety of programs in
an application suite. Your supervisor has noted the availability of multi-core processors and wants to
know whether rewriting the library of functions using threads would significantly improve the
performance of the programs in the application suite. What do you need to do to provide a
meaningful answer?
Exercise 2
Somebody wrote an OpenMP program to solve the problem posed in Lab 5 and benchmarked its
performance sorting 25 million keys. Here are the run times of the program, as reported by the
command-line utility time:
Threads
1
2
3
4
What is the efficiency of the multithreaded program for 2, 3, and 4 threads? What can you conclude
about the design of the parallel program? Can you offer any suggestions for improving the
performance of the program?
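For reference, a hedged reminder of the definitions (the numbers below are made up for illustration): speedup on p threads is S(p) = T(1) / T(p), and efficiency is E(p) = S(p) / p. For example, a run that takes 10.0 s on one thread and 4.0 s on four threads has speedup 10.0/4.0 = 2.5 and efficiency 2.5/4 = 0.625.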
Exercise 3
A co-worker has been working on converting a sequential program into a multithreaded program. At
this point, only some of the functions of the program have been made parallel. On a key data set,
the multithreaded program exhibits these execution times:
Processors    Time (sec)
1             5.34
2             3.74
3             3.31
4             3.10
Is your co-worker on the right track? Would you advise your co-worker to continue the
parallelization effort?
Exercise 4
You've worked hard to convert a key application to multithreaded execution, and you've
benchmarked it on a quad-core processor. Here are the results:
Threads    Time (sec)
1          24.3
2          14.6
3          11.7
4          10.6
Exercise 5
You have benchmarked your multithreaded application on a system with CPU A, and it exhibits this
performance:
Threads    Time (sec)
1          14.20
2          7.81
3          5.87
4          4.72
Next you benchmark the same application on an otherwise identical system that has been upgraded
with a newer processor, CPU B, and it exhibits this performance:
Threads    Time (sec)
1          11.83
2          7.01
3          5.42
4          4.59
CPU B is clearly faster than CPU A: the execution times are lower in every case. Yet while the
single-thread run is 20% faster on CPU B, the four-thread run is only about 3% faster. Explain how
this can happen.
Exercise 6
Hard disk drives continue to improve in speed at a slower rate than microprocessors. What are the
implications of this trend for developers of multithreaded applications? What can be done about it?
Lab 7: Improving Parallel Performance

Time: Forty-five minutes
Exercise 1
Recall that the parallel quicksort program developed in Lab 5 exhibited poor performance because of
excessive contention among the tasks for access to the shared stack containing the indices of
unsorted sub-arrays. You can dramatically improve the performance by reducing the frequency at
which threads access the shared stack.
One way to reduce accesses to the shared stack is to switch to sequential quicksort for sub-arrays
smaller than a threshold size. In other words, when a thread encounters a sub-array smaller than
the threshold size and partitions it into two pieces, it does not put one piece on the stack and work
on the remaining piece. Instead, it sorts both pieces itself by recursively calling the sequential
quicksort function.
Use this strategy, and the sequential quicksort function given below, to improve the performance of
the parallel quicksort program you developed in Lab 5. Run some experiments to determine the best
threshold size for switching to sequential quicksort.
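A hedged sketch of the resulting dispatch logic inside the worker loop. THRESHOLD is a tuning constant you should vary experimentally, seq_quicksort is the sequential function referred to above, and push is a hypothetical helper that locks the shared stack before adding a sub-array:

#define THRESHOLD 1000 /* candidate value; tune experimentally */

if (last - first + 1 < THRESHOLD) {
/* Small sub-array: sort it locally, generating no stack traffic */
seq_quicksort (first, last);
first = last; /* nothing left to split off */
} else {
q = partition (first, last);
push (q+1, last); /* hypothetical helper guarding the shared stack */
last = q - 1;
}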
Exercise 2
The following C program counts the number of primes less than n. Use OpenMP pragmas and clauses
to enable it to run on a multiprocessor. Make as many changes as you can in the time allowed to
improve the performance of the program on the maximum available number of processors.
/*
 * Count the number of primes less than n
 */
#include <stdio.h>
#include <math.h>
#include <omp.h>
/*