Computer Structures - MPI
Faculty of Engineering
d) Replace the comment pseudo code in line 45 of the code with a single MPI call
to compute the global sum of all 'successful' dart tosses, ensuring that the result is
stored in the variable number_in_circle in process 0.
e) Compile and run your code with 10^9 tosses. Use four processes first, then use
one. Verify the results are correct. Is it faster with four? If so, why; if not, why not?
(Consider how many physical cores, or hyper-threaded ones, your system has.) How
many tosses do you need to obtain a stable estimate accurate to five decimal
places?
The Global_sum function uses a tree-structured global sum to compute the sum of the numbers
generated by the random function across the given number of processes. It takes four
parameters:
- int my_int => an arbitrary positive integer
- int my_rank => the rank (id) of the current process
- int comm_sz => the total number of processes
- MPI_Comm comm => the communicator
E.g. calculation of the partner rank: consider a run with 4 processes. If my_rank is 2, the
partner rank is 3, calculated by an XOR operation as below:
my_rank = 2 (010)
bitmask = 1 (001)
partner = my_rank ^ bitmask
        = 010 ^ 001
        = 011
        = 3
The function then checks whether my_rank is lower than the partner rank and whether the
partner rank is lower than the total number of processes; if so, it executes an MPI_Recv call,
which takes the value from the partner and adds it to the my_sum variable. The bitmask of that
process is then shifted left for the next level of the tree. Conversely, if my_rank is higher
than the partner rank, the value of the my_sum variable is sent to the partner process to be
summed there.
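The pairing logic described above can be illustrated without MPI by simulating the tree
reduction serially. In this sketch (the function name and array layout are illustrative, not
the assignment's actual code), vals[r] plays the role of my_sum on process r; the XOR pairing
and the lower-rank-receives rule are the same as in the MPI version.

```c
/* Serial simulation of the tree-structured global sum: vals[r] plays
 * the role of my_sum on process r. After the loops, vals[0] holds the
 * global sum, just as process 0 does in the MPI version. */
int tree_sum(int vals[], int comm_sz) {
    for (int bitmask = 1; bitmask < comm_sz; bitmask <<= 1) {
        for (int my_rank = 0; my_rank < comm_sz; my_rank++) {
            int partner = my_rank ^ bitmask;        /* XOR pairing */
            /* Only still-active, lower-ranked processes receive; a rank is
             * still active at this level iff my_rank % (bitmask << 1) == 0. */
            if (my_rank % (bitmask << 1) == 0 && partner < comm_sz)
                vals[my_rank] += vals[partner];     /* the "MPI_Recv + add" step */
        }
    }
    return vals[0];
}
```

Note that the partner < comm_sz check also makes the simulation work for process counts
that are not powers of two, mirroring the check described above.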
b) Hence, replace the commented pseudo code in lines 76 and 81 each with one MPI call, to receive
and send respectively, so that received data is stored in recvtemp and sent data comes from my_sum.
c) Describe in a few sentences the purpose and execution of the if statement in lines 38-49 in
the code.
The if statement is only executed in the root process (my_rank == 0). Pointers are used
to store and manage the addresses of dynamically allocated blocks of memory; such blocks are
used to store data objects or arrays of objects. A predefined fixed-size array might be
insufficient, or larger than required, to hold the data, so memory is allocated dynamically
instead. The int pointer "all_ints" is used for this: malloc allocates the requested number of
bytes (sizeof(int) * comm_sz) and returns a pointer to the first byte of the allocated space.
A gather call then collects my_int from every process into process 0, storing all summands in
the array "all_ints". All the values stored in "all_ints" are then printed. Finally, the sum is
displayed using the return value of the Global_sum function, and all the dynamically allocated
memory is freed.
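A minimal sketch of such a root-side gather-and-print block follows. The variable names
follow the description above; the surrounding MPI setup (MPI_Init, the values of my_rank and
comm_sz) is assumed, so this is an illustration rather than the assignment's exact code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch: gather one int per process on rank 0, print them, free the buffer.
 * Assumes MPI_Init has already been called. */
void gather_and_print(int my_int, int my_rank, int comm_sz, MPI_Comm comm) {
    int *all_ints = NULL;
    if (my_rank == 0)
        all_ints = malloc(sizeof(int) * comm_sz);  /* one slot per process */

    /* Every process contributes my_int; only rank 0 receives them all. */
    MPI_Gather(&my_int, 1, MPI_INT, all_ints, 1, MPI_INT, 0, comm);

    if (my_rank == 0) {
        for (int i = 0; i < comm_sz; i++)
            printf("%d ", all_ints[i]);   /* the values being summed */
        printf("\n");
        free(all_ints);                   /* release the dynamic memory */
    }
}
```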
d) Hence, replace the commented pseudo code in lines 40 and 48 each with one MPI call so that we
can output our data. Do not use more than one MPI call for each comment line replaced.
e) Compile and run your code with 7, 33 and 101 processes. Verify by other means the results
are correct.
The program's output was compared with a direct calculation, and the results turned out to be the
same.
f) Explain how you would use MPI with a single function call to compute this global sum. Only state
the code you would execute. There is no need to add this to the code but you could use this to
verify your results.
Here the MPI_Reduce function can be used, similar to Part 1. It takes the integer value from
every process, computes the sum in the root process, and stores the final value in a buffer
there.
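Stated as code, the single call could look like the following fragment (the buffer names are
assumptions following the description above: my_int is each process's local value and sum is
an int on rank 0 that receives the total):

```c
/* Sum every process's my_int into sum on the root (rank 0). */
MPI_Reduce(&my_int, &sum, 1, MPI_INT, MPI_SUM, 0, comm);
```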
Part 3:
a) Explain briefly the input and output of the function Floor_log coded at the bottom of
the code file.
b) Hence,
i. Replace the commented pseudo code in lines 81 and 85 each with one MPI
call respectively in order to send the value in my_sum to its partner and to
receive the value from the matching partner in recvtemp.
ii. Replace the commented pseudo code in line 94 with some MPI code making
the current process send to and receive from its partner such that sent data is
from my_sum and received data is stored in recvtemp, while ensuring that
your code will not deadlock.
iii. Replace the commented pseudo code in lines 102 and 106 each with one MPI
call respectively in order to receive the sent value in my_sum and send the
value my_sum to its partner.
c) Describe in a few sentences the purpose and execution of the if statement in lines 39-
55 in the code.
This if statement is only executed in the root process (my_rank == 0). Dynamically allocated
memory is managed through the int pointers 'all_ints' and 'sum_proc'; both arrays are
allocated with space according to comm_sz. The first gather call collects my_int from every
process into process 0, storing all summands in the array all_ints. A second gather call
collects each process's sum into process 0, storing them in the array sum_proc. After the
'Ints being summed' label, all the integer values are displayed, and after the 'Sums on the
processes' label, the per-process sums are displayed as well. Finally, all the dynamically
allocated memory is freed.
d) Hence, replace the commented pseudo code in lines 41 and 53 and lines 46 and 54
each with one MPI call (but matching pairs) to output the data. Do not use more than
one MPI call for each comment line replaced.
e) Compile and run your code with 7, 33 and 101 processes. Verify by other means the
results are correct.
The program's output was compared with a direct calculation, and the results turned out to
be the same.
f) Explain how you would use MPI with a single function call to compute this global sum
as a butterfly. Only state the code you would execute. There is no need to add this to
the code.
A butterfly global sum leaves the result on every process, so the single call to use is
MPI_Allreduce: it sums the my_int variable from every process and stores the total on all
processes, not just the root.
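Stated as code, the call could look like the following fragment (buffer names are assumptions
following the surrounding description):

```c
/* Butterfly-style global sum: every process receives the total in sum. */
MPI_Allreduce(&my_int, &sum, 1, MPI_INT, MPI_SUM, comm);
```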
Part 4
a) Briefly explain the functionality of the C++ function clock () and macro
CLOCKS_PER_SEC, and how you would use these to take performance measurements.
The clock() function returns the processor time consumed by the program. The value is
expressed in clock ticks, which are units of time of a constant but system-specific length
(with CLOCKS_PER_SEC clock ticks per second). The epoch used as reference by clock()
varies between systems, but it is related to the program execution (generally its launch). To
calculate the actual processing time of a section of a program, the value returned by clock()
is compared with a value returned by a previous call to the same function.
The macro CLOCKS_PER_SEC expands to an expression representing the number of clock ticks per
second, clock ticks being the units returned by clock(). Dividing a difference of clock tick
counts by this expression therefore yields the number of seconds elapsed.
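A minimal sketch of this measurement pattern in pure C (the busy loop is only a stand-in for
whatever code section is being timed):

```c
#include <time.h>

/* Returns the CPU seconds consumed by a simple busy loop, measured by
 * differencing two clock() readings and scaling by CLOCKS_PER_SEC. */
double timed_busy_loop(long iterations) {
    clock_t start = clock();                /* ticks before the work */
    volatile long sink = 0;                 /* volatile stops the loop being optimized away */
    for (long i = 0; i < iterations; i++)
        sink += i;                          /* the "work" being timed */
    clock_t finish = clock();               /* ticks after the work */
    return (double)(finish - start) / CLOCKS_PER_SEC;
}
```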
b) Similarly, briefly explain the functionality of the MPI call MPI_Wtime() and how you
would use it for measuring performance.
MPI_Wtime() returns the time in seconds since an arbitrary time in the past. That "time
in the past" is guaranteed not to change during the life of the process. The user is
responsible for converting large numbers of seconds to other units if they are preferred. The
times returned are local to the node that called the function; there is no requirement that
different nodes return "the same time."
clock()
The clock() function is of limited use for this purpose: it measures CPU time, not real
(wall-clock) time. On many implementations the resolution is poor, for example 1/100 of a
second; CLOCKS_PER_SEC is not the resolution, just the scale. With typical values of
CLOCKS_PER_SEC (POSIX requires it to be 1,000,000, for example), clock() can overflow in a
matter of minutes on 32-bit systems, after which it returns -1. Even so, clock() gives a rough
idea of the expected time. The returned value must be divided by CLOCKS_PER_SEC to obtain
the result in seconds.
MPI_Wtime()
MPI_Wtime() gives the real (wall-clock) time elapsed on each processor, and the function is
intended to be high-resolution. Clock synchronization across nodes can be checked via the
MPI_WTIME_IS_GLOBAL attribute. It is more portable than the clock() function explained
earlier because it returns seconds rather than "ticks" and carries no unnecessary baggage.
MPI_Wtime() reports the current time on that processor, which is quite different from CPU
time: if the program sleeps for one minute, MPI_Wtime() moves 60 seconds forward, whereas
clock() (except on Windows) would be unchanged.
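A minimal sketch of the MPI_Wtime() measurement pattern (the barrier is only a stand-in for
the code section being timed, and the function must be called between MPI_Init and
MPI_Finalize):

```c
#include <mpi.h>

/* Sketch: wall-clock timing of a code section with MPI_Wtime(). */
double wall_seconds_of_section(MPI_Comm comm) {
    double start = MPI_Wtime();   /* seconds since an arbitrary past time */
    MPI_Barrier(comm);            /* the section being timed */
    double finish = MPI_Wtime();
    return finish - start;        /* already in seconds, no scaling needed */
}
```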
d) Examine the code for the function ping-pong. Why does the if statement reverse the
order of the MPI send and receive calls for processes 0 and 1?
The ping operation sends a message from process 0 to process 1, and the pong operation sends
a message back from process 1 to process 0. Therefore, when process 0 runs this section of
the program, the message must first be sent from process 0 to process 1, while at the same
time process 1 must receive the message sent by process 0. Similarly, when process 1 runs
this section, a message must be sent back from process 1 to process 0, while process 0
receives it. For these reasons the if statement reverses the order of the MPI send and
receive calls for processes 0 and 1.
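A minimal sketch of that reversed ordering (the buffer, count, and tag are illustrative, not
taken from the assignment's code):

```c
#include <mpi.h>

/* Sketch of one ping-pong round between ranks 0 and 1. Reversing the
 * send/receive order per rank ensures the two processes are never both
 * blocked in MPI_Recv at the same time. */
void ping_pong_once(char *buf, int count, int my_rank, MPI_Comm comm) {
    if (my_rank == 0) {
        MPI_Send(buf, count, MPI_CHAR, 1, 0, comm);                     /* ping */
        MPI_Recv(buf, count, MPI_CHAR, 1, 0, comm, MPI_STATUS_IGNORE);  /* pong */
    } else if (my_rank == 1) {
        MPI_Recv(buf, count, MPI_CHAR, 0, 0, comm, MPI_STATUS_IGNORE);  /* ping */
        MPI_Send(buf, count, MPI_CHAR, 0, 0, comm);                     /* pong */
    }
}
```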
e) State what message is being sent for the ping-pong and what range of message sizes
are being timed.
f) Hence replace the commented pseudo code in lines 120 and 129 each with one line of
code to calculate and return the value for the total elapsed time for the series of ping-
pongs with clock().
g) Similarly replace the commented pseudo code in lines 118 and 127 each with
one line of code to calculate and return the value of the total elapsed time for the
series of ping-pongs with MPI_Wtime().
h) Compile and run your code for both methods. Measurements are produced for various
message sizes. Now plot and briefly discuss your results. Are the results for clock()
reliable? What about MPI_Wtime()?
[Chart: Lab 1 - Part 4. Elapsed time (0 to 2.50E-02 s) versus message size (1 to 65536) for
the two timing methods, plotted as Series1 and Series2.]
The difference between the values of the two methods is due to the number of cores assigned to
the virtual machine, which has only one of the CPU's two cores. It is therefore concluded that
MPI_Wtime() is more reliable than clock().
Part 5
a. Consider the MPI code provided which is functioning and simply prints one
message per process. Unfortunately, the code produces a slightly different output
each time it is run. Modify the code such that the same message as in the original
code is printed for each process but ensure that they appear in exactly the same order
whenever the code is run. The only constraint is to do so with the minimum possible
changes. Explain your solution.
The source parameter of MPI_Recv() was changed from MPI_ANY_SOURCE to src, so the root
receives the messages in rank order rather than in arrival order. Thus the changed line:
MPI_Recv(msg, MAX_STRING, MPI_CHAR, src, 0, comm, MPI_STATUS_IGNORE);
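A sketch of the resulting receive loop on the root follows; the greeting text and loop
structure follow the usual version of this example and are assumptions about the surrounding
code, with the changed MPI_Recv line shown in context.

```c
#include <stdio.h>
#include <mpi.h>

#define MAX_STRING 100

/* Sketch: rank 0 prints its own message, then receives from ranks
 * 1..comm_sz-1 in order. Fixing the source to src (instead of
 * MPI_ANY_SOURCE) forces a deterministic output order. */
void print_in_rank_order(int my_rank, int comm_sz, MPI_Comm comm) {
    char msg[MAX_STRING];
    if (my_rank != 0) {
        snprintf(msg, MAX_STRING, "Greetings from process %d!", my_rank);
        MPI_Send(msg, MAX_STRING, MPI_CHAR, 0, 0, comm);
    } else {
        printf("Greetings from process 0!\n");
        for (int src = 1; src < comm_sz; src++) {
            /* The changed line: the source is src, not MPI_ANY_SOURCE. */
            MPI_Recv(msg, MAX_STRING, MPI_CHAR, src, 0, comm,
                     MPI_STATUS_IGNORE);
            printf("%s\n", msg);
        }
    }
}
```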