
TERM PAPER

OF
ADVANCED COMPUTER ARCHITECTURE

ON
USE OPENMP OR ANY OTHER API TO PARALLELIZE A MATRIX
MULTIPLICATION OF 5000 BY 5000 IN C/C++ FOR A DUAL-CORE
MACHINE. USE INTEL TOOLS TO ANALYZE THE CHANGE IN
PERFORMANCE. (FOR MORE THAN ONE STUDENT TOGETHER)

Submitted To: Ms. Pency
Submitted By: Nakul Kumar
Roll no: A (07)
Reg. no: 10901295
Sec. no: RS1906
TABLE OF CONTENTS:

 Acknowledgement
 Computer Architecture
 Matrix Multiplication
 CREW Matrix Multiplication
 EREW Matrix Multiplication
 CRCW Matrix Multiplication
 Parallel Matrix Multiplication:
 Dumb
 Standard
 Single
 Unsafe Single
 Jagged
 Jagged from C++
 Stack Allocated
 Parallel Algorithms for Matrix Multiplication
 Optimizing the Matrix Multiplication
 Optimizing the Parallel Matrix Multiplication
 References
ACKNOWLEDGEMENT

The successful completion of any task would be incomplete without mentioning the people who made it possible. It is therefore with gratitude that I acknowledge the help which crowned my efforts with success.

Life is a process of accumulating and discharging debts, and not all of them can be measured. I cannot hope to discharge them with simple words of thanks, but I can certainly acknowledge them.

I owe my gratitude to Ms. Pency, Lecturer, LSM, for her constant guidance and support.

I would also like to thank the various department officials and staff who not only provided me with the required opportunity but also extended their valuable time; I have no words to express my gratefulness to them.

Last but not least, I am very much indebted to my family and friends for their warm encouragement and moral support in conducting this project work.

NAKUL KUMAR

Computer Architecture:

Computer architecture, or digital computer organization, is the conceptual design and fundamental operational structure of a computer system. It is a blueprint and functional description of requirements and design implementations for the various parts of a computer, focusing largely on the way the central processing unit (CPU) operates internally and accesses addresses in memory.

Computer architecture comprises at least three main subcategories:

Instruction set architecture (ISA): the abstract image of a computing system as seen by a machine-language programmer, including the instruction set, word size, memory addressing modes, processor registers, and address and data formats.

Microarchitecture: also known as computer organization, a lower-level, more concrete and detailed description of the system that covers how the constituent parts of the system are interconnected and how they interoperate to implement the ISA.

System design: all of the other hardware components within a computing system, such as:

 System interconnects such as computer buses and switches
 Memory controllers and hierarchies
 CPU off-load mechanisms such as direct memory access (DMA)
 Multiprocessing issues

There are many types of computer architecture:

 Quantum computer vs. chemical computer
 Scalar processor vs. vector processor
 Non-Uniform Memory Access (NUMA) computers
 Register machine vs. stack machine
 Harvard architecture vs. von Neumann architecture
 Cellular architecture

Matrix Multiplication:

Matrix-matrix multiplication is a fundamental kernel, one which can achieve high efficiency in both theory and practice. First, some caveats and assumptions:

 This material is for dense matrices, ones with few zeros, so that the matrix is efficiently stored in a 2D array.
 Distinguish between a matrix and an array. The first is a mathematical object, a rectangular arrangement of numbers usually indexed by an integer pair (i, j) [that starts indexing from 1, BTW]. The second is a computer data structure, which can be used to hold a matrix and which might be indexed starting from 0, 1, or anything convenient.
 A load-store analysis shows that the memory-reference-to-flop ratio for matrix-matrix multiply is O(1/n): three n^2 arrays are read or written while 2n^3 flops are performed, so the ratio is roughly 3/(2n). Hence it should be implementable with near-peak performance on a cache-based serial computer.
 The BLAS function for matrix-matrix multiply is dgemm, which is faster to type.
 There are "reduced order" algorithms (Strassen, Winograd) which use extra memory but compute the product in fewer than 2n^3 flops; Strassen's exponent is about 2.81. Only the standard algorithm is considered here, because the reduced-order techniques can always be applied on a single process in the parallel versions. Also, the basic observation that BLAS matrix-matrix multiply has a memory-reference-to-flop ratio that goes to zero as n increases still holds.
 Few modern applications really need matrix-matrix multiplication with dense matrices. It is more of a toy, and more of a diagnostic for a system than a useful kernel: if 85% of the theoretical peak performance cannot be achieved on a machine, then the machine is flawed in some way: OS, compiler, or hardware.

The program is simply a repeated matrix multiplication of two different 100x100 matrices with a third matrix. The multiplication series is repeated 100 times.

This time, I only tested Matrix Lab and TONS.

Language          Time (seconds)
Matrix Lab        9.2
TONS (1 CPU)      20.5
TONS (2 CPUs)     10.9

That looks more like a reasonable result from Matrix Lab. The matrix multiplication I implemented is a quick hack, and it is far from being fast. Obviously, Matrix Lab has efficient multiplication routines implemented, so even though its virtual machine, or interpreter, is slow as molasses, Matrix Lab is twice as fast as TONS on one CPU.

We scale almost linearly in performance as the extra CPU is taken into use. This is because the amount of work done in the two independent loops (which, of course, the loop is transformed into as we add the second CPU) is the same: neither node waits for the other.

If we added a third CPU, it would never be taken into use. This code simply does not parallelize onto three or more CPUs with the current state of the TONS parallelizer. I do not see an easy way of parallelizing this program any further at the virtual-machine opcode level without changing the order in which things happen, which we refrain from doing.

We could, however, sometime in the future implement parallel versions of the instructions, so that if nodes were available, the matrix multiplication could run in parallel on several nodes. But there are two caveats. First, it is not "automatic parallelization" in the sense that the virtual machine code is parallelized; it is simply a matter of exchanging the actual implementations of the instructions for parallel ones. Second, implementing parallel matrix operations in C++ is well beyond the scope of this work. It is an area in which there has been a lot of research, and it should be fairly simple to plug in well-known efficient parallel matrix routines once we get the actual free-node/busy-node communication done.

Matrix Multiplication Algorithm:

The product of an m x n matrix A and an n x k matrix B is the m x k matrix C whose elements are:

C[i, j] = Σ a[i, s] * b[s, j],   s = 1 … n

Procedure MatrixMultiplication
  for i := 1 to m do
    for j := 1 to k do
      C[i, j] := 0
      for s := 1 to n do
        C[i, j] := C[i, j] + a[i, s] * b[s, j]
      end for
    end for
  end for
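
For concreteness, a direct C rendering of this procedure (a minimal sketch: the 0-based indices conventional in C, the flat row-major storage, and the function name are our choices, not part of the algorithm):

/* Sequential matrix multiplication: C (m x k) = A (m x n) * B (n x k).
   Matrices are stored as flat row-major arrays. */
void matmul(const double *A, const double *B, double *C,
            int m, int n, int k)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < k; j++) {
            double sum = 0.0;
            for (int s = 0; s < n; s++)
                sum += A[i * n + s] * B[s * k + j];
            C[i * k + j] = sum;
        }
    }
}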
CREW Matrix Multiplication:

The algorithm uses n^2 processors arranged in a 2D array of size n x n. The overall complexity is O(n).

Procedure CREW_MatrixMultiplication
  for i := 1 to n do in parallel
    for j := 1 to n do in parallel
      C[i, j] := 0
      for k := 1 to n do
        C[i, j] := C[i, j] + a[i, k] * b[k, j]
      end for
    end for
  end for
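
On a real dual-core machine, the n^2 virtual processors of the CREW model collapse onto a few threads, which is what OpenMP's parallel for does. A minimal sketch (square matrices and the flat row-major layout from the sequential version above are assumed):

#include <omp.h>

/* CREW-style parallel matrix multiplication with OpenMP: the (i, j)
   iteration space is shared among threads, so reads of A and B are
   concurrent, but every C[i, j] has exactly one writer. */
void matmul_omp(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int s = 0; s < n; s++)
                sum += A[i * n + s] * B[s * n + j];
            C[i * n + j] = sum;
        }
    }
}

Because each result element has a single writer, no synchronization is needed; this is the loop-level analogue of the CREW guarantee. Compiling with, e.g., gcc -O2 -fopenmp and timing the serial and parallel versions is the experiment the title asks for, and a profiler such as Intel VTune can then attribute the change in performance.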
EREW Matrix Multiplication:

One advantage of the CREW model is that a memory location may be read by several processors at once. In the EREW model, one needs to ensure that every processor reads from a memory location that is not being accessed by any other processor; the skewed index l below guarantees this.

Procedure EREW_MatrixMultiplication
  for i := 1 to n do in parallel
    for j := 1 to n do in parallel
      C[i, j] := 0
      for k := 1 to n do
        l := ((i + j + k) mod n) + 1
        C[i, j] := C[i, j] + a[i, l] * b[l, j]
      end for
    end for
  end for
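
The skew is straightforward to express in C. A sketch of one (i, j) processor's inner product (0-based indices; the surrounding parallel loops over i and j are as in the OpenMP version above): because processor (i, j) starts its sweep at offset (i + j) mod n, no two processors read the same element of A or B in the same step.

/* EREW-style inner product: the offset (i + j + k) % n staggers the
   reads so that, step for step, distinct (i, j) pairs touch distinct
   elements of A and B. */
double dot_skewed(const double *A, const double *B,
                  int i, int j, int n)
{
    double sum = 0.0;
    for (int k = 0; k < n; k++) {
        int l = (i + j + k) % n;   /* skewed index */
        sum += A[i * n + l] * B[l * n + j];
    }
    return sum;
}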
CRCW Matrix Multiplication:

The algorithm uses n^3 processors and runs in O(1) time. When more than one processor attempts to write to the same memory location, the sum of the values is written to that location. Processor (i, j, s) therefore simply writes its product a[i, s] * b[s, j], and the concurrent-write rule performs the summation.

Procedure CRCW_MatrixMultiplication
  for i := 1 to n do in parallel
    for j := 1 to n do in parallel
      for s := 1 to n do in parallel
        C[i, j] := a[i, s] * b[s, j]   (concurrent writes are summed)
      end for
    end for
  end for

Parallel Matrix Multiplication:

I brushed off some old benchmarking code used in my clustering application and decided to see what I could do using today's multi-core hardware. When writing computationally intensive algorithms, we have a number of considerations to evaluate. The best (IMHO) algorithms to parallelize are data-parallel algorithms without loop-carried dependencies.

You may think nothing is special about matrix multiplication, but it actually points out a couple of performance implications of writing CLR applications. I originally wrote seven different implementations of matrix multiplication in C# – that's right, seven.

 Dumb
 Standard
 Single
 Unsafe Single
 Jagged
 Jagged from C++
 Stack Allocated

Dumb: double[N, N], real type: float64[0..., 0...]

The easiest way to do matrix multiplication is with a .NET multidimensional array and i, j, k ordering in the loops. The problems are twofold. First, the i, j, k ordering accesses memory in a cache-unfriendly fashion, pulling in data from scattered locations. Second, it uses a multidimensional array. Yes, the .NET multidimensional array is convenient, but it is very slow. Let's look at the C# and the IL.

C#:

C[i, j] += A[i, k] * B[k, j];

IL of the C#:

ldloc.s   i
ldloc.s   j
call      instance float64& float64[0...,0...]::Address(int32, int32)
dup
ldobj     float64
ldloc.1
ldloc.s   i
ldloc.s   k
call      instance float64 float64[0...,0...]::Get(int32, int32)
ldloc.2
ldloc.s   k
ldloc.s   j
call      instance float64 float64[0...,0...]::Get(int32, int32)
mul
add
stobj     float64

If you notice the ::Address and ::Get parts, these are method calls! Yes, when you use a multidimensional array, you are using a class instance, so every access, assignment, and read incurs the cost of a method call. When you are dealing with an N^3 algorithm, that is N^3 method calls, making this implementation much slower than the other methods.

Standard: double[N, N], real type float64[0..., 0...]

This implementation rearranges the loop ordering to i, k, j in order to optimize memory access to the arrays. No other changes are made from the Dumb implementation. The Standard implementation is the base for all the other multidimensional implementations.
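
The same reordering is easy to try in plain C. A minimal sketch (flat row-major storage assumed): with i, k, j ordering, the innermost loop walks rows of B and C at unit stride, and A[i][k] becomes a loop-invariant scalar.

/* i, k, j loop order: the inner loop streams through rows of B and C
   with unit stride, and a[i][k] is hoisted into a register. */
void matmul_ikj(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            double aik = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += aik * B[k * n + j];
        }
    }
}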

Single: double[N * N], real type float64[]

Instead of creating a multidimensional array, we create a single block of memory. The float64[] type is a block of memory rather than a class. The downside is that we have to calculate all offsets manually.
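
The C analogue is a single contiguous allocation with manually computed offsets, a sketch (names illustrative):

#include <stdlib.h>

/* A single contiguous block: element (i, j) of an N x N matrix lives
   at the manually computed offset i * N + j. */
double *alloc_single(int N)
{
    return calloc((size_t)N * N, sizeof(double));
}

static inline double get(const double *M, int N, int i, int j)
{
    return M[i * N + j];
}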

Unsafe Single, real type float64[]

This method is the same as the single-dimensional array, except that the arrays are pinned with fixed and accessed through pointers in unsafe C#.

Jagged, double[N][N], real type float64[][]

This is the same implementation as Standard, except that we use arrays of arrays instead of a multidimensional array. It takes an extra step to initialize, but it is a series of blocks of raw memory, eliminating the method-call overhead. It is typically about 30% faster than the multidimensional array.

Jagged from C++, double[N][N], real type float64[][]

This one is a bit more involved. When writing these algorithms in C#, we let the JIT compiler optimize for us. The C++ compiler is unfortunately a lot better, but it isn't real-time. I ported the code from the Jagged implementation to C++/CLI and enabled heavy optimization. Once compiled, I disassembled the DLL and converted the IL back to C#. The result is this implementation, which is harder to read but really fast.

Stack Allocated, stackalloc double[N * N], real type float64*

This implementation uses the rarely used stackalloc keyword. It is very problematic in practice, as you may get a StackOverflowException depending on your current stack usage.

Parallel Algorithms for Matrix Multiplication:

The matrix multiplication algorithm called DIMMA (Distribution-Independent Matrix Multiplication Algorithm) targets block-cyclic data distribution on distributed-memory concurrent computers. The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectively, and it exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine in each processor even when the block size is very small or very large. The algorithm has been implemented and compared with SUMMA on the Intel Paragon computer.
A number of parallel formulations of the dense matrix multiplication algorithm have been developed. For an arbitrarily large number of processors, any of these algorithms or their variants can provide near-linear speedup for sufficiently large matrix sizes, and none of the algorithms can be clearly claimed to be superior to the others. In this paper we analyze the performance and scalability of a number of parallel formulations of the matrix multiplication algorithm and predict the conditions under which each formulation is better than the others. We present a parallel formulation for hypercube and related architectures that performs better than any of the schemes described in the literature so far for a wide range of matrix sizes and numbers of processors. The superior performance and the analytical scalability expressions for this algorithm are verified through experiments on the Thinking Machines CM-5 parallel computer for up to 512 processors.

A number of algorithms are currently available for multiplying two matrices A and B to yield the product matrix C = AB on distributed-memory concurrent computers [12, 16]. Two classic algorithms are Cannon's algorithm and Fox's algorithm. They are based on a √P x √P square processor grid with a block data distribution in which each processor holds a large consecutive block of data.

Optimization:

Optimization activities address the performance requirements of the system model. This includes changing algorithms to respond to speed or memory requirements, reducing multiplicities in associations to speed up queries, adding redundant associations for efficiency, rearranging execution orders, adding derived attributes to improve the access time to objects, and opening up the architecture, that is, adding access to lower layers because of performance requirements.

Optimizing the matrix multiplication:

In the past few years, there have been significant developments in the area of distributed and parallel processing. More powerful and new hardware architectures are being produced at a rapid rate, such as distributed-memory MIMD computers, which have provided enormous computing power to software engineers. These multiprocessors may provide a significant speed-up over the serial execution of an algorithm. However, this requires careful partitioning and allocation of data and control to the processor set. Matrix multiplication is a fundamental parallel algorithm which can be effectively executed on a distributed-memory multiprocessor and can show significant improvement in speed-up over serial execution. Ideally, we should be able to achieve a linear speed-up with an increase in the number of processors, but in practice the speed-up is much less, and in fact increasing the number of processors beyond a certain point may degrade the completion time. This degradation is caused by increased communication between modules. Therefore, the optimum speed-up is a function of the number of processors and the communication cost. To find the optimum performance, a user needs to experiment with all the available processors on a multiprocessor.

In this paper, we studied the detailed performance of the parallel matrix multiplication algorithm. The study defines the factors that control the performance of this class of algorithms and shows how to use these factors to optimize the algorithm's execution time. Also, an analytic approach is described which can eliminate a trial-and-error method for determining the size of the processor set.

Memory Hierarchy Optimizations

Blocking:

Blocking is a common divide-and-conquer technique for using the memory hierarchy effectively. Since the cache may only be large enough to hold a small piece of one matrix, data is often evicted from the cache before it is reused, and the processor is continually forced to access slower levels of memory, decreasing the algorithm's performance. With blocking, each matrix is divided into blocks of smaller matrices, and the algorithm multiplies two submatrices and accumulates their product before moving on to the next two submatrices. This better exploits cache locality, so that data in the cache can be reused before being replaced.
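
As an illustration, a minimal blocked sketch in C (the flat row-major layout, the function name, and the block size are assumptions; BS must be tuned to the cache, and n is assumed to be a multiple of BS for brevity):

#define BS 64   /* block size; tune so three BS x BS tiles fit in cache */

/* Blocked matrix multiplication: multiply one pair of BS x BS tiles at
   a time so each tile stays cache-resident while it is reused.
   Assumes n is a multiple of BS and C is zero-initialized. */
void matmul_blocked(const double *A, const double *B, double *C, int n)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}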

Copy Optimization:

Copy optimization can help decrease the number of conflict cache misses. As mentioned above, conflict cache misses occur when multiple data items are mapped to the same location in the cache; with blocking, this means that cached data may be prematurely evicted. Conflict misses can cause severe performance degradation when an array is accessed with a constant, non-unit stride. In the provided matrix-matrix multiplication implementation, the matrix A is accessed in this way (specifically, we access A in strides of 'lda'). This is a result of the way the matrix is stored. The matrices are stored in a one-dimensional array, with the first column occupying the first M entries in the array, where the matrix is M x M. The second column is stored in the next M entries, and so on. Thus, consecutive elements in a matrix row are M entries apart in the array, and our matrix multiplication routine is forced to access the matrix A with an M-unit stride. To improve upon this, we re-order the matrix A so that row elements are stored in consecutive entries in the array (i.e., the first row is stored in the first M entries of the array, the second row in the next M entries, and so on). This re-ordering is sometimes called copy optimization. Now both A and B are accessed with unit strides.
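
A sketch of the copy step in C (the column-major layout and the leading dimension 'lda' follow the description above; the destination buffer and function name are illustrative). The copy costs O(M^2) work against the O(M^3) flops of the multiply, so it amortizes quickly:

/* Copy optimization: A is stored column-major (stride-M row access),
   so copy it once into row-major order. Afterwards the multiply reads
   both the copy of A and B with unit stride. */
void copy_transpose(const double *A, double *A_rowmajor, int M, int lda)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            A_rowmajor[i * M + j] = A[i + j * lda];  /* column-major read */
}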

Inner Loop Optimizations

Optimizations should focus on the places in the code where the most time is spent. In our matrix-matrix multiplication implementation, this is the innermost loop.

Minimized Branches and Avoidance of Magnitude Compares

According to [1], C 'do-while' loops tend to perform better than C 'for' loops because compilers tend to produce unnecessary loop-head branches for 'for' loops. Furthermore, as also noted in [1], it is often cheaper to do equality or inequality tests in loop conditions than magnitude-comparison tests. Thus, we translated the innermost 'for' loop into a 'do-while' loop and used pointer inequality rather than magnitude comparison to test for loop termination. The code below exemplifies this technique.

The original code looks something like this:

for (k = 0; k < BLOCK_SIZE; k++) { ... }

and is translated into something like this:

end = &B_j[BLOCK_SIZE];
do {
    ...
} while (B_j != end);

Explicit Loop Unrolling

Although the compiler option '-funroll-all-loops' is used in the provided Makefile, we decided to see if unrolling the innermost loop by hand would improve upon the compiler's optimization. According to [1], explicitly unrolling loops can increase opportunities for other optimizations. The graph below shows the performance of the matrix-matrix multiply routine with the innermost loop manually unrolled 2, 3, 4, and 6 times.
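
For reference, a sketch of a 4-way manual unroll of an inner product loop in C (the function is illustrative, not the exact routine benchmarked; BLOCK_SIZE is assumed to be a multiple of 4, otherwise a cleanup loop is needed):

/* Innermost loop unrolled 4 times: four independent multiply-adds per
   iteration reduce loop overhead and give the compiler more
   scheduling freedom. Assumes BLOCK_SIZE is a multiple of 4. */
double dot_unrolled(const double *a, const double *b, int BLOCK_SIZE)
{
    double sum = 0.0;
    for (int k = 0; k < BLOCK_SIZE; k += 4) {
        sum += a[k]     * b[k];
        sum += a[k + 1] * b[k + 1];
        sum += a[k + 2] * b[k + 2];
        sum += a[k + 3] * b[k + 3];
    }
    return sum;
}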
Optimizing the Parallel Matrix Multiplication:

Parallel matrix multiplication can help reduce the resource requirements for both memory and computation. A unique feature of our technique is its formulation of linear recurrences as matrix computations before exploiting their mathematical properties for more compact representations. Based on a general notion of closure for matrix multiplication, we present two classes of matrices that have compact representations: permutation matrices and matrices whose elements are linearly related to each other. To validate the proposed method, we experiment with solving recurrences whose matrices have compact representations using CUDA on an NVIDIA GeForce 8800 GTX GPU. The advantages of our technique are that it enables the computation of larger recurrences in parallel and that it provides good speed-ups of up to eleven times over the unoptimized parallel computations. Also, the memory usage can be as much as nine times lower than that of the unoptimized parallel computations. Our results confirm a promising approach for the adoption of more advanced parallelization techniques.

References:

 http://www.cs.wisc.edu/arch/www/people.html
 http://www-unix.mcs.anl.gov/dbpp/text/node45.html
 http://www.codeproject.com/useritems/System_Design.asp
 http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/trackback/
 http://www.roseindia.net/.../Java...Optimizing-Parallel.../Retrieval.html
 http://www.informaworld.com/smpp/content~content=a772397562
 http://www.sciencedirect.com/science
