Advanced Computer Architecture

USE OPENMP OR ANY OTHER APIS TO PARALLELIZE A MATRIX
MULTIPLICATION OF 5000 BY 5000 IN C/C++ FOR A DUAL-CORE
MACHINE. USE INTEL TOOLS TO ANALYZE THE CHANGE IN
PERFORMANCE. (FOR MORE THAN ONE STUDENT TOGETHER)
Contents:
Acknowledgement
Computer Architecture
Matrix Multiplication
CREW Matrix Multiplication
EREW Matrix Multiplication
Parallel Matrix Multiplication:
    Dumb
    Standard
    Single
    Unsafe Single
    Jagged
    Jagged from C++
    Stack Allocated
Parallel Algorithms for Matrix Multiplication
Optimizing the Matrix Multiplication
Optimizing the Parallel Matrix Multiplication
References
ACKNOWLEDGEMENT
The successful completion of any task is incomplete without mentioning the people who made it
possible. It is therefore with gratitude that I acknowledge the help which crowned my efforts with
success.
Life is a process of accumulating and discharging debts, not all of which can be measured. I cannot
hope to discharge them with simple words of thanks, but I can certainly acknowledge them.
I owe my gratitude to Ms. Pency, Lect., LSM, for her constant guidance and support.
I would also like to thank the various department officials and staff who not only provided me with
the required opportunity but also extended their valuable time; I have no words to express my
gratitude to them.
Last but not least, I am deeply indebted to my family and friends for their warm encouragement and
moral support during this project work.
NAKUL KUMAR
Computer Architecture:
Computer architecture, or digital computer organization, is the conceptual design and fundamental
operational structure of a computer system. It is a blueprint and functional description of requirements
and design implementations for the various parts of a computer, focusing largely on the way in which
the central processing unit (CPU) operates internally and accesses addresses in memory.
Computer architecture comprises at least three main subcategories:
Instruction set architecture: The instruction set architecture (ISA) is the abstract image of a computing
system as seen by a machine-language programmer, including the instruction set, word size, memory
addressing modes, processor registers, and address and data formats.
Microarchitecture: Microarchitecture, also known as computer organization, is a lower-level, more
concrete and detailed description of the system that covers how the constituent parts of the system are
interconnected and how they interoperate in order to implement the ISA.
System design: System design includes all of the other hardware components within a computing
system, such as system interconnects (e.g., buses and switches), memory controllers and hierarchies,
and CPU off-load mechanisms such as direct memory access (DMA).
Matrix Multiplication:
Matrix-matrix multiplication is a fundamental kernel, one which can achieve high efficiency in both
theory and practice. First, some caveats and assumptions:
- This material is for dense matrices, ones in which there are few zeros and so the matrix is
  efficiently stored in a 2D array.
- Distinguish between a matrix and an array. The first is a mathematical object, a rectangular
  arrangement of numbers usually indexed by an integer pair (i, j) [that starts indexing from 1,
  BTW]. The second is a computer data structure, which can be used to hold a matrix, and it
  might be indexed starting from 0, 1, or anything convenient.
- A load-store analysis shows that the memory-reference-to-flop ratio for matrix-matrix multiply
  is O(1/n), and hence it should be implementable with near-peak performance on a cache-based
  serial computer.
- The BLAS function for matrix-matrix multiply is dgemm, which is faster to type.
- There are "reduced order" algorithms (Strassen, Winograd) which use extra memory but
  compute the product in fewer than 2n^3 flops; the exponent comes down to around 2.7. Only
  the standard algorithm is considered here, because the reduced-order techniques can always be
  applied within a single process in the parallel versions. Also, the basic idea that BLAS
  matrix-matrix multiply has a memory-reference-to-flop ratio that goes to zero as n increases
  still holds.
- Few modern applications really need matrix-matrix multiplication with dense matrices. It is
  more of a toy, and more of a diagnostic for a system than a useful kernel: if 85% of the
  theoretical peak performance cannot be achieved on a machine, then the machine is flawed in
  some way: OS, compiler, or hardware.
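For reference, here is the standard algorithm as a minimal C sketch; the row-major flat-array layout
and the name matmul_naive are our choices, not fixed by the discussion above.

#include <stddef.h>

/* Standard triple-loop matrix-matrix multiply, C = C + A*B.
 * All matrices are n-by-n and stored row-major in flat arrays.
 * This is the 2n^3-flop algorithm discussed above; it makes no
 * attempt to exploit the cache. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double cij = C[i*n + j];
            for (size_t k = 0; k < n; k++)
                cij += A[i*n + k] * B[k*n + j];
            C[i*n + j] = cij;
        }
}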
The benchmark program is simply a repeated matrix multiplication of two different 100x100 matrices
with a third matrix. The multiplication series is repeated 100 times.
That looks more like a reasonable result from Matrix Lab. The matrix multiplication I implemented is
a quick hack, and it is far from fast. Matrix Lab obviously has efficient multiplication routines
implemented, so even though its virtual machine, or interpreter, is slow as molasses, Matrix Lab is
twice as fast as TONS on one CPU.
We scale almost linearly in performance as the extra CPU is taken into use. This is because the amount
of work done in the two independent loops (which, of course, the loop is transformed into as we add
the second CPU) is the same: neither node server waits for the other.
If we added a third CPU, it would never be taken into use. This code simply does not parallelize onto
three or more CPUs with the current state of the TONS parallelizer. I do not see an easy way of
parallelizing this program any further at the virtual-machine opcode level without changing the order
in which things happen, which we refrain from doing.
We could, however, sometime in the future implement parallel versions of the instructions, so that if
nodes were available, the matrix multiplication could run in parallel on several nodes. But there are
two caveats. First, it is not "automatic parallelization" in the sense that the virtual machine code is
parallelized; it is simply a matter of exchanging the actual implementations of the instructions for
parallel ones. Second, implementing parallel matrix operations in C++ is well beyond the scope of this
work. It is an area in which there has been a lot of research, and it should be fairly simple to plug in
well-known efficient parallel matrix routines once the free-node/busy-node communication is done.
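For the assignment itself, OpenMP makes this kind of loop-level parallelism almost mechanical. Below
is a minimal C sketch: the function name, the initialization values, and schedule(static) are our
illustrative choices. The outer loop's 5000 iterations are split between the two cores, and because the
iterations are independent, neither core waits on the other.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 5000  /* the 5000-by-5000 case from the assignment */

/* Parallel C = A*B: each row of C is computed independently, so the
 * outer loop parallelizes with no loop-carried dependence. */
void matmul_omp(const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double cij = 0.0;
            for (int k = 0; k < N; k++)
                cij += A[(size_t)i*N + k] * B[(size_t)k*N + j];
            C[(size_t)i*N + j] = cij;
        }
}

int main(void)
{
    double *A = malloc((size_t)N*N * sizeof *A);
    double *B = malloc((size_t)N*N * sizeof *B);
    double *C = malloc((size_t)N*N * sizeof *C);
    if (!A || !B || !C) return 1;
    for (size_t i = 0; i < (size_t)N*N; i++) { A[i] = 1.0; B[i] = 2.0; }

    omp_set_num_threads(2);              /* dual-core machine */
    double t0 = omp_get_wtime();
    matmul_omp(A, B, C);
    printf("elapsed: %.2f s\n", omp_get_wtime() - t0);

    free(A); free(B); free(C);
    return 0;
}

Built with, for example, gcc -O2 -fopenmp, the same binary can be run with one and then two threads,
and the two runs compared under Intel VTune to analyze the change in performance, as the assignment
asks.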
Parallel Matrix Multiplication:
I brushed off some old benchmarking code used in my clustering application and decided to see what I
could do using today's multi-core hardware. When writing computationally intensive algorithms, we
have a number of considerations to weigh. The best (IMHO) algorithms to parallelize are data-parallel
algorithms without loop-carried dependencies.
You may think nothing is special about matrix multiplication, but it actually points out a couple of
performance implications of writing CLR applications. I originally wrote seven different
implementations of matrix multiplication in C# – that's right, seven.
Dumb
Standard
Single
Unsafe Single
Jagged
Jagged from C++
Stack Allocated
The easiest way to do matrix multiplication is with a .NET multidimensional array and i, j, k ordering
in the loops. The problems are twofold. First, the i, j, k ordering accesses memory in a scattered
fashion, pulling in data from widely separated locations. Second, it uses a multidimensional array. Yes,
the .NET multidimensional array is convenient, but it is very slow. Let's look at the C# and the IL it
compiles to.
C#:
IL of C#:
ldloc.s  i
ldloc.s  j
call     instance float64& float64[0...,0...]::Address(int32, int32)
dup
ldobj    float64
ldloc.1
ldloc.s  i
ldloc.s  k
call     instance float64 float64[0...,0...]::Get(int32, int32)
ldloc.2
ldloc.s  k
ldloc.s  j
call     instance float64 float64[0...,0...]::Get(int32, int32)
mul
add
stobj    float64
If you notice the ::Address and ::Get parts, these are method calls! Yes, when you use a
multidimensional array, you are using a class instance, so every access, assignment, and read incurs
the cost of a method call. When you are dealing with an N^3 algorithm, that means N^3 method calls,
making this implementation much slower than the other methods.
Instead of creating a multidimensional array, we create a single block of memory. The float64[] type is
a block of memory instead of a class. The downside here is that we have to calculate all offsets
manually.
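The same idea, sketched in C for reference (the helper names are ours): one contiguous allocation,
with element (i, j) located by manual index arithmetic instead of a per-access method call.

#include <stdlib.h>

/* One contiguous block instead of a 2D object; element (i, j) of an
 * n-by-n row-major matrix lives at offset i*n + j. */
double *alloc_matrix(size_t n)
{
    return calloc(n * n, sizeof(double));   /* zero-initialized */
}

/* The manual offset calculation: the price of giving up the 2D type. */
static inline double get(const double *M, size_t n, size_t i, size_t j)
{
    return M[i*n + j];
}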
This method is the same as the single-dimensional array, except that the arrays are pinned with the
fixed statement and accessed through pointers in unsafe C#.
This is the same implementation as Standard, except that we use arrays of arrays instead of a
multidimensional array. It takes an extra step to initialize, but each row is a raw block of memory,
eliminating the method-call overhead. It is typically 30% faster than the multidimensional array.
This is a bit more difficult. When writing these algorithms, we let the JIT compiler optimize for us.
The C++ compiler's optimizer is unfortunately a lot better, but it does not run just-in-time. I ported the
code from the jagged implementation to C++/CLI and enabled heavy optimization. Once compiled, I
disassembled the DLL and converted the IL back to C#. The result is this implementation, which is
harder to read but really fast.
This implementation utilizes the rarely used stackalloc keyword. Using it is quite problematic, as you
may get a StackOverflowException depending on your current stack usage.
Parallel Algorithms for Matrix Multiplication:
A number of algorithms are currently available for multiplying two matrices A and B to yield the
product matrix C = AB on distributed-memory concurrent computers [12, 16]. Two classic algorithms
are Cannon's algorithm and Fox's algorithm. They are based on a √P x √P square processor grid (P
processors in all) with a block data distribution in which each processor holds a large consecutive
block of data.
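A sketch of Cannon's algorithm in C with MPI may make the block-shifting concrete. It assumes the
communicator grid was created with MPI_Cart_create over a periodic s-by-s grid (s = √P), that each
rank holds one nb-by-nb row-major block of A, B, and C, and that C starts zeroed; the function names
are ours.

#include <mpi.h>

/* Local block multiply-accumulate: C += A*B on nb-by-nb row-major blocks. */
static void block_mac(int nb, const double *A, const double *B, double *C)
{
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i*nb + j] += A[i*nb + k] * B[k*nb + j];
}

/* Cannon's algorithm on a periodic s-by-s process grid. */
void cannon_multiply(MPI_Comm grid, int s, int nb,
                     double *A, double *B, double *C)
{
    int rank, coords[2], src, dst;
    MPI_Status st;
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* Initial alignment: shift block row i of A left by i and block
     * column j of B up by j (dimension 0 = rows, 1 = columns). */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);

    /* s multiply steps, shifting A one block left and B one block up
     * between steps; the periodic grid provides the wraparound. */
    for (int step = 0; step < s; step++) {
        block_mac(nb, A, B, C);
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
    }
}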
Optimization:
Optimization activities address the performance requirements of the system model. This includes
changing algorithms to meet speed or memory requirements, reducing multiplicities in associations to
speed up queries, adding redundant associations for efficiency, rearranging execution orders, adding
derived attributes to improve access time to objects, and opening up the architecture, that is, adding
access to lower layers because of performance requirements.
Blocking:
Blocking is a common divide-and-conquer technique for using the memory hierarchy effectively. Since
the cache may only be large enough to hold a small piece of one matrix, data is kicked out of the cache
before it can be reused, and the processor is continually forced to access slower levels of memory,
hurting the algorithm's performance. With blocking, however, each matrix is divided into smaller
submatrix blocks, and the algorithm multiplies two submatrices, storing their product, before moving
on to the next two submatrices. This better exploits cache locality, so data in the cache is reused before
being replaced.
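A sketch of blocking in C (BLOCK_SIZE is illustrative and would be tuned so that three blocks fit in
cache together; the function name is ours):

#include <stddef.h>

#define BLOCK_SIZE 64   /* illustrative; tune to the cache size */

static inline int imin(int a, int b) { return a < b ? a : b; }

/* Blocked C += A*B for n-by-n row-major matrices: each pair of
 * submatrices is multiplied to completion before moving on, so a
 * block is reused from cache instead of refetched from memory. */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BLOCK_SIZE)
        for (int kk = 0; kk < n; kk += BLOCK_SIZE)
            for (int jj = 0; jj < n; jj += BLOCK_SIZE)
                /* multiply the (ii,kk) block of A by the (kk,jj) block of B */
                for (int i = ii; i < imin(ii + BLOCK_SIZE, n); i++)
                    for (int k = kk; k < imin(kk + BLOCK_SIZE, n); k++) {
                        double aik = A[(size_t)i*n + k];
                        for (int j = jj; j < imin(jj + BLOCK_SIZE, n); j++)
                            C[(size_t)i*n + j] += aik * B[(size_t)k*n + j];
                    }
}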
Copy Optimization:
Copy optimization can help decrease the number of conflict cache misses. As mentioned above,
conflict cache misses occur when multiple data items are mapped to the same location in the cache;
with blocking, this means that cached data may be prematurely kicked out of the cache. Conflict
misses can cause severe performance degradation when an array is accessed with a constant, non-unit
stride. In the provided matrix-matrix multiplication implementation, the matrix A is accessed in this
way (specifically, we access A in strides of 'lda'). This is a result of the way the matrix is stored. The
matrices are stored in a one-dimensional array, with the first column occupying the first M entries in
the array, where the matrix is MxM. The second column is stored in the next M entries, and so on.
Thus, consecutive elements in a matrix row are M entries apart in the array, and our matrix
multiplication routine is forced to access the matrix A with an M-unit stride. In order to improve upon
this, we re-order the matrix A so that row elements are stored in consecutive entries in the array (i.e.,
the first row is stored in the first M entries of the array, the second row in the next M entries, and so
on). This re-ordering is sometimes called copy optimization. Now both A and B are accessed in unit
strides.
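In C, the re-ordering is one packing pass over A before the multiply; a sketch under the column-major
layout described above (lda equals M here, and the names are ours):

#include <stdlib.h>

/* Copy optimization: repack the column-major M-by-M matrix A into a
 * buffer whose rows are contiguous, so the multiply can then walk
 * both the packed A and B in unit stride. */
double *pack_rows(int M, const double *A)   /* A is column-major, lda == M */
{
    double *At = malloc((size_t)M * M * sizeof *At);
    if (!At) return NULL;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            At[(size_t)i*M + j] = A[(size_t)j*M + i];  /* gather row i */
    return At;
}

The packing costs O(M^2) work, which is quickly amortized against the O(M^3) multiply.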
According to [1], C 'do-while' loops tend to perform better than C 'for' loops because compilers tend
to produce unnecessary loop-head branches for 'for' loops. Furthermore, as also noted in [1], it is often
cheaper to do equality or inequality tests in loop conditions than magnitude-comparison tests. Thus, we
translated the innermost 'for' loop into a 'do-while' loop and used pointer inequality rather than a
magnitude comparison to test for loop termination. The code below exemplifies this technique.
end = &B_j[BLOCK_SIZE];   /* one past the last element of the block */
do {
    ...                   /* multiply-accumulate loop body */
} while (B_j != end);     /* pointer inequality, not a magnitude test */
Although the compiler option '-funroll-all-loops' is used in the Makefile provided, we decided to see
whether unrolling the innermost loop by hand would improve upon the compiler's optimization.
According to [1], explicitly unrolling loops can increase opportunities for other optimizations. We
measured the performance of the matrix-matrix multiply routine with the innermost loop manually
unrolled 2, 3, 4, and 6 times.
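For example, here is the inner dot-product loop unrolled four times, written in the same pointer
do-while style; a sketch with our names, assuming the trip count is a positive multiple of 4 so no
clean-up loop is needed:

/* Four multiply-adds per iteration; len must be a positive multiple of 4. */
static double dot_unrolled4(const double *A_i, const double *B_j, int len)
{
    const double *end = &B_j[len];
    double cij = 0.0;
    do {
        cij += A_i[0] * B_j[0];
        cij += A_i[1] * B_j[1];
        cij += A_i[2] * B_j[2];
        cij += A_i[3] * B_j[3];
        A_i += 4;
        B_j += 4;
    } while (B_j != end);   /* inequality test, as recommended in [1] */
    return cij;
}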
Optimizing the Parallel Matrix Multiplication:
The parallel matrix multiplication method can help reduce the resource requirements for both memory
and computation. A unique feature of our technique is its formulation of linear recurrences as matrix
computations before exploiting their mathematical properties for more compact representations. Based
on a general notion of closure for matrix multiplication, we present two classes of matrices that have
compact representations: permutation matrices, and matrices whose elements are linearly related to
each other. To validate the proposed method, we experimented with solving recurrences whose
matrices have compact representations using CUDA on an NVIDIA GeForce 8800 GTX GPU. The
advantages of our technique are that it enables the computation of larger recurrences in parallel and
that it provides good speedups, of up to eleven times, over the unoptimized parallel computations.
Also, the memory usage can be as much as nine times lower than that of the unoptimized parallel
computations. Our results confirm a promising approach for the adoption of more advanced
parallelization techniques.
There have been significant developments in the area of distributed and parallel processing. Powerful
new hardware architectures, such as distributed-memory MIMD computers, are being produced at a
rapid rate and have provided enormous computing power to software engineers. These multiprocessors
may provide a significant speed-up over the serial execution of an algorithm; however, this requires
careful partitioning and allocation of data and control to the processor set. Matrix multiplication is a
fundamental parallel algorithm which can be executed effectively on a distributed-memory
multiprocessor and can show significant improvement in speed-up over serial execution. Ideally, we
should be able to achieve a linear speed-up as the number of processors increases, but in practice the
speed-up is much less, and in fact increasing the number of processors beyond a certain point may
degrade the completion time. This degradation is caused by increased communication between
modules.
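The usual way to quantify this, as a sketch in standard notation (the communication term t_comm(p)
is schematic, not a measured model):

    S(p) = T_1 / T_p                 (speed-up on p processors)
    T_p  ~ T_1 / p + t_comm(p)       (compute share plus communication)

Ideal speed-up is S(p) = p. Once t_comm(p) grows faster than T_1/p shrinks, adding processors
lengthens the completion time, which is exactly the degradation described above.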
References:
http://www.cs.wisc.edu/arch/www/people.html
http://www-unix.mcs.anl.gov/dbpp/text/node45.html
http://www.codeproject.com/useritems/System_Design.asp
http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/trackback/
http://www.roseindia.net/.../Java...Optimizing-Parallel.../Retrieval.html
http://www.informaworld.com/smpp/content~content=a772397562
http://www.sciencedirect.com/science