DC Assignment
The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the
LAPACK subroutine _GEBRD, which is rich in a mixture of vector, matrix-vector, and matrix operations, are simulated on the
Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively
perform vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP
is used as a metric to evaluate the performance of the Trident processor. Our results show that increasing the number of
Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K × 32K matrix with 128 Trident lanes,
using matrix-vector operations in the standard Golub and Kahan algorithm gives a speedup of around 1.5 times over using vector
operations. However, using matrix operations in the _GEBRD subroutine gives a speedup of around 3 times over vector operations,
and 2 times over using matrix-vector operations in the standard Golub and Kahan algorithm.
The reduction to bidiagonal form is an important first step in the computation of the extremely useful singular value
decomposition of matrices. There are several algorithms for bidiagonalizing a given matrix. Among them, the classic Golub and
Kahan Householder bidiagonalization algorithm is the most popular. This algorithm reduces the input matrix to bidiagonal form
by applying Householder transformations alternately to columns and rows. This technique is rich in the Level 2
BLAS [2] operations of matrix-vector multiplication and outer-product update. For a large input matrix, only one addition and
one multiplication are performed per matrix element. This means that the total execution time can be dominated by the amount
of memory traffic rather than by the floating-point operations (FLOPs) involved.
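As a concrete illustration of the column/row reflection pattern described above, the Golub and Kahan reduction can be sketched in NumPy. This is a teaching sketch, not the paper's implementation; the function names are ours.

```python
import numpy as np

def householder(x):
    """Return a unit vector v and scalar beta so that
    (I - beta * v v^T) x is a multiple of e1."""
    v = x.astype(float).copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:          # x already zero: nothing to reflect
        return v, 0.0
    return v / norm_v, 2.0

def golub_kahan_bidiag(A):
    """Reduce A (m x n, m >= n) to upper bidiagonal form by alternately
    applying Householder reflections to columns and rows."""
    A = A.astype(float).copy()
    m, n = A.shape
    for k in range(n):
        # left reflection zeroes A[k+1:, k] below the diagonal
        v, beta = householder(A[k:, k])
        A[k:, k:] -= beta * np.outer(v, v @ A[k:, k:])
        if k < n - 2:
            # right reflection zeroes A[k, k+2:] beyond the superdiagonal
            v, beta = householder(A[k, k + 1:])
            A[k:, k + 1:] -= beta * np.outer(A[k:, k + 1:] @ v, v)
    return A
```

Each step is a matrix-vector product followed by a rank-1 outer-product update, which is exactly the Level 2 BLAS mix the text describes; the singular values of the bidiagonal result match those of the input.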
The movement of data between memory and registers can be as costly as the arithmetic operations performed on that data.
LAPACK algorithms are arranged to reuse data so that many FLOPs are performed for every transfer of data from main memory
[3]. In the LAPACK routine _GEBRD, which reduces a matrix to bidiagonal form, roughly half the operations are in the
Level 3 BLAS [4, 5] of matrix-matrix multiplications, and the other half are in the Level 2 and Level 1 BLAS of
matrix-vector multiplication, scalar-vector multiplication, dot product, and SAXPY. The _GEBRD routine is based on the block
Householder representation, which applies clusters of Householder transformations using Level 3 BLAS.
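The aggregation behind the block Householder representation can be sketched with the compact WY form, in which a product of k reflectors H_1 H_2 ... H_k is accumulated as I - V T V^T with T upper triangular, so that applying all k reflectors costs two matrix-matrix products. The helper name below is ours; this is a sketch of the blocking idea, not the _GEBRD source.

```python
import numpy as np

def accumulate_wy(vs, betas):
    """Build compact WY factors (V, T), T upper triangular, such that
    H_1 H_2 ... H_k = I - V T V^T, where H_j = I - betas[j] * v v^T
    and vs[j] is a unit vector."""
    m, k = vs[0].shape[0], len(vs)
    V, T = np.zeros((m, k)), np.zeros((k, k))
    for j, (v, beta) in enumerate(zip(vs, betas)):
        V[:, j] = v
        # new column couples the new reflector to the accumulated block
        T[:j, j] = -beta * (T[:j, :j] @ (V[:, :j].T @ v))
        T[j, j] = beta
    return V, T
```

Applying the whole cluster to a trailing submatrix A is then A - V @ (T @ (V.T @ A)), i.e. Level 3 BLAS work instead of k separate rank-1 updates.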
This paper describes the implementation and evaluation of the standard and the blocked algorithms for a matrix bidiagonalization
using Householder transformations on the Trident processor.
The Trident processor [6, 7] has three levels of instruction set architecture (scalar, vector, and matrix) that express
parallelism to the hardware directly, instead of extracting it dynamically with complicated logic (superscalar
architectures [8]) or statically with compilers (VLIW architectures [9]). Like vector architectures [10, 11], the Trident
processor extends a scalar core with parallel lanes; each lane contains a set of vector pipelines and a slice of the register
file. However, the Trident processor can effectively process not only vector but also matrix data on the parallel lanes.
High-level vector and matrix instructions provide a higher-level interface for programming and for expressing parallelism to
the hardware, which leads to high performance, a simple programming model, and compact executable code, and avoids
unnecessary memory references.
Trident processor architecture
Vector/matrix unit
The Trident processor extends a scalar core with a vector/matrix unit, shown in Figure 1, to process vector and matrix data.
The vector/matrix unit consists of P parallel lanes. Each lane contains an execution datapath in the form of vector pipelines
and a slice of the ring and communication register file. The execution datapath is capable of executing fundamental
arithmetic/logic operations on a 1-D array of data at the rate of one element per clock cycle. Each datapath receives
identical control but different input elements in each clock cycle. Besides the execution datapath, each lane has a set of
ring registers, based on local communication, to store and cyclically shift 1-D data. Each ring register has only one port
for reading and another for writing 1-D data. While a 1-D cyclical shift is sufficient for many element-wise vector and
matrix operations, there are others, such as reduction operations, that require a full connection between all parallel lanes.
For these, the vector/matrix unit has a set of communication registers to store and cyclically shift 1-D data across the
parallel lanes.
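To see why reductions need cross-lane data movement, consider a sum reduction composed purely from cyclic shifts of the kind the communication registers provide. The sketch below is an illustrative software model of P lanes, not the Trident microarchitecture itself.

```python
def ring_sum_reduce(lane_values):
    """Simulate a sum reduction across P lanes using only one-lane cyclic
    shifts of a circulating message vector. After P - 1 shift-and-add
    steps, every lane holds the full sum."""
    P = len(lane_values)
    acc = list(lane_values)     # running partial sum held in each lane
    msg = list(lane_values)     # value currently circulating
    for _ in range(P - 1):
        # each lane passes its message to the next lane (cyclic shift)
        msg = [msg[(i - 1) % P] for i in range(P)]
        acc = [a + m for a, m in zip(acc, msg)]
    return acc
```

For example, `ring_sum_reduce([1, 2, 3, 4])` returns `[10, 10, 10, 10]`: the reduction completes in P - 1 steps using only nearest-neighbour communication.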
2. A dual-time vector clock based synchronization mechanism for key-value data in the
SILENUS file system
Abstract—The SILENUS federated file system was developed by the SORCER research group at Texas Tech University. The
federated file system, with its dynamic nature, does not require any configuration by end users or system administrators. The
SILENUS file system provides support for disconnected operation, and supporting disconnected operation requires a suitable
synchronization mechanism. This mechanism must detect and order events properly. It must also detect possible conflicts and
resolve them in a consistent manner. This paper describes the new synchronization mechanism needed for providing data
consistency. It introduces dual-time vector clocks to order events and detect conflicts. A conflict resolution algorithm is
defined that does not require user interaction. The paper also introduces the switchback problem and shows how it can be
avoided. The synchronization mechanisms presented in this paper can be adapted to synchronize any key-value based data in any
distributed system.
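The ordering and conflict-detection core the abstract relies on can be sketched with a classic (single-time) vector clock; the paper's dual-time variant adds a second timestamp on top of this standard mechanism, which is all the sketch below shows.

```python
class VectorClock:
    """Minimal classic vector clock: tick on local events, merge on
    message receipt, and compare to order events or flag conflicts."""

    def __init__(self, clock=None):
        self.clock = dict(clock or {})

    def tick(self, node):
        """Record a local event at the given node."""
        self.clock[node] = self.clock.get(node, 0) + 1

    def merge(self, other):
        """Take the component-wise maximum on message receipt."""
        for n in set(self.clock) | set(other.clock):
            self.clock[n] = max(self.clock.get(n, 0), other.clock.get(n, 0))

    def happened_before(self, other):
        """True if this event causally precedes the other."""
        nodes = set(self.clock) | set(other.clock)
        return (all(self.clock.get(n, 0) <= other.clock.get(n, 0) for n in nodes)
                and self.clock != other.clock)

    def concurrent(self, other):
        """Concurrent updates are the conflicts a synchronizer must resolve."""
        return not self.happened_before(other) and not other.happened_before(self)
```

Two disconnected replicas that each tick their own entry produce concurrent clocks, which is exactly the conflict case the synchronization mechanism must detect and resolve.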