
DC Assignment


New trends in Synchronization and related issues in Distributed Systems

1. Matrix Bidiagonalization on the Trident Processor

The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the
LAPACK subroutine _GEBRD, which is rich in a mixture of vector, matrix-vector, and matrix operations, are simulated on the
Trident processor. We show how to use the Trident parallel execution units and the ring and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP is used as a metric to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K × 32K matrix with 128 Trident lanes, using matrix-vector operations in the standard Golub and Kahan algorithm gives a speedup of around 1.5 times over using vector operations. However, using matrix operations in the _GEBRD subroutine gives a speedup of around 3 times over vector operations, and 2 times over using matrix-vector operations in the standard Golub and Kahan algorithm.

The reduction to bidiagonal form is an important first step in the computation of the extremely useful singular value decomposition of matrices. There are several algorithms for bidiagonalizing a given matrix. Among them, the classic Golub and Kahan Householder bidiagonalization algorithm is the most popular. This algorithm reduces the input matrix to bidiagonal form by repeatedly applying Householder transformations to columns and rows alternately. This technique is rich in the Level 2 BLAS [2] operations of matrix-vector multiplication and outer-product updates. For a large input matrix, only one add and one multiply are performed per matrix element, so the total execution time can be dominated by the amount of memory traffic rather than by the floating-point operations (FLOPs) involved. The movement of data between memory and registers can be as costly as the arithmetic operations on the data.
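
To make the structure of the classic algorithm concrete, here is a minimal NumPy sketch of the unblocked Golub and Kahan reduction; the function names are our own, and numerical refinements such as scaling to avoid overflow are omitted:

```python
import numpy as np

def householder(x):
    # Build v, beta so that (I - beta * v v^T) x = -+||x|| * e1.
    v = np.array(x, dtype=float)
    sigma = np.linalg.norm(v)
    if sigma == 0.0:
        return v, 0.0
    v[0] += sigma if v[0] >= 0 else -sigma   # sign chosen to avoid cancellation
    return v, 2.0 / np.dot(v, v)

def golub_kahan_bidiag(A):
    # Alternately eliminate below the diagonal (columns) and right of the
    # superdiagonal (rows); each step is a matrix-vector product plus a
    # rank-1 update, i.e. Level 2 BLAS.
    B = np.array(A, dtype=float)
    m, n = B.shape
    for k in range(n):
        v, beta = householder(B[k:, k])               # zero B[k+1:, k]
        B[k:, k:] -= beta * np.outer(v, v @ B[k:, k:])
        if k < n - 2:
            w, gamma = householder(B[k, k+1:])        # zero B[k, k+2:]
            B[k:, k+1:] -= gamma * np.outer(B[k:, k+1:] @ w, w)
    return B
```

Up to the signs of the reflections, the diagonal and superdiagonal of the returned matrix carry the bidiagonal form; all other entries are reduced to roundoff-level noise.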
LAPACK algorithms are arranged to reuse data so that many FLOPs are performed for every transfer of data from main memory [3]. In the LAPACK routine _GEBRD, which is used to reduce a matrix to bidiagonal form, half the operations are in the Level 3 BLAS [4, 5] of matrix-matrix multiplications, and roughly the other half are in the Level 2 and Level 1 BLAS of matrix-vector multiplication, scalar-vector multiplication, dot product, and SAXPY. The _GEBRD routine is based on the blocked Householder representation, which applies clusters of Householder transformations using Level 3 BLAS.
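
For comparison, the blocked routine itself can be exercised through SciPy's LAPACK wrappers; this is a sketch assuming SciPy is installed, and the exact order of dgebrd's return tuple may differ slightly across SciPy versions:

```python
import numpy as np
from scipy.linalg import lapack

A = np.random.default_rng(0).standard_normal((6, 4))

# dgebrd wraps LAPACK's DGEBRD. Among its outputs are the diagonal d
# and the superdiagonal e of the bidiagonal form, plus an info flag.
out = lapack.dgebrd(A)
d, e, info = out[1], out[2], out[-1]
assert info == 0

# Sanity check: the bidiagonal form must share A's singular values.
B = np.diag(d) + np.diag(e, 1)
assert np.allclose(np.linalg.svd(A, compute_uv=False),
                   np.linalg.svd(B, compute_uv=False))
```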
This paper describes the implementation and evaluation of the standard and the blocked algorithms for matrix bidiagonalization using Householder transformations on the Trident processor.
The Trident processor [6, 7] has three levels of instruction set architecture (scalar, vector, and matrix) to express parallelism to the hardware directly, instead of extracting it dynamically with complicated logic (superscalar architectures [8]) or statically with compilers (VLIW architectures [9]). Like vector architectures [10, 11], the Trident processor extends a scalar core with parallel lanes; each lane contains a set of vector pipelines and a slice of the register file. However, the Trident processor can effectively process not only vector but also matrix data on the parallel lanes. High-level vector and matrix instructions provide a higher-level interface for programming and for expressing parallelism to the hardware, which leads to high performance, a simple programming model, and compact executable code. They also avoid unnecessary memory references.

Trident processor architecture
[Figure 1: Vector/matrix unit]
The Trident processor extends a scalar core with a vector/matrix unit, shown in Figure 1, to process vector and matrix data. The vector/matrix unit consists of P parallel lanes. Each lane contains an execution datapath in the form of vector pipelines and a slice of the ring and communication register file. The execution datapath is capable of executing fundamental arithmetic/logic operations on a 1-D array of data at a rate of one element per clock cycle. Each datapath receives identical control but different input elements in each clock cycle. Besides the execution datapath, each lane has a set of ring registers, based on local communication, to store and cyclically shift 1-D data. Each ring register has only one port for reading and another for writing 1-D data. While a 1-D cyclic shift is sufficient for many element-wise vector and matrix operations, there are others, such as reduction operations, that require full connectivity between all parallel lanes. The vector/matrix unit therefore has a set of communication registers to store and cyclically shift 1-D data across the parallel lanes.
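
A toy NumPy model may help to see how the lanes divide the work; the names here are illustrative, not Trident ISA mnemonics. Element-wise work stays local to each lane, while a reduction rotates partial results across lanes, which is the role of the communication registers:

```python
import numpy as np

P = 4                                    # number of parallel lanes
x = np.arange(16, dtype=float)
y = np.ones(16)
lanes_x = x.reshape(P, -1)               # each row = one lane's register slice
lanes_y = y.reshape(P, -1)

# Element-wise multiply: every lane works on its own slice,
# with no communication between lanes.
partial = (lanes_x * lanes_y).sum(axis=1)    # one partial sum per lane

# Reduction: rotate the partial sums through the communication
# registers, accumulating one incoming value per step (P - 1 steps).
total = partial.copy()
for _ in range(P - 1):
    partial = np.roll(partial, 1)        # cyclic shift across the lanes
    total += partial

# After P - 1 shifts, every lane holds the complete dot product.
assert np.allclose(total, x @ y)
```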

2. A dual-time vector clock based synchronization mechanism for key-value data in the
SILENUS file system

Abstract—The SILENUS federated file system was developed by the SORCER research group at Texas Tech University. The
federated file system with its dynamic nature does not require any configuration by the end users and system administrators. The
SILENUS file system provides support for disconnected operation. To support disconnected operation a relevant synchronization
mechanism is needed. This mechanism must detect and order events properly. It must also detect possible conflicts and resolve them in a consistent manner. This paper describes the new synchronization mechanism needed to provide data consistency. It introduces dual-time vector clocks to order events and detect conflicts. A conflict resolution algorithm is defined that does not require user interaction. The paper also introduces the switchback problem and shows how it can be avoided. The synchronization mechanisms
presented in this paper can be adapted to synchronize any key-value based data in any distributed system.
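
As a sketch of the ordering mechanism, here is a plain single-timestamp vector clock in Python; the paper's dual-time variant pairs each entry with a second time component, which is omitted here, and all names are our own:

```python
from dataclasses import dataclass, field

@dataclass
class VectorClock:
    clock: dict = field(default_factory=dict)   # store id -> event counter

    def tick(self, node):
        # Local event: advance this store's own counter.
        self.clock[node] = self.clock.get(node, 0) + 1

    def merge(self, other):
        # On synchronization, take the element-wise maximum.
        for node, t in other.clock.items():
            self.clock[node] = max(self.clock.get(node, 0), t)

    def happened_before(self, other):
        return (self.clock != other.clock and
                all(t <= other.clock.get(n, 0) for n, t in self.clock.items()))

def conflicts(a, b):
    # Concurrent updates: neither causally precedes the other.
    return (a.clock != b.clock and
            not a.happened_before(b) and not b.happened_before(a))

# Two metadata stores update the same key while disconnected.
a, b = VectorClock(), VectorClock()
a.tick("store1")
b.tick("store2")
assert conflicts(a, b)         # concurrent: needs conflict resolution
b.merge(a)
b.tick("store2")               # resolved value written after the merge
assert a.happened_before(b)    # now causally ordered, no conflict
```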

The SILENUS File System


The SILENUS file system provides a grid data storage solution using loosely coupled replicated services. Each service runs independently on any number of machines, and services can discover each other dynamically. These services federate together to provide one storage system to the user. Two of these services are the byte store and the metadata store. We assume that the metadata about a file is relatively small compared with the file's content. A byte store service stores the actual file data; in a basic hardware analogy, this would be the hard drive itself. Files in a byte store are identified uniquely by the ID of the byte store and an entry ID within the byte store. These ID numbers never change, which makes the file storage independent from file metadata such as the file name. The byte store services provide nothing but support for file storage; the advantage is that this service can then be optimized for performance. A metadata store provides attributes for the files stored in the file system. By analogy to a traditional storage system, the metadata store can itself be considered the file system. The metadata information creates the well-known hierarchical structure. Files in the metadata store are identified by universally unique identifiers (Uuid), and the metadata provides the mapping to and from file names. Metadata stores are synchronized while connected, so all metadata stores contain the same information. Should a metadata store be disconnected while its information changes, it will be resynchronized when it discovers the other operational metadata stores after reconnection. This paper describes that synchronization mechanism.
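
A compact sketch of this split between the two services follows; the structures and function names are illustrative, not the SILENUS APIs:

```python
import uuid

BYTE_STORE_ID = uuid.uuid4()
byte_store = {}                        # entry ID -> raw bytes

def store_bytes(data):
    entry_id = uuid.uuid4()
    byte_store[entry_id] = data
    return (BYTE_STORE_ID, entry_id)   # permanent address of the content

metadata = {}                          # file Uuid -> attributes

def create_file(name, data):
    file_id = uuid.uuid4()
    metadata[file_id] = {"name": name, "location": store_bytes(data)}
    return file_id

fid = create_file("/docs/report.txt", b"hello")
# Renaming touches only the metadata record; the byte store is unchanged.
metadata[fid]["name"] = "/docs/report-v2.txt"
```

Because the (byte store ID, entry ID) pair never changes, synchronizing metadata stores reduces to synchronizing key-value entries keyed by Uuid, which is exactly what the dual-time vector clock mechanism orders.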
Other notable services are the SILENUS facade service and the legacy adapters. The SILENUS facade service provides an
