Multithreaded Algorithms For The Fast Fourier Transform
ABSTRACT
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on dataflow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. The second approach, referred to as the sender-initiated algorithm, follows a dataflow model based on the producer-consumer style of programming and can be adapted to different architectural parameters for achieving high performance. The implementations of the proposed algorithms have been carried out on the EARTH (Efficient Architecture for Running THreads) platform. For both algorithms, we analyze the ratio of remote versus local threads and study its impact on the experimental results. Our implementation results show that for certain block sizes on a fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For a large number of processors, both algorithms perform well, yielding execution times of only 10 msec for an input of 16K data points on a 64-processor machine, assuming each processor runs at a 140 MHz clock speed.
Categories and Subject Descriptors
F.2 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems: complexity measures, performance measures
General Terms
Fine-Grained, Multithreading, Dataflow Architecture, Parallel Algorithms, Non-Preemptive

Department of Electrical and Computer Engineering, 140 Evans Hall, University of Delaware, Newark, DE 19716. Email: {thulasir, theobald, ggao}@capsl.udel.edu, ashfaq@eecis.udel.edu.
1. INTRODUCTION

Traditionally, digital image/signal processing algorithms are computationally intensive due to the large amount of data involved in the underlying applications. For example, a typical multispectral image may have a resolution of 8192 x 8192 pixels with 8 bits per pixel and 125 spectral bands, resulting in a spatial data set containing more than 8 Gbytes for each scene. Similarly, application of inverse scattering techniques to obtain material properties of the objects in a target image involves solving large sparse systems of linear equations whose matrices typically grow as big as 100,000 x 100,000. Performing transform operations such as the FFT, DCT, or wavelet transform in real time on such large data sets requires high performance computing [20, 19, 11]. The Fast Fourier Transform (FFT) [5] has been studied extensively as a frequency analysis tool in diverse application areas such as audio, signal, and image processing [18], and several other real-time data applications [7, 16]. In general, FFT-based frequency analysis of a multidimensional data set can be realized by performing 1-D FFTs alternately on each dimension of the data, interleaved with data transpose steps. The 1-D FFT on an input of N data points requires (N/2) log2(N/2) complex multiplication operations, which take most of the computation time for large data sets. The FFT problem has been studied on various parallel machines [13]. It can be well parallelized using the shuffle exchange network [2, 23]. Other parallel implementations have been performed on linear arrays [25], hypercubes [1, 2], and mesh architectures [12, 11]. Two types of latencies are normally embedded in parallel implementations: communication latency (due to remote accesses) and synchronization latency (due to data dependencies) [10]. Conventional message-passing MPPs do not yield high performance if such latencies are frequent in the parallel solutions employed. Several techniques at the software and hardware levels (such as superscalar, superpipelined, and VLIW execution, and prefetching) [8] have been used to hide or tolerate both communication and synchronization latencies, but the most general technique is multithreading. Multithreading tries to overlap computation with communication by means of threads (a thread is a set of instructions executed sequentially), thereby tolerating latencies. This paper investigates multithreaded algorithms and implementations for the Fast Fourier Transform on fine-grained multithreaded computing paradigms. We define a fine-grained multithreaded paradigm as one that has an abundant number of threads and a minimal cost of switching between threads [24]. The imbalance between computation and communication in a parallel FFT algorithm and the global nature
of the embedded communication patterns make it an ideal candidate for multithreaded platforms. Sohn et al. [22] have studied the FFT problem on the EM-X multithreaded architecture. Given N points (N a power of 2) and P processors, N/P points are partitioned and distributed to each processor. The iterative FFT algorithm is implemented by creating h threads in each processor to handle the N/P points, so that each thread operates on N/(Ph) points. It is claimed that on the EM-X architecture, 2 to 3 threads provide the best overlap with communication. Matteo Frigo and Steven Johnson [6] have developed a set of library C functions called codelets to compute the DFT for arbitrary input size and real or complex numbers. A compiler called genfft takes the input size N at compile time and generates a set of optimized codelets to calculate the DFT for N points; at runtime, a dynamic programming algorithm determines the best set of codelets to execute. The algorithm is portable and adaptable to various architectures. A multithreaded version of the Cooley-Tukey DFT algorithm using the divide-and-conquer approach has also been developed in the multithreaded language Cilk [15]. In this paper, we study the FFT problem on non-preemptive multithreaded architectures; in a non-preemptive model, a thread once started runs to completion. Assuming this model of computation, we develop two different dataflow-style algorithms for the FFT problem. The first is a fine-grained algorithm referred to as the receiver-initiated algorithm. In this algorithm, a parent-child relationship is established between threads, while fully exploiting the underlying parallelism in the FFT problem; the number of threads scales linearly with the product of the input size and the number of processors. The second algorithm, referred to as the sender-initiated algorithm, is a coarse-grained algorithm where the number of threads can be set based on the architectural characteristics of the target platform. This algorithm models the FFT problem as a producer-consumer problem and can be adapted to different architectural parameters for achieving high performance. We have used the EARTH (Efficient Architecture for Running THreads) [9, 24] platform for our implementations, which is a fine-grained, non-preemptive, dataflow architecture. We present analytical and experimental results for both algorithms. Our implementation results show that for certain block sizes on a fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For a large number of processors, both algorithms perform well, yielding execution times of only 10 msec for an input of 16K data points on a 64-processor machine, assuming each processor runs at a 140 MHz clock speed. For reference purposes, we have also implemented the best known sequential algorithm for the FFT on a single MANNA node; it takes 866 ms to perform the FFT on 2^16 data points on an i860 processor [24]. The rest of the paper is organized as follows: Section 2 presents the multithreaded algorithms; analytical results are presented in Section 3; the experimental framework of the EARTH model is given in Section 4; performance results are presented in Section 5; and our conclusions are given in Section 6.
2. MULTITHREADED ALGORITHMS

The FFT problem may be solved recursively or iteratively. In general, the iterative version of the FFT algorithm is more suitable for distributed memory parallel machines [13, 14]. In the following we present two dataflow-style multithreaded algorithms for the FFT computation based on the iterative solution. These algorithms differ from each other in the dataflow style employed and in the number and sizes (weights) of the threads.
2.1 Receiver-Initiated Algorithm
The receiver-initiated algorithm is a fine-grain multithreaded algorithm based on the Cooley-Tukey style [4] of the FFT signal flow graph. Let us assume we have N (N = 2^m) data elements and P (P = 2^p) processors. A butterfly computation (Figure 1) is performed on each of the data points in every iteration, and there are altogether log N iterations. The butterfly computation can be described conceptually as follows: a and b are points, or complex numbers; the upper part of the butterfly operation computes the summation of a and b scaled by a twiddle factor w (a + bw), while the lower part computes the difference (a - bw). In each iteration, there are N/2 summations and N/2 differences.
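As a concrete illustration, the butterfly described above can be written in a few lines of C99 complex arithmetic. This is a minimal sketch of the standard radix-2 butterfly, not the paper's Threaded-C code:

#include <complex.h>

/* One radix-2 butterfly on points a and b with twiddle factor w:
   the upper output is a + b*w, the lower output is a - b*w. */
static void butterfly(double complex *a, double complex *b, double complex w)
{
    double complex t = (*b) * w;   /* the one complex multiplication */
    *b = *a - t;                   /* lower part: difference */
    *a = *a + t;                   /* upper part: summation */
}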
In general, a parallel algorithm for the FFT with a blocked distribution of N elements on P processors involves communication for log P iterations and terminates after log N iterations. If we assume shuffled input data at the beginning, as in the Cooley-Tukey style (Figure 2), the first log N - log P iterations require no communication. Therefore, during the first log N - log P iterations, a sequential FFT algorithm can be used inside each processor. At the end of the (log N - log P)-th iteration, the latest computed values for N/P data points exist in each processor. The receiver-initiated algorithm thus consists of two phases, the sequential phase and the multithreaded phase; the multithreaded phase starts at the end of the first log N - log P iterations.
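The two-phase structure can be sketched as a simple iteration split; the two helper functions are hypothetical stand-ins for the sequential kernel and the multithreaded phase described below:

#include <complex.h>

/* Phase split for N = 2^m points block-distributed over P = 2^p processors:
   iterations 0 .. (m-p-1) touch only local data (sequential phase);
   the last p iterations need remote data (multithreaded phase). */
void fft_node(double complex *local_points, int m, int p)
{
    int local_iters = m - p;                       /* log N - log P */
    for (int it = 0; it < local_iters; it++)
        sequential_iteration(local_points, it);    /* hypothetical kernel */
    for (int it = local_iters; it < m; it++)
        multithreaded_iteration(local_points, it); /* hypothetical threaded phase */
}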
Conceptually, the multithreaded phase starts from the final output. Consider the output data points at the end of the log N-th iteration. The butterfly computation for any data point in this iteration requires two data points from the previous iteration (Figure 1). For the multithreading phase, the algorithm works on only N/2 output data points; the remaining N/2 data points can be generated with just one additional local butterfly operation per point. Therefore, given P processors, each processor computes the FFT for only N/(2P) data points at the final output.
Figure 2: Cooley-Tukey signal flow graph for N = 8: iterations 0 through 3 operate on the shuffled inputs x0, x4, x2, x6, x1, x5, x3, x7 and produce the outputs y0 through y7, with twiddle factors w_8^0 through w_8^3.
The above algorithm can be illustrated with an example. Consider Figure 2 with N = 8 data elements and P = 4 processors. Since the multithreaded algorithm is performed on only N/2 points, each processor contains N/(2P) output data points: in this example, P0 has y0, P1 has y4, P2 has y2, and P3 has y6. The first log N - log P iterations are performed locally by each processor, and therefore the computed value of the butterfly operation for each data element is available at the end of these iterations. The processors then switch to the multithreaded version of the algorithm.

Let us consider one particular data element, y0, at the log N-th iteration, that is, at iteration 3. The receiver-initiated approach starts from y0 and proceeds backward. Processor P0, which holds y0, sends out two requests using two separate threads: one to its mate processor P2 and the other to itself, for the computed data elements x1 and x0 at iteration 2, respectively. Of course, at iteration 2, x0 and x1 have not yet been computed; therefore, consider the actions of processor P2 at iteration 2. Processor P2, upon receiving and executing the thread from P0, sends out two more threads, one to itself and one to P3, for data elements at iteration 1. At iteration 1, the latest locally computed data values exist, and P2 and P3 transfer their values (x1 + w_8^0 x5) and (x3 + w_8^0 x7), respectively, to processor P2, which requested these data at iteration 2. At this point P2 computes the butterfly operation and sends the result back to P0, which requested it at iteration 3. Now P0 has received one data element; the same kind of communication is performed to receive the second one. When the two data points have arrived at P0, the butterfly operation is performed and y0 is computed. The processor P0 holding y0 then computes its mate's value y4 in the last iteration.

Note that the butterfly computation is performed only after the two data values have arrived for the two threads sent out. The thread computing the butterfly is therefore synchronized by two signals, whose arrival acknowledges the arrival of the two data elements computed at the previous iteration.
In the above scheme, a parent-child relationship is established between threads. This parent-child relationship, together with the synchronization signals that act as acknowledgments, allows efficient multithreading. It also ensures the correctness of the program without any data races or corruption of data. Moreover, there is an equal number of threads per processor, thereby balancing the work load. For the N/(2P) data points per processor, 2^i * N/(2P) threads are sent out at iteration i, where i = 1, ..., log P. The processors execute the butterfly computation for each related pair of threads as the data points arrive, and these can arrive in any order. Therefore, a processor either sends out threads or performs computations; it never sits idle. The algorithm efficiently overlaps computation with communication. Note that in this algorithm, butterfly computations over the same data elements are computed in different processors, giving rise to a redundant computation load. However, the algorithm can easily be adapted to varying degrees of parallelism and synchronization overheads. The analytical section explains the complexity analysis in detail.
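A minimal sketch of this parent-child synchronization, assuming a hypothetical runtime in which a sync slot counts down the two acknowledgments before the parent's butterfly runs (EARTH's split-phase operations play this role in the actual implementation):

#include <complex.h>

/* A parent butterfly waiting on two child requests. The runtime is assumed
   to serialize deliveries to a slot, as EARTH's Synchronization Unit does. */
typedef struct {
    int pending;            /* initialized to 2, one per requested input */
    double complex in[2];   /* values delivered by the two child threads */
    double complex w;       /* twiddle factor of this butterfly */
} butterfly_slot;

/* Called by a child thread when it delivers a value from the previous
   iteration; the last arrival triggers the parent's butterfly. */
void deliver(butterfly_slot *parent, int which, double complex value)
{
    parent->in[which] = value;
    if (--parent->pending == 0) {
        double complex t = parent->in[1] * parent->w;
        send_up(parent->in[0] + t);  /* hypothetical: pass a + bw upward;
                                        a - bw is computed locally in the
                                        final iteration */
    }
}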
2.2 Sender-Initiated Algorithm

The sender-initiated algorithm is based on the Gentleman-Sande [7] signal flow graph for the FFT problem. The number of threads is fixed at compile time to be equal to N/B, where B is the block size, i.e., the number of contiguous data elements per thread. The N/B threads are distributed to the processors in a round-robin fashion, thereby balancing the load across the processors. Each thread performs the FFT computation on its B data points.
The Gentleman-Sande signal flow graph (Figure 3) can be viewed as each data point requiring a mate data point for performing the FFT computation. The mate may be located in a different thread on a different processor, depending on the FFT iteration. In this case, a thread (called the producer thread) for each of its points sends the recently computed value to the thread (called the consumer thread) containing the mate points. The sending and receiving of information requires a certain amount of synchronization between the producing and consuming threads to be set up a priori.
Figure 3: Gentleman-Sande signal flow graph for N = 8 on inputs x0 through x7. Edges marked + compute a + bw; edges marked - compute a - bw.
As mentioned above, each thread consumes data from the previous iteration and produces data for the next iteration. This producer-consumer function is realized as a second-level thread (a thread within a thread), called a fiber. The concept of second-level threads works as follows. The data is produced in a producer thread and, using a split-phase transaction operation, the produced values are delivered to the corresponding consumer thread in another processor. The consumer thread in the other processor is activated when it receives a synchronization signal from its mate thread. Note that at each iteration, a thread has to determine the location of its mate thread and set up the synchronization slots appropriately at runtime. Therefore, the producer and consumer threads act as second-level threads (fibers) within a threaded function. The synchronization slots act as acknowledgment signals, and the second-level threads comprise a dataflow style of programming.

We illustrate the above producer-consumer approach with an example (Figure 3); we have drawn the signal flow graph differently from [7] for easier explanation of the sender-initiated algorithm. Assume N = 8, P = 4 and B = 2. Then there are N/B = 4 threads. Points x0, x1 are in thread 0; x2, x3 in thread 1; x4, x5 in thread 2; and x6, x7 in thread 3. These threads are distributed to each of the 4 processors (thread 0 is executed by P0, thread 1 by P1, etc.). In Figure 3, all edges going upwards are marked positive (+) and all edges going downwards are marked negative (-), indicating that a + bw or a - bw is computed at the + and - marked points, respectively. Consider the first iteration of the algorithm. The mate points of x0, x1 in thread 0 on P0 are x4, x5, located in thread 2 on P2. Thread 0 computes x0 w^n and x1 w^n (where n = 0, 1, ..., or 3) and sends the computed values to the consuming mate thread (in thread 2 of P2). Similarly, the consuming fiber of thread 0 on P0 receives the computed values x4 w^n and x5 w^n (where n = 0, 1, ..., or 3) from the producing fiber of thread 2 on P2. In the next iteration, thread 0's mate thread is thread 1. The setting up of synchronization slots between the threads is performed at the start of each new iteration.
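A sketch of one such thread's producer and consumer fibers, following the example above (B = 2); send_to() and the slot handling are hypothetical stand-ins for EARTH's split-phase transaction:

#include <complex.h>

/* One sender-initiated thread owning B = 2 contiguous points. */
typedef struct {
    double complex pt[2];        /* this thread's points */
    double complex from_mate[2]; /* filled in by the mate's producer fiber */
} fft_thread;

/* Producer fiber: push this thread's contribution to the mate thread;
   the send carries a synchronization signal along with the data. */
void producer_fiber(fft_thread *self, int mate_thread, double complex w)
{
    send_to(mate_thread, self->pt[0] * w, self->pt[1] * w);  /* hypothetical */
}

/* Consumer fiber: runs only after the mate's data arrived (slot fired). */
void consumer_fiber(fft_thread *self)
{
    self->pt[0] += self->from_mate[0];
    self->pt[1] += self->from_mate[1];
}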
3. ANALYTICAL RESULTS

3.1 Receiver-Initiated Algorithm

In the receiver-initiated algorithm, given N points and P processors, the data points are partitioned into blocks of size N/P and each block is assigned to one processor. This algorithm can be implemented incorporating varying degrees of parallelism, depending on the target architecture and the number of processors. In the following, we analyze this algorithm under two such scenarios.
In the first scenario, we exploit full parallelism by generating the maximum number of threads, at the expense of redundancy in the butterfly computations at different processing nodes. In the second scenario, we limit the number of threads generated to completely avoid redundant computations, at the expense of synchronization overheads between different processors. The implementation results are reported only for the first scenario. In both scenarios, the ratio of local versus remote threads is studied. The performance of the algorithm lies in the balance between the number of remote and local threads and in their overlapped asynchronous scheduling.
3.1.1 Scenario 1:
Consider a particular point y_i at the log N-th iteration during the multithreaded phase of the algorithm. Initially, it sends out two requests in the form of two threads. These threads, at the (log N - 1)-th iteration, in turn send out two more threads each, for a total of four threads. This process continues over the log P iterations of the multithreaded phase. The thread generation process can thus be viewed as a binary tree of height log P rooted at each data point y_i in the final iteration. The internal nodes of such a tree correspond to the threads performing butterfly computations and the arcs correspond to the threads gathering the data. Thus, per tree, the number of threads performing (remote) communication is 2(P - 1) and the number of local threads performing the local computations is (P - 1). There are N/2 such binary trees of parent-child threads, one corresponding to each even-indexed data point in the final output column; the odd-indexed data is computed automatically by the butterfly computation in the final iteration. Therefore, the total number of threads is N(P - 1) and the number of butterfly computations for the final log P iterations is (N/2)(P - 1).

Considering the first log N - log P iterations as well, the processors perform local butterfly computations on N/P points over these iterations. This can be realized as a sequential FFT algorithm over N/P data points using a single thread in each processor. Therefore, there are P local threads for the log N - log P iterations. However, each thread in this case performs (N/(2P)) log(N/P) butterfly computations. For the next log P iterations, the multithreaded algorithm is performed.
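The per-tree counts quoted above follow from summing the levels of the request tree:

\[ \sum_{i=1}^{\log P} 2^i = 2P - 2 = 2(P-1) \ \text{remote request threads per tree,} \qquad \frac{N}{2} \cdot 2(P-1) = N(P-1) \ \text{threads in total.} \]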
Summarizing, the total number of local threads is P + (N/2)(P - 1) and the total number of remote threads is N(P - 1). The ratio of local threads versus remote threads is therefore

\[ \frac{P + \frac{N}{2}(P-1)}{N(P-1)} = O(1). \qquad (1) \]
3.1.2 Scenario 2:

In the above analysis, each of the N/2 points requests data from its mate processors by means of threads, and these requests yield a binary tree pattern. However, the N/2 binary trees created are not necessarily unique, thereby duplicating work. The algorithm can be implemented by realizing only unique trees, decreasing the amount of computation and avoiding sending an unnecessary number of threads. For example, if y_i and y_j follow the same path, then y_i performs the butterfly computations and communicates the computed value to y_j, eliminating y_j from performing the same computations as y_i. This can be realized as follows: initiate 2 * (N/2) threads at the log N-th iteration, each collecting a unique data point generated at the previous iteration; next, initiate only 4 * (N/4) threads at the (log N - 1)-th iteration. This process continues for log P iterations, at which point 2^(log P) * (N / 2^(log P)) threads are sent out. Therefore, in the multithreaded phase of the algorithm, the total number of threads involving communication is N log P and the number of butterfly computations is (N/2) log P.

Summarizing, the total number of local threads including the sequential phase is P + (N/2) log P, and the ratio of local to remote threads is

\[ \frac{P + \frac{N}{2}\log P}{N \log P} = O(1). \]

The total number of butterfly computations, including the sequential phase, is (N/2) log(N/P) + (N/2) log P = (N/2) log N.
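Each level of this scheme sends the same number of threads, which is where the N log P total comes from:

\[ 2 \cdot \frac{N}{2} = 4 \cdot \frac{N}{4} = \cdots = 2^{\log P} \cdot \frac{N}{2^{\log P}} = N \ \text{threads per iteration,} \qquad N \log P \ \text{over the } \log P \ \text{iterations.} \]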
3.2 Sender-Initiated Algorithm

In the sender-initiated approach, the algorithm can be formulated as if the number of processors in the system has no bearing on the creation of threads. For a given block size B and N data points, there are N/B threads in the system. With b = N/(BP) threads per processor, the threads are distributed to the processors in a round-robin fashion.

The number of threads in the algorithm can thus be adapted to the architectural features of the target platform. Each thread consists of producer and consumer fibers (microthreads); however, in the analysis we treat fibers and threads as the same. In the multithreaded phase of the algorithm, i.e., the first log P iterations, the number of remote threads is (N/B) log P. The number of local threads during the multithreaded phase is (N/2) log P, counting the fibers that perform the butterfly operations. However, there is an additional cost of (N/B) log P over the log P iterations for setting up the synchronization slots at each iteration. The ratio of local to remote threads is

\[ \frac{\frac{N}{2}\log P}{\frac{N}{B}\log P + \frac{N}{B}\log P} = O(B). \]

When N/B = P, this yields a very coarse-grained algorithm with a local to remote threads ratio of O(N/P). When B is a small constant, the algorithm is highly fine-grained with a local to remote threads ratio of O(1). The number of butterfly computations in all cases is (N/2) log N.
Note that in terms of the complexity analysis, the sender-initiated approach is comparable to the second scenario of the receiver-initiated algorithm.
4. EXPERIMENTAL FRAMEWORK
In the following, we briefly describe the EARTH model and platform that have been used in our experiments. EARTH (Efficient Architecture for Running THreads) [9] is a multithreaded program execution model targeted at high-performance parallel and distributed multiprocessing. The EARTH platform supports latency tolerance by efficient exploitation of the fine-grained parallelism available in many applications. In the EARTH programming model, code is divided into threads that are scheduled atomically using dataflow-like synchronization operations [9]. Conceptually, each EARTH node consists of an Execution Unit (EU), which executes the threads, and a Synchronization Unit (SU), which performs the EARTH operations requested by the threads. The current hardware designs for EARTH use an off-the-shelf high-end RISC processor for the EU and custom hardware for the SU [17]; however, other implementations are also possible. In the EARTH programming model, a programmer can express parallelism by utilizing two forms of threads: first-level and second-level threads. First-level threads are declared as threaded functions. When a threaded function is invoked, a thread is spawned to execute the function; the caller thread continues its own execution without waiting for the return of the forked threaded function. The body of a function can be further partitioned into fibers [24], referred to as second-level threads. Whenever a user suspects that an operation may incur unpredictable latencies, the user can choose to use an EARTH split-phase transaction operation. In a split-phase transaction, data transfer and synchronization are combined into an atomic operation to avoid potential race conditions in the network. A thread need not block until a synchronization signal is received when using this operation; it may
execute other instructions. A synchronization signal may trigger the spawning of other threads. For example, a user may decide to put the consumer that will need the result of a long-latency operation in a different fiber; the producer thread then synchronizes the consumer thread when its data is ready. This ensures that a fiber can be executed in a non-preemptive fashion, avoiding any waste of processor resources. The EARTH runtime system will hide the latency through multithreading as long as the program has enough parallelism to generate threads or fibers. Currently, programs are written in Threaded-C, which extends the C language with multithreading instructions. It is clean and powerful enough to be used as a user-level, explicitly parallel programming language. The EARTH programming model has been realized on the MANNA platform. MANNA (Massively parallel Architecture for Numerical and Nonnumerical Applications) is a multiprocessor platform built by GMD-FIRST. Each processing node consists of two Intel i860 XP RISC CPUs (similar to the Intel Paragon), but without the OS "firewall", to facilitate runtime system research and experiments.
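The split-phase pattern described above can be sketched in plain C. The primitives here are hypothetical stand-ins for the corresponding Threaded-C operations, shown only to make the control flow concrete:

/* Split-phase transaction sketch: data transfer and synchronization are
   combined; the consumer fiber is scheduled only when its sync slot
   reaches zero, so no fiber ever blocks (non-preemptive execution). */
typedef struct {
    int    count;              /* remaining signals before the fiber runs */
    void (*fiber)(void *);     /* consumer fiber to schedule */
    void  *arg;
} sync_slot;

void data_sync(double *dst, double value, sync_slot *slot)
{
    *dst = value;                          /* the data transfer ...        */
    if (--slot->count == 0)                /* ... and the synchronization  */
        schedule(slot->fiber, slot->arg);  /* hypothetical scheduler call  */
}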
5. PERFORMANCE RESULTS
Figure 4: Receiver-Initiated Algorithm: Scalability w.r.t. machine size with varying problem size on EARTH-SPN (Total Elapsed Time)
In this section, we discuss the performance results for the algorithms presented in the previous sections. The algorithms have been implemented in the Threaded-C language on SEMi, the simulator for EARTH-MANNA. The SEMi simulator supports two configurations: EARTH-MANNA-D and EARTH-MANNA-S. In Section 4, we explained that the EARTH EU and SU emulate the two processors of the MANNA machine; this is called the dual-processor (DUAL) version, or EARTH-MANNA-D. But since most multiprocessors have only one CPU per node, we also have a single-processor (SPN) implementation where only one processor of the MANNA machine emulates both the EU and SU. With only a single CPU to execute both the program code and the multithreading support code, it is necessary to find an efficient way to switch from one to the other. The EARTH operations are therefore replaced by in-line code in the EU rather than sending the requests to the SU; for some simple operations, doing them in-line in the EU may take less of the EU's time than sending the request to the SU [24]. We have experimented with both configurations for both the sender-initiated and receiver-initiated algorithms. For reference purposes, we have also implemented the best known sequential algorithm for the FFT on a single MANNA node; it takes 866 ms to perform the FFT on 2^16 data points on an i860 processor [24]. In the receiver-initiated algorithm, we partition the output points into N/P contiguous points and distribute them to each of the processors. The first log N - log P iterations are local computations and the last log P iterations require remote communication, realized as the multithreaded phase of this algorithm.
Figures 4 and 5 show the performance results with varying problem size. The total execution time of the entire FFT algorithm is depicted in these figures on EARTH-SPN and EARTH-DUAL. This includes both the log N - log P iterations of local computation and the log P iterations involving remote computation/communication (multithreading). Notice that for a small number of processors the execution time is steep, but as the machine size increases the performance improves significantly. For small values of P, the number of threads to be handled is relatively large, and that is the reason for the higher execution times in such cases. For example, if N = 2^14 and P = 64, each processor contains 2^8 data points (N/P data points), and the number of threads generated in the system is N(P - 1), which is 63 x 2^14. Since at each iteration a processor requires a mate to compute its butterfly computation, each processor sends out a thread to its mate processor requesting data. Each processor is either busy sending a thread requesting data or busy handling a request, and this happens at every iteration. Therefore, the processor load is equally balanced, eliminating the need to be idle at any point in time. The algorithm overlaps computation with communication appropriately, thereby producing a near-linear speedup.
Comparing the performance of the SPN and DUAL configurations, we observe that if we flood the system with enough parallel threads, the performance of the multithreaded implementation improves significantly as the number of processors is increased. One implication is that as long as there are enough parallel threads in the system, the processors are never idle. The performance results with respect to varying machine size are depicted in Figures 6 and 7. As the figures show, with the increase in the machine size the execution time decreases for the various problem sizes; the relative speedup is about 50% on 64 processors. We also observe that in these figures, the SPN configuration performs better than the DUAL configuration. In the
Figure 5: Receiver-Initiated Algorithm: Scalability w.r.t. machine size with varying problem size on EARTH-DUAL (Total Elapsed Time)
SPN version, a single processor performs both the task of the EU and the SU; that is, it handles the network communication/synchronization as well as the computation of the algorithm. However, this does not degrade the performance of the algorithm; indeed, its performance is better than that of the DUAL configuration, which has two processors to perform the tasks of the EU and SU. In SPN, the EU performs all the EARTH operations efficiently in-line, without the need to send them to the SU as in the dual-processor version, which creates overhead and wastes CPU time unnecessarily. Figures 8 and 9 show the scalability results as the input problem size increases, for both the DUAL and SPN configurations. The number of points per thread is 16 (B = 16); therefore, for N = 2^12 there are 256 threads and for N = 2^16 there are 4096 threads in the system. The EARTH-SPN version performs better than the EARTH-DUAL version, especially for small numbers of processors. However, for large numbers of processors, we observe that the execution time in both cases is very small for all problem sizes. The proper overlap of communication and computation has produced better results even with one processor performing both tasks. In the DUAL version, the overhead involved in the EU sending messages to the SU creates a bottleneck every time the EU needs to communicate remotely, as mentioned earlier for the receiver-initiated approach. This is the reason for the poor performance at very small numbers of processors in the DUAL version, as in the case of the receiver-initiated approach. Figures 10 and 11 show the scalability results as the number of points per thread is increased on a fixed problem size, N = 2^12. For B = 256, the number of threads in the system is N/B = 16. We observe that beyond 16 processors there is no change in the execution time: the maximum number of processors that can be kept busy using round-robin load balancing is 16, since there are only 16 threads in the system. There is not enough parallelism (threads) in the system to
Figure 6: Receiver-Initiated Algorithm: Scalability w.r.t. problem size with varying machine size on EARTH-SPN (Total Elapsed Time)
balance the load on all processors; beyond 16 processors, the rest are idle. This is the reason for the stationary execution time after 16 processors for B = 256. However, for B = 4 (1024 threads) and B = 16 (256 threads), there is enough parallelism to keep all processors busy, and we therefore see a gradual decrease in the execution time as the number of processors increases. The best result is obtained when there are 16 threads and 16 processors, which leads to a coarse-grained implementation with one thread on each processor. If there is more than one thread on a processor (e.g., 1024/64 = 16 threads per processor), each processor executes a thread to completion before switching to its next thread. There are B points in a thread, so each thread executes the FFT algorithm sequentially on its B points, then uses a split-phase transaction to send the produced results to the consumer thread; only after this split-phase transaction does the processor switch to the next thread. This is the reason that the execution time for 32 processors with a block size of 4 is slightly higher than with block size 16. Comparing the SPN and DUAL versions, the SPN version again does better, and the same reasoning as explained previously holds.
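The thread counts quoted in this discussion are easy to reproduce; a small, hypothetical helper for illustration:

#include <stdio.h>

/* Reproduces the thread counts quoted in the text: N/B threads in the
   sender-initiated system, distributed round-robin over P processors. */
int main(void)
{
    const int N = 1 << 12;                  /* fixed problem size 2^12 */
    const int P = 64;
    for (int B = 4; B <= 256; B *= 4) {     /* block sizes 4, 16, 64, 256 */
        int threads = N / B;
        printf("B = %3d -> %4d threads (%s)\n", B, threads,
               threads >= P ? "all processors busy" : "some processors idle");
    }
    return 0;
}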
We have noticed poor scalability in the sender-initiated approach for a fixed block size on different machine sizes. The number of threads is proportional to the number of blocks and is independent of the number of processors. This indicates that one has to choose an appropriate block size to provide enough threads in the system for full load balancing across the processors. Figures 12 and 13 compare the performance results of the two approaches (receiver- versus sender-initiated) on an input of size 2^14 with two different block sizes for the sender-initiated method; the comparison is between the total elapsed times in both cases. Note that for a block size of 1, the receiver-initiated approach performs slightly better than the sender-initiated approach. The reason is that for N = 2^14 there are N/B = 2^14 threads generated: each data point
Figure 7: Receiver-Initiated Algorithm: Scalability w.r.t. problem size with varying machine size on EARTH-DUAL (Total Elapsed Time)
is now a thread. These threads are distributed in a round-robin fashion to each of the processors. For P = 64, each processor contains 2^8 threads. The synchronization primitives between these threads have to be set up dynamically at runtime; at each iteration, there are 64 synchronization primitives that need to be coordinated, which is very time consuming. Thus, for a large number of processors, the receiver-initiated approach performs better. When P = 2, the number of threads per processor in the sender-initiated approach is 2^14 / 2 = 2^13; for the receiver-initiated approach there are N(P - 1) threads (2^14). However, as the figure shows, the execution times of the two approaches for P = 2 are approximately the same. In the sender-initiated approach, the number of synchronization slots that needs to be set up between the two processors for the threads at each iteration at runtime creates a performance degradation; for the receiver-initiated approach, there are too few processors to handle the huge number of threads.
Figure 8: Sender-Initiated Algorithm: Scalability w.r.t. machine size with varying problem size and fixed block size on EARTH-SPN
However, when B = 2, the number of threads is 2^13 for the sender-initiated approach, with two points per thread. Again, for P = 64 processors there are 2^7 threads per processor, each with two points. The butterfly computation within the points of a thread is sequential. Note that in Figure 13, for small numbers of processors, the synchronization slot assignment does not affect the performance of the algorithm. This is mainly because one synchronization slot is set up for both points of a thread; in general this holds when N, P and B are powers of 2. The algorithm works such that the mate points for a particular thread reside in the same thread of its mate processor. It is therefore not necessary to set up a sync slot for each of the two points in the thread; rather, one sync slot is assigned per thread, which greatly reduces the synchronization slot setup time. For only two processors, the synchronization is between the two processors alone. However, for 64 processors, the synchronization mechanism is between all 64 processors and changes dynamically at runtime, so each processor needs to set up 2^7 synchronization slots at runtime. This is again time consuming; thus, the execution time is no better than that of the receiver-initiated approach for a large number of processors. For P = 2, there are 2^6 threads per processor, compared to 2^14 threads in the receiver-initiated method; therefore, the sender-initiated method performs better there.
As the block size increases, the number of points per thread also increases, and the computation within a thread is sequential. Since there is one synchronization slot assignment per thread, the setup cost is greatly reduced as the number of threads is reduced, i.e., as the block size increases. Increasing the block size also makes the problem more coarse-grained in the sender-initiated approach. In conclusion, it is safe to say that for a given problem size and machine size, the receiver-initiated method performs better than the sender-initiated approach for smaller block sizes. In the sender-initiated method, the block size has to be chosen properly to get good performance: the more coarse-grained the problem, the better the results. For large numbers of processors, the receiver-initiated method has always performed equally well or better.
6. CONCLUSIONS
Figure 9: Sender-Initiated Algorithm: Scalability w.r.t. machine size with varying problem size and fixed block size on EARTH-DUAL
In this paper, we have presented two multithreaded algorithms for the FFT problem: receiver-initiated and sender-initiated. In the receiver-initiated approach, the multithreaded version of the algorithm, due to its fine-grain communication/computation ratio, produced superb results for large numbers of processors. This algorithm extracts full parallelism from the FFT computation; we achieve a near-linear speedup as the number of processors increases, even when there is a large number of threads in the system. In the sender-initiated approach, the number of threads in the system is fixed and can be independent of the number of processors. We observed that the best result is obtained when there is one thread per processor, which produces a coarse-grained implementation. Our implementation showed that for certain block sizes on a fixed problem size and machine size, the receiver-initiated approach performed better than the sender-initiated approach. For large numbers of processors, both algorithms perform well, yielding execution times of only 10 msec for an input of 16K data points on a 64-processor machine, assuming each processor runs at a 140 MHz clock speed. Overall, the sender-initiated algorithm gave the best performance for smaller machine sizes and certain block sizes, while for large machine sizes both algorithms performed equally well.
Figure 10: Sender-Initiated Algorithm: Scalability w.r.t. machine size with varying block size and fixed problem size on EARTH-SPN

7. REFERENCES
[1] Angelopoulos G. and Pitas I. Parallel implementation of 2-D FFT algorithms on a hypercube. In Proc. Parallel Computing Action, Workshop ISPRA, Dec. 1990.
[2] Angelopoulos G., Ligdas P., and Pitas I. Two-dimensional FFT algorithms on parallel machines. In Transputing for Numerical and Neural Network Applications, G.I. Reijns, editor, IOS Press, 1992.
[3] Cho-Chin Lin, V.K. Prasanna, and A.A. Khokhar. Scalable parallel extraction of linear features on MP-2. In Workshop on Computer Architectures for Machine Perception, pages 352-361, New Orleans, Louisiana, 1993. IEEE Computer Society Press.
[4] Cochran W.T., Cooley J.W., et al. What is the fast Fourier transform? IEEE Transactions on Audio and Electroacoustics, 15:45-55, 1967.
[5] Cooley J.W., Lewis P.A., and Welch P.D. The Fast Fourier transform and its application to time series analysis. In Statistical Methods for Digital Computers. Wiley, New York, 1977.
[6] Frigo M. and Johnson S. FFTW. http://theory.lcs.mit.edu/~fftw, 1999.
[7] Gentleman W.M. and Sande G. Fast Fourier transforms: for fun and profit. In Proc. 1966 Fall Joint Computer Conference, AFIPS 29, pages 563-578, 1966.
[8] Hennessy J.L. and Patterson D.A. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufmann, San Francisco, CA, 1996.
[9] Hum H.H.J. et al. A study of the EARTH-MANNA multithreaded system. Intl. J. of Parallel Programming, 24(4):319-347, Aug. 1996.
[10] Hwang K. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York, NY, 1993.
[11] Jamieson L.H., Delp E.J., et al. A library based program development environment for parallel image processing. In Scalable Parallel Library Conference, pages 187-194, Mississippi State University, Mississippi, 1993.
[12] Kamin R.A. and Adams G.B. Fast Fourier transform algorithm design and tradeoffs on the CM-2. In Proc. Workshop Comput. Arch. Pat. Anal. Mach. Intell., pages 184-191, Oct. 1987.
[13] Kumar V., Grama A., et al. Parallel Computing: Design and Analysis of Algorithms. Benjamin-Cummings, 1994.
[14] Leighton F.T. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, San Mateo, California, 1992.
[15] Leiserson C. Cilk. http://supertech.lcs.mit.edu/cilk, 1999.
[16] Loan C.L. Computational frameworks for the fast Fourier transform. SIAM, Frontiers in Applied Mathematics, 1992.
[17] Maquelin O. et al. Costs and benefits of multithreading with off-the-shelf RISC processors. In Proc. of the First Intl. EURO-PAR Conf., pages 117-128, Stockholm, Sweden, Aug. 1995. Springer-Verlag.
Figure 11: Sender-Initiated Algorithm: Scalability w.r.t. machine size with varying block size and fixed problem size on EARTH-DUAL
[18] Oppenheim A.V. and Willsky A.S. Signals and Systems. Prentice Hall, Englewood Cliffs, New Jersey, 1983.
[19] Pease M.C. An adaptation of the fast Fourier transform for parallel processing. Journal of the ACM, 15:252-264, 1968.
[20] Pitas I. Parallel Algorithms for Digital Image Processing, Computer Vision and Neural Networks. John Wiley and Sons, New York, NY, 1993.
[21] Prasanna V.K., Cho-Li Wang, and Khokhar A.A. Low level vision processing on Connection Machine CM-5. In Workshop on Computer Architectures for Machine Perception, pages 117-126, New Orleans, Louisiana, 1993. IEEE Computer Society Press.
[22] Sohn A., Kodama Y., et al. Fine-grain multithreading with the EM-X. In Ninth ACM Symposium on Parallel Algorithms and Architectures, pages 189-198, Newport, Rhode Island, June 1997.
[23] Stone H.S. Parallel processing with the perfect shuffle. IEEE Trans. Computers, C-20:153-161, 1971.
[24] Theobald K.B. EARTH: An Efficient Architecture for Running Threads. PhD thesis, McGill University, Montreal, May 1999.
[25] Thompson C.D. Fourier transforms in VLSI. IEEE Transactions on Computers, 32:1047-1057, 1983.
Figure 12: Comparison between the sender-initiated and receiver-initiated (half data size) approaches
Figure 13: Comparison between the sender-initiated and receiver-initiated (half data size) approaches (N = 2^14)