A Parallel FDTD Algorithm Using the MPI Library
John L. Volakis
Rad. Lab., EECS Dept.
University of Michigan
Ann Arbor, MI 48109-2122
(734) 647-1797
(734) 647-2106 (Fax)
volakis@umich.edu (e-mail)

David B. Davidson
Dept. E&E Engineering
University of Stellenbosch
Stellenbosch 7600, South Africa
(+27) 21 808 4458
(+27) 21 808 4981 (Fax)
davidson@firga.sun.ac.za (e-mail)
Readers might like to refer to the July, 1998, ACES Journal "Special Issue on Computational Electromagnetics and High Performance Computing," edited by David Davidson and Tom Cwik. We thank the authors for their contribution, which continues the particularly international flavor of recent columns!
Abstract
In this paper, we describe the essential elements of a parallel algorithm for the FDTD method using the MPI (Message Passing Interface) library. To simplify and accelerate the algorithm, an MPI Cartesian 2D topology is used. The inter-process
communications are optimized by the use of derived data types. A general approach is also explained for parallelizing the
auxiliary tools, such as far-field computation, thin-wire treatment, etc. For PMLs, we have used a new method that makes it
unnecessary to split the field components. This considerably simplifies the computer programming, and is compatible with the
parallel algorithm.
Key words: Parallel algorithms; MPI; FDTD methods; inter-process communication; process topology; multiprocessor
interconnection; perfectly matched layers
$$\nabla \times \mathbf{E} = -\mu \, \frac{\partial \mathbf{H}}{\partial t}$$
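In the Yee scheme on which the FDTD method is built, this curl equation becomes an explicit update of H from the surrounding E components, which each process applies independently inside its own subspace. As a reference point, a standard discretization of its x component (the half-integer indices and the symbols $\Delta t$, $\Delta y$, $\Delta z$ are the usual Yee notation, assumed here rather than taken from the paper) is

$$H_x^{n+1/2}\!\left(i,\,j+\tfrac{1}{2},\,k+\tfrac{1}{2}\right) = H_x^{n-1/2}\!\left(i,\,j+\tfrac{1}{2},\,k+\tfrac{1}{2}\right) - \frac{\Delta t}{\mu}\left[\frac{E_z^{\,n}\!\left(i,\,j+1,\,k+\tfrac{1}{2}\right) - E_z^{\,n}\!\left(i,\,j,\,k+\tfrac{1}{2}\right)}{\Delta y} - \frac{E_y^{\,n}\!\left(i,\,j+\tfrac{1}{2},\,k+1\right) - E_y^{\,n}\!\left(i,\,j+\tfrac{1}{2},\,k\right)}{\Delta z}\right]$$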
(Figure: the memory of process P0 and the memory of process P1.)
3.1 Creation of Cartesian Topology
Each process obtains its own data, which can be different in each case.
$H_i(1{:}nx,\,1{:}ny,\,1{:}nz) \rightarrow H_i(0{:}nx,\,0{:}ny,\,1{:}nz)$, with $i = [x, y, z]$
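Concretely, this extension simply adds a ghost layer at index 0 in the x and y directions, into which the values computed by the neighbouring process are received. A minimal C sketch of such a storage scheme (the struct, the accessor, and the x-fastest ordering are our assumptions, not the authors' code):

#include <stdlib.h>

/* One H component extended by a ghost layer in x and y:
   valid indices i = 0..nx, j = 0..ny, k = 1..nz, as in H_i(0:nx, 0:ny, 1:nz). */
typedef struct { int nx, ny, nz; double *v; } Field;

/* x-fastest ordering, so the data along x are contiguous in memory */
static inline double *Hval(Field *f, int i, int j, int k)
{
    return &f->v[((size_t)(k - 1) * (f->ny + 1) + j) * (f->nx + 1) + i];
}

Field field_alloc(int nx, int ny, int nz)
{
    Field f = { nx, ny, nz, NULL };
    f.v = calloc((size_t)(nx + 1) * (ny + 1) * nz, sizeof(double));
    return f;
}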
1. MPI initialization (§2.2)
→ Communication
Determination of the number of processes and their ID numbers
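By way of illustration, here is a minimal C sketch of this initialization stage, assuming a 2D Cartesian decomposition as in §3.1 (all variable names are ours, not the authors'): MPI is initialized, the number of processes and the local rank are obtained, the Cartesian topology is created, and each process retrieves its coordinates and its neighbours.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, rank;
    int dims[2] = {0, 0};          /* let MPI choose the Nx x Ny split   */
    int periods[2] = {0, 0};       /* non-periodic FDTD volume           */
    int coords[2], west, east, south, north;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);                      /* 1. MPI initialization          */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);      /* number of processes            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* ID number of this process      */

    MPI_Dims_create(nprocs, 2, dims);            /* split nprocs into Nx * Ny      */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Cart_coords(cart, rank, 2, coords);      /* (i, j) of this subspace        */
    MPI_Cart_shift(cart, 0, 1, &west, &east);    /* neighbours along x             */
    MPI_Cart_shift(cart, 1, 1, &south, &north);  /* neighbours along y             */

    printf("process %d of %d at (%d, %d)\n", rank, nprocs, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}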
The efficiency is presented in Figure 10 as a function of the number of processors. This figure shows that decomposition along the y axis (1 × Nprocs) is more efficient than decomposition along the x axis (Nprocs × 1). This is due to the continuity of the data in memory along the x axis; as a consequence, the memory is accessed more quickly.
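The same memory-contiguity argument motivates the "communication by matrix" of Figures 9 and 10: with an MPI derived data type, a complete face of a field array can be grouped into a single message even though its elements are not contiguous in memory. The following C fragment is a minimal sketch of that idea under our own assumptions (a component stored x-fastest as ez[(k*ny + j)*nx + i], with the column i = 0 used as a ghost layer); it is an illustration, not the authors' routine.

#include <mpi.h>

/* Exchange one x-face of Ez inside the Cartesian communicator 'cart':
   send the last column (i = nx-1) to the east neighbour and receive the
   west neighbour's column into the ghost column i = 0.                   */
void exchange_x_face(double *ez, int nx, int ny, int nz,
                     int west, int east, MPI_Comm cart)
{
    /* A constant-x face is ny*nz doubles separated by a constant stride nx,
       so one derived data type describes the whole face as a single message. */
    MPI_Datatype face;
    MPI_Type_vector(ny * nz, 1, nx, MPI_DOUBLE, &face);
    MPI_Type_commit(&face);

    /* At the outer boundary MPI_Cart_shift returns MPI_PROC_NULL and the
       corresponding send or receive silently does nothing.                   */
    MPI_Sendrecv(&ez[nx - 1], 1, face, east, 0,
                 &ez[0],      1, face, west, 0,
                 cart, MPI_STATUS_IGNORE);

    MPI_Type_free(&face);
}

Presumably, the "communication by vector" of the figures corresponds to sending such a face piece by piece, which is exactly what grouping with the derived data type avoids.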
Figure 10. The efficiency as a function of the processor number, for a volume of 60 × 60 × 50 cells. (Plot: efficiency versus the number of processors, Nx × Ny, for communication by vector and communication by matrix.)

The scalability is the ratio between the execution time on one process (T1) and the execution time on n processes (Tn). This parameter shows directly the gain in time with n processes. The efficiency is the ratio of the scalability to the process number.

In Figure 9, the scalability of the algorithm clearly showed the advantage of matrix communication, notably for small subspaces (more than 10 processors, or subspaces smaller than 60 × 6 × 50). For eight processors, the aforementioned anomaly appeared clearly, due to the unequal process loads.

... current computations, etc. Among these additional tasks, some take a negligible amount of time, and others take longer. This fact should be taken into account in the distribution of the cells among the different subspaces, in order to preserve load balancing. This is particularly true for the PMLs. On the other hand, many of the additional tasks do not need communication instructions, except for the post-processing stage, where the data must sometimes be grouped in a master processor. The far-field computation is an example of post-communication. For the time iterations, each process calculates the current sources on the Huygens surface and their contribution to the far field. After the time stepping has been completed, the contribution of all processes to the far-field values is communicated to a unique process. This last process reconstructs the far-field values and then stores them.

Two cases may be considered:

1. The task corresponds to a spatial point (field storage, local excitation, ...). In this case, only one process is concerned, and the parallelism consists in transmitting the coordinates and the treatment to this process.

2. The task extends over several subspaces. Each process thus executes its own portion of the task. It must then verify whether communication is required with other processes.
Consider, for example, the voltage calculation between two points, A and B. This is realized by the summation of the electric field along a line joining the points A and B. Each process needs to know if its subspace contains a portion of the summation line. Then, it needs to know the two limit points within its own subspace. Finally, after the time stepping, the processes concerned in the voltage calculation communicate their values to a master process, which is in charge of calculating the final voltage and storing it.
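A natural way to implement this last step is a reduction onto the master process; the C sketch below is our own illustration, not the authors' code. Each process accumulates the line integral of the electric field over the portion of the A-B line lying in its subspace (zero if the line does not cross it), and a single MPI_Reduce then sums these partial voltages on the master rank.

#include <mpi.h>

/* partial_voltage: line integral of E over the part of the A-B line that
   lies inside this process's subspace (0.0 if the line does not cross it).
   The summed result is only meaningful on the master rank.                */
double gather_voltage(double partial_voltage, int master, MPI_Comm cart)
{
    double v_ab = 0.0;
    MPI_Reduce(&partial_voltage, &v_ab, 1, MPI_DOUBLE, MPI_SUM, master, cart);
    return v_ab;   /* valid on 'master' only; the other ranks get 0.0 */
}

The far-field post-processing described above can be grouped onto a unique process in the same way, with one reduction (or gather) per stored quantity.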
(Figure: the subspaces and the surrounding PML regions; labels include "PML Regions" and "Process 4.")
... without modification in the PML regions. Only four additional components are needed in each PML region. Besides, no additional communications are required with the GUEHPMLs. The parallelism consists of balancing the number of cells between the processes containing a part of the PMLs and those processes not containing them.

Let us consider a factor $F_a$ such that

$$F_a = \frac{T_{cellule\,PML}}{T_{cellule\,utile}},$$

where $T_{cellule\,PML}$ is the calculation time of a cell in the PML, and $T_{cellule\,utile}$ is the calculation time of a cell in the inner region. The factor $F_a$ permits an estimate of the number of cells in each subspace. For example, in Figure 11 we divided a space into nine subspaces. It is then easy to determine the number of cells along the x axis for processes 2, 5, and 7, and along the y axis for processes 4, 5, and 6. Although the four processes in the corners (1, 3, 7, 9) seem to have smaller subspaces, their load is close to that of the other processes, because the PML corner zones have twice as many additional instructions as those in the PML edge regions.

To demonstrate the efficiency of the parallel FDTD algorithm including the GUEHPMLs, we used nine (3 × 3) and 16 (4 × 4) processors on the Cray T3E. The computation was performed in a volume of 150 × 150 × 50 cells, surrounded by 10 PML layers. The number of time iterations was limited to 100. Figure 12 shows that the efficiency was greater than 85% in both cases (nine and 16 processors). In Figure 13, the computational time is compared among the Cray T3E (300 MHz) and two recent PCs (400 MHz and 800 MHz). The PC800 was equivalent to only two processors of the Cray T3E. The nine processors were equivalent to an ideal operating number of eight processors, while the 16 processors were equivalent to an ideal operating number of 14 processors, because the gains were 8 and 14, respectively.

4. Conclusion

We have presented a parallel FDTD algorithm, which may be easily implemented with the MPI library. The parallel computation of the E-H components has been explained step by step, and the MPI instructions have been given for a two-dimensional topology of processes. Moreover, the communication algorithm was optimized with the help of the "derived data type," which permits the data to be grouped. The performance was shown for one-dimensional and two-dimensional topologies. From 20 cells per direction upward, the efficiency was greater than 90%, and it increased with the size of the subspaces. The approach for the parallel implementation of the auxiliary tasks was also described. It appears that the parallelism consists mainly in distributing the data and the specific treatments to each process. For most of our FDTD simulations, the computational time was reduced to less than half an hour. Finally, we have shown that the PMLs can be treated in an efficient parallel algorithm if a non-split field formulation is used.
Professor Emeritus Warren L. Flock died March 4, 2001, in Boulder, Colorado. He was 80. Born October 26, 1920, in Kellogg, Idaho, he was the son of Abraham L. Flock and Florence Ashby Flock. He married JoAnn Walton in Altadena, California, on July 20, 1957.

Dr. Flock received his BS in Electrical Engineering from the University of Washington in 1942; his MS in Electrical Engineering from the University of California, Berkeley, in 1948; and his PhD in Engineering from the University of California, Los Angeles, in 1960.

During WWII, he was a member of the staff of the MIT Radiation Laboratory. He was a Lecturer and Associate Engineer at UCLA from 1950 to 1960. He was Associate Professor of Geophysics from 1960 to 1962, and Professor of Geophysics at the Geophysical Institute of the University of Alaska from 1962 to 1964. He joined the University of Colorado as Professor of Electrical Engineering in 1964, and became Professor Emeritus in 1987.

He served as a consultant to the Rand Corporation; a Consultant in Residence at the Jicamarca Radar Observatory in Lima, Peru; a member of the technical staff of the Caltech Jet Propulsion Laboratory; and a visiting scientist at the Geophysical Observatory, DSIR, in Christchurch, New Zealand.

He was the author of books, many journal articles, and a famous NASA handbook on Earth-space radio wave propagation. His studies in radar ornithology aided the understanding of radar blackouts on the DEW line in Alaska and Canada, and of air-traffic control and bird-migration patterns. The citation for his IEEE Fellow election read "For contributions in the application of radar echoes for avoiding aircraft-bird collisions."

He was a Fellow of the Explorers Club, and did indeed explore the world, from Antarctica to Wrangel Island in the East Siberian Sea with his wife, JoAnn, who has a PhD in botany. He remained physically and professionally active up until two years prior to his death.

Ernest K. Smith
Campus Box 425
Electrical & Computer Engineering Department
University of Colorado
Boulder, CO 80309 USA
Tel: +1 (303) 492-7123
Fax: +1 (303) 492-2578
E-mail: Smithek@boulder.colorado.edu