A Parallel FDTD Algorithm Using the MPI Library
John L. Volakis
Rad. Lab., EECS Dept.
University of Michigan
Ann Arbor, MI 48109-2122
(734) 647-1797
(734) 647-2106 (Fax)
volakis@umich.edu (e-mail)

David B. Davidson
Dept. E&E Engineering
University of Stellenbosch
Stellenbosch 7600, South Africa
(+27) 21 808 4458
(+27) 21 808 4981 (Fax)
davidson@firga.sun.ac.za (e-mail)
Readers might like to refer to the July, 1998, ACES Journal "Special Issue on Computational Electromagnetics and High Performance Computing," edited by David Davidson and Tom Cwik. We thank the authors for their contribution, which continues the particularly international flavor of recent columns!
Abstract
In this paper, we describe the essential elements of a parallel algorithm for the FDTD method using the MPI (Message Passing Interface) library. To simplify and accelerate the algorithm, an MPI Cartesian 2D topology is used. The inter-process
communications are optimized by the use of derived data types. A general approach is also explained for parallelizing the
auxiliary tools, such as far-field computation, thin-wire treatment, etc. For PMLs, we have used a new method that makes it
unnecessary to split the field components. This considerably simplifies the computer programming, and is compatible with the
parallel algorithm.
Key words: Parallel algorithms; MPI; FDTD methods; inter-process communication; process topology; multiprocessor
interconnection; perfectly matched layers
$$\nabla \times \mathbf{E} = -\mu \, \frac{\partial \mathbf{H}}{\partial t}$$
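In the Yee scheme on which the FDTD method is built, this curl equation becomes an explicit update of H from the surrounding E components, which each process applies independently inside its own subspace. As a reference point, a standard discretization of its x component (the half-integer indices and the symbols $\Delta t$, $\Delta y$, $\Delta z$ are the usual Yee notation, assumed here rather than taken from the paper) is

$$H_x^{n+1/2}\!\left(i,\,j+\tfrac{1}{2},\,k+\tfrac{1}{2}\right) = H_x^{n-1/2}\!\left(i,\,j+\tfrac{1}{2},\,k+\tfrac{1}{2}\right) - \frac{\Delta t}{\mu}\left[\frac{E_z^{\,n}\!\left(i,\,j+1,\,k+\tfrac{1}{2}\right) - E_z^{\,n}\!\left(i,\,j,\,k+\tfrac{1}{2}\right)}{\Delta y} - \frac{E_y^{\,n}\!\left(i,\,j+\tfrac{1}{2},\,k+1\right) - E_y^{\,n}\!\left(i,\,j+\tfrac{1}{2},\,k\right)}{\Delta z}\right]$$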
(Figure: the memory of process P0 and the memory of process P1.)
3.1 Creation of Cartesian Topology
Each process obtains its own data, which can be different in each case.
$H_i(1{:}nx,\,1{:}ny,\,1{:}nz) \rightarrow H_i(0{:}nx,\,0{:}ny,\,1{:}nz)$, with $i = [x, y, z]$
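Concretely, this extension simply adds a ghost layer at index 0 in the x and y directions, into which the values computed by the neighbouring process are received. A minimal C sketch of such a storage scheme (the struct, the accessor, and the x-fastest ordering are our assumptions, not the authors' code):

#include <stdlib.h>

/* One H component extended by a ghost layer in x and y:
   valid indices i = 0..nx, j = 0..ny, k = 1..nz, as in H_i(0:nx, 0:ny, 1:nz). */
typedef struct { int nx, ny, nz; double *v; } Field;

/* x-fastest ordering, so the data along x are contiguous in memory */
static inline double *Hval(Field *f, int i, int j, int k)
{
    return &f->v[((size_t)(k - 1) * (f->ny + 1) + j) * (f->nx + 1) + i];
}

Field field_alloc(int nx, int ny, int nz)
{
    Field f = { nx, ny, nz, NULL };
    f.v = calloc((size_t)(nx + 1) * (ny + 1) * nz, sizeof(double));
    return f;
}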
1. MPI initialization (§2.2)
→ Communication
Determination of the number of processes and their ID numbers
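By way of illustration, here is a minimal C sketch of this initialization stage, assuming a 2D Cartesian decomposition as in §3.1 (all variable names are ours, not the authors'): MPI is initialized, the number of processes and the local rank are obtained, the Cartesian topology is created, and each process retrieves its coordinates and its neighbours.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, rank;
    int dims[2] = {0, 0};          /* let MPI choose the Nx x Ny split   */
    int periods[2] = {0, 0};       /* non-periodic FDTD volume           */
    int coords[2], west, east, south, north;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);                      /* 1. MPI initialization          */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);      /* number of processes            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* ID number of this process      */

    MPI_Dims_create(nprocs, 2, dims);            /* split nprocs into Nx * Ny      */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Cart_coords(cart, rank, 2, coords);      /* (i, j) of this subspace        */
    MPI_Cart_shift(cart, 0, 1, &west, &east);    /* neighbours along x             */
    MPI_Cart_shift(cart, 1, 1, &south, &north);  /* neighbours along y             */

    printf("process %d of %d at (%d, %d)\n", rank, nprocs, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}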
The efficiency is presented in Figure 10 as a function of the number of processors. This figure shows that decomposition along the y axis (1 × Nprocs) is more efficient than decomposition along the x axis (Nprocs × 1). This is due to the continuity of the data in memory along the x axis; as a consequence, the memory is accessed more quickly.
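The same memory-contiguity argument motivates the "communication by matrix" of Figures 9 and 10: with an MPI derived data type, a complete face of a field array can be grouped into a single message even though its elements are not contiguous in memory. The following C fragment is a minimal sketch of that idea under our own assumptions (a component stored x-fastest as ez[(k*ny + j)*nx + i], with the column i = 0 used as a ghost layer); it is an illustration, not the authors' routine.

#include <mpi.h>

/* Exchange one x-face of Ez inside the Cartesian communicator 'cart':
   send the last column (i = nx-1) to the east neighbour and receive the
   west neighbour's column into the ghost column i = 0.                   */
void exchange_x_face(double *ez, int nx, int ny, int nz,
                     int west, int east, MPI_Comm cart)
{
    /* A constant-x face is ny*nz doubles separated by a constant stride nx,
       so one derived data type describes the whole face as a single message. */
    MPI_Datatype face;
    MPI_Type_vector(ny * nz, 1, nx, MPI_DOUBLE, &face);
    MPI_Type_commit(&face);

    /* At the outer boundary MPI_Cart_shift returns MPI_PROC_NULL and the
       corresponding send or receive silently does nothing.                   */
    MPI_Sendrecv(&ez[nx - 1], 1, face, east, 0,
                 &ez[0],      1, face, west, 0,
                 cart, MPI_STATUS_IGNORE);

    MPI_Type_free(&face);
}

Presumably, the "communication by vector" of the figures corresponds to sending such a face piece by piece, which is exactly what grouping with the derived data type avoids.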
Figure 10. The efficiency as a function of the processor number, for a volume of 60 × 60 × 50 cells. (Plot: efficiency versus the number of processors, Nx × Ny, for communication by vector and communication by matrix.)

The scalability is the ratio between the execution time on one process (T1) and the execution time on n processes (Tn). This parameter shows directly the gain in time with n processes. The efficiency is the ratio of the scalability to the process number.

In Figure 9, the scalability of the algorithm clearly showed the advantage of matrix communication, notably for small subspaces (more than 10 processors, or subspaces smaller than 60 × 6 × 50). For eight processors, the aforementioned anomaly appeared clearly, due to the unequal process loads.

... current computations, etc. Among these additional tasks, some take a negligible amount of time, and others take longer. This fact should be taken into account in the distribution of the cells among the different subspaces, in order to preserve load balancing. This is particularly true for the PMLs. On the other hand, many of the additional tasks do not need communication instructions, except for the post-processing stage, where the data must sometimes be grouped in a master processor. The far-field computation is an example of post-communication. For the time iterations, each process calculates the current sources on the Huygens surface and their contribution to the far field. After the time stepping has been completed, the contribution of all processes to the far-field values is communicated to a unique process. This last process reconstructs the far-field values and then stores them.

Two cases may be considered:

1. The task corresponds to a spatial point (field storage, local excitation, ...). In this case, only one process is concerned, and the parallelism consists in transmitting the coordinates and the treatment to this process.

2. The task extends over several subspaces. Each process thus executes its own portion of the task. It must then verify whether communication is required with other processes.
Consider, for example, the voltage calculation between two points, A and B. This is realized by the summation of the electric field along a line joining the points A and B. Each process needs to know if its subspace contains a portion of the summation line. Then, it needs to know the two limit points within its own subspace. Finally, after the time stepping, the processes concerned in the voltage calculation communicate their values to a master process, which is in charge of calculating the final voltage and storing it.
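A natural way to implement this last step is a reduction onto the master process; the C sketch below is our own illustration, not the authors' code. Each process accumulates the line integral of the electric field over the portion of the A-B line lying in its subspace (zero if the line does not cross it), and a single MPI_Reduce then sums these partial voltages on the master rank.

#include <mpi.h>

/* partial_voltage: line integral of E over the part of the A-B line that
   lies inside this process's subspace (0.0 if the line does not cross it).
   The summed result is only meaningful on the master rank.                */
double gather_voltage(double partial_voltage, int master, MPI_Comm cart)
{
    double v_ab = 0.0;
    MPI_Reduce(&partial_voltage, &v_ab, 1, MPI_DOUBLE, MPI_SUM, master, cart);
    return v_ab;   /* valid on 'master' only; the other ranks get 0.0 */
}

The far-field post-processing described above can be grouped onto a unique process in the same way, with one reduction (or gather) per stored quantity.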
(Figure: the subspaces and the surrounding PML regions; labels include "PML Regions" and "Process 4.")
... without modification in the PML regions. Only four additional components are needed in each PML region. Besides, no additional communications are required with the GUEHPMLs. The parallelism consists of balancing the number of cells between the processes containing a part of the PMLs and those processes not containing them.

Let us consider a factor $F_a$ such that

$$F_a = \frac{T_{cellule\,PML}}{T_{cellule\,utile}},$$

where $T_{cellule\,PML}$ is the calculation time of a cell in the PML, and $T_{cellule\,utile}$ is the calculation time of a cell in the inner region. The factor $F_a$ permits an estimate of the number of cells in each subspace. For example, in Figure 11 we divided a space into nine subspaces. It is then easy to determine the number of cells along the x axis for processes 2, 5, and 7, and along the y axis for processes 4, 5, and 6. Although the four processes in the corners (1, 3, 7, 9) seem to have smaller subspaces, their load is close to that of the other processes, because the PML corner zones have twice as many additional instructions as those in the PML edge regions.

To demonstrate the efficiency of the parallel FDTD algorithm including the GUEHPMLs, we used nine (3 × 3) and 16 (4 × 4) processors on the Cray T3E. The computation was performed in a volume of 150 × 150 × 50 cells, surrounded by 10 PML layers. The number of time iterations was limited to 100. Figure 12 shows that the efficiency was greater than 85% in both cases (nine and 16 processors). In Figure 13, the computational time is compared among the Cray T3E (300 MHz) and two recent PCs (400 MHz and 800 MHz). The PC800 was equivalent to only two processors of the Cray T3E. The nine processors were equivalent to an ideal operating number of eight processors, while the 16 processors were equivalent to an ideal operating number of 14 processors, because the gains were 8 and 14, respectively.

4. Conclusion

We have presented a parallel FDTD algorithm, which may be easily implemented with the MPI library. The parallel computation of the E-H components has been explained step by step, and the MPI instructions have been given for a two-dimensional topology of processes. Moreover, the communication algorithm was optimized with the help of the "derived data type," which permits the data to be grouped. The performance was shown for one-dimensional and two-dimensional topologies. From 20 cells per direction upward, the efficiency was greater than 90%, and it increased with the size of the subspaces. The approach for the parallel implementation of the auxiliary tasks was also described. It appears that the parallelism consists mainly in distributing the data and the specific treatments to each process. For most of our FDTD simulations, the computational time was reduced to less than half an hour. Finally, we have shown that the PMLs can be treated in an efficient parallel algorithm if a non-split field formulation is used.
Professor Emeritus Warren L. Flock died March 4, 2001, in Boulder, Colorado. He was 80. Born October 26, 1920, in Kellogg, Idaho, he was the son of Abraham L. Flock and Florence Ashby Flock. He married JoAnn Walton in Altadena, California, on July 20, 1957.

Dr. Flock received his BS in Electrical Engineering from the University of Washington in 1942; his MS in Electrical Engineering from the University of California, Berkeley, in 1948; and his PhD in Engineering from the University of California, Los Angeles, in 1960.

During WWII, he was a member of the staff of the MIT Radiation Laboratory. He was a Lecturer and Associate Engineer at UCLA from 1950 to 1960. He was Associate Professor of Geophysics from 1960 to 1962, and Professor of Geophysics at the Geophysical Institute of the University of Alaska from 1962 to 1964. He joined the University of Colorado as Professor of Electrical Engineering in 1964, and became Professor Emeritus in 1987.

He served as a consultant to the Rand Corporation; a Consultant in Residence at the Jicamarca Radar Observatory in Lima, Peru; a member of the technical staff of the Caltech Jet Propulsion Laboratory; and a visiting scientist at the Geophysical Observatory, DSIR, in Christchurch, New Zealand.

He was the author of books, many journal articles, and a famous NASA handbook on Earth-space radio wave propagation. His studies in radar ornithology aided the understanding of radar blackouts on the DEW line in Alaska and Canada, and of air-traffic control and bird-migration patterns. The citation for his IEEE Fellow election read "For contributions in the application of radar echoes for avoiding aircraft-bird collisions."

He was a Fellow of the Explorers Club, and did indeed explore the world, from Antarctica to Wrangel Island in the East Siberian Sea with his wife, JoAnn, who has a PhD in botany. He remained physically and professionally active up until two years prior to his death.

Ernest K. Smith
Campus Box 425
Electrical & Computer Engineering Department
University of Colorado
Boulder, CO 80309 USA
Tel: +1 (303) 492-7123
Fax: +1 (303) 492-2578
E-mail: Smithek@boulder.colorado.edu