Time-Domain Simulation of Large Electric Power Systems Using Domain-Decomposition and Parallel Processing Methods
Institut Montefiore
Département d’Electricité, Electronique et Informatique
Petros Aristidou
Contents

Acknowledgments
Abstract
Nomenclature

1 Introduction
1.1 Motivation
1.2 Power system modeling
1.2.1 Model overview
1.2.2 Numerical integration methods
1.2.3 Time-step selection
1.2.4 Treatment of discrete events
1.2.5 Dealing with algebraic and differential equations
1.2.6 System reference frame
1.3 Description of power system models used in this work
1.3.1 Nordic system
1.3.2 Hydro-Québec system
1.3.3 PEGASE system
1.4 Thesis objective
1.5 Thesis outline

2 Think parallel
2.1 The motivation for multi-core processors
2.2 Types of parallelism
2.2.1 Algorithm-level parallelism
2.2.2 Data-level and task-level parallelism
2.2.3 Instructional parallelism

Appendices
Bibliography
Acknowledgments
I started writing this manuscript with the “simple” purpose of digesting the work of four and
a half years (the duration of my doctoral studies) into approximately 200 pages. Almost
immediately, I realized that I was not the only contributor to this work. In the following few
paragraphs I will try to acknowledge the people that played a role, smaller or bigger, in making
this PhD a reality. I will undoubtedly forget or omit some of them, and I apologize and thank
them in advance.
First and foremost I offer my sincerest gratitude to my advisor, Professor Thierry Van
Cutsem. When I arrived in Liège in 2010, I had little knowledge of what research is. Over
the next years, he devoted much of his time to transfer to me his theoretical and practical
understanding of power systems, and to help me develop valuable skills as a researcher and
as an academic. He has always encouraged and trusted me to venture into new ideas, even
when those diverged from my original research plans. In addition to our academic interests,
we also shared a love for photography; on several occasions we would spend our spare time
exchanging moments captured on “film” and discussing some new photography equipment
or technique. Overall, he has been an exceptional mentor and an outstanding friend, on
whom I could always rely for help and guidance.
I wish to express my gratitude to each member of the examining committee, for devot-
ing their time to read this report. During my stay in Liège, Professors Patricia Rousseaux,
Christophe Geuzaine, and Damien Ernst have offered concrete support both as members
of my dissertation committee but also through our enriching discussions. A special thanks
to Professor Costas Vournas who was the first one to introduce me to the world of power
system dynamics and has supported and advised me over the years. I would also like to
thank Dr. Mevludin Glavic for the insightful discussions during our coffee breaks outside the
department entrance, but also for transferring his experience to help me make some impor-
tant decisions in my life. I wish to thank Professor Xavier Guillaud, whose support and feedback
on my work have been valuable for my PhD. Many thanks to Simon Lebeau (TransÉnergie
division, Hydro-Québec) and Patrick Panciatici (RTE) for providing valuable input to my work
based on real engineering problems. Upon arriving in Liège, I was welcomed by Professor
Mania Pavella, who showed great interest in my academic development and my general well-
being, and I’m thankful for that. I am also really grateful to the Bodossaki Foundation and Mr.
Sotiri Laganopoulo, for the invaluable support and guidance they offered.
Throughout the process, I have greatly benefited from the work and feedback of Professor
Van Cutsem’s former and current students and visitors. I would like to thank Dr. Davide
Fabozzi for his support during my first two years in Liège. He has proved to be a valuable
source of information (on technical and non-technical topics alike) and a dear friend. I tried
to match his enthusiasm and provide the same level of support to my junior PhD colleagues
Lampro Papangeli and Hamid Soleimani. Their collaboration and friendship made the last
two years more enjoyable. Special thanks to Frédéric Plumier, for being a close friend and a
great colleague. He patiently allowed me to “torture” his mother tongue so I could practice my
French, and welcomed me into his family as the godfather of Antoine; for that, I will always be
indebted. Also, I would like to thank Professor Gustavo Valverde, a superb researcher and,
most importantly, an exceptional friend from whom I have learned many lessons, and of course
his lovely wife, Rebecca. My thanks to all the researchers who briefly joined the Liège
group, amongst which, Benjamin Saive, Dr. Tilman Weckesser, Dr. Spyros Chatzivasileiadis,
and Theodoros Kyriakidis. Your support and friendship are cherished. I am also grateful to
Dr. Efthymios Karangelos and Panagiotis Andrianesis, for promptly offering their support and
advice whenever needed.
I cannot list everyone here, but I am grateful to all my friends, the ones from Cyprus and
Greece, as well as the people I befriended during these last years in Belgium. My thoughts
are with you, wherever you are.
The last thanks go to my family: my sister Maria, with whom I spent a great deal of time
consulting and listening to each other during our parallel PhD journeys; my sister Angela,
an inspiration and beacon for my academic ventures, and her husband Alex, for being my
unofficial academic advisors over the years, and for bringing to life Stella, a shining star of
joy and happiness in our family. A warm thank you to Dafni, for all her love and support that
made these years more enjoyable and gave me the strength to see this journey through.
The end of my doctoral studies in Belgium signals 12 years since I left my parents’ home
in Pissouri, Cyprus. Nevertheless, my parents Christos and Stella, have been by my side
every single day since then; through my days in the military service, my diploma studies in
Athens, and my doctoral studies in Liège. Their love and support (material and emotional)
have allowed me to constantly leap forward in new endeavors without any fear, as I always
know that “they have my back”. I also want to thank my beloved grandmother Angeliki for
her heart-warming discussions over the phone and the delicious food she prepared for me at
every opportunity.
Abstract

Dynamic simulation studies are used to analyze the behavior of power systems after a dis-
turbance has occurred. Over the last decades, they have become indispensable to anyone
involved in power system planning, control, operation, and security. Transmission system
operators depend on fast and accurate dynamic simulations to train their personnel, analyze
large sets of scenarios, assess the security of the network in real-time, and schedule the
day-ahead operation. In addition, those designing future power systems depend on dynamic
simulations to evaluate proposed reinforcements, whether these involve adding new trans-
mission lines, increasing renewable energy sources, or implementing new control schemes.
Even though almost all computers are now parallel, power system dynamic simulators
are still based on monolithic, circuit-based, single-process algorithms. This is mainly due to
legacy code, written in the 1980s, that still lies at the core of the most important commercial
tools and does not allow them to fully exploit the parallel computational resources of modern
computers.
In this thesis, two parallel algorithms belonging to the family of Domain Decomposition
Methods are developed to tackle the computational complexity of power system dynamic
simulations. The first proposed algorithm focuses on accelerating the dynamic simulation
of large interconnected systems, while the second aims at accelerating dynamic
simulations of large combined transmission and distribution systems.
Both proposed algorithms employ non-overlapping decomposition schemes to partition
the power system model and expose parallelism. Then, “divide-and-conquer” techniques
are utilized and adapted to exploit this parallelism. These algorithms allow the full usage
of parallel processing resources available in modern, inexpensive, multi-core machines to
accelerate the dynamic simulations. In addition, some numerical acceleration techniques
are proposed to further speed-up the parallel simulations with little or no impact on accuracy.
All the techniques proposed and developed in this thesis have been thoroughly tested on
academic systems, a large real-life system, and a realistic system representative of the conti-
nental European synchronous grid. The investigations were performed on a large multi-core
machine, set up for the needs of this work, as well as on two multi-core laptop computers.
Nomenclature
Abbreviations
ADN Active Distribution Network
API Application Programming Interface
ASRT Automatic Shunt Reactor Tripping
BDF Backward Differentiation Formulae
BEM Backward Euler Method
COI Center of Inertia
DAE Differential-Algebraic Equation
DCTL Discrete controller
DDM Domain Decomposition Method
DG Distributed Generator
DN Distribution Network
DNV Distribution Network Voltage
DSA Dynamic Security Assessment
EMA Exponential Moving Average
EMT ElectroMagnetic Transients
GPU Graphics Processing Unit
HPC High-Performance Computing
HQ Hydro-Québec
IN Inexact Newton
IVP Initial Value Problem
LTC Load Tap Changer
LVFRT Low Voltage and Fault Ride Through
NUMA Non-Uniform Memory Access
ODE Ordinary Differential Equation
OHC OverHead Cost
OXL OvereXcitation Limiters
PV PhotoVoltaic
Mathematical Symbols
D matrix that includes the real and imaginary parts of the bus admittance matrix
Γ diagonal matrix with (Γ)ℓℓ equal to 1 if the ℓ-th equation is differential and 0 if it is algebraic
ΓC , ΓSi projection of Γ on the Central or i-th Satellite sub-domain
Γi projection of Γ on the i-th injector sub-domain
Ai Jacobian matrix of injector i
Bi matrix of sensitivities of injectors equations to voltages in injector i
Ci trivial matrix with zeros and ones linking injector i currents to network equations
I sub-vector of x containing the bus currents
J Jacobian matrix of both network and injectors
V vector of rectangular components of bus voltages
x state vector containing the differential and algebraic variables
xC , xSi projection of x on the Central or i-th Satellite sub-domain
xi projection of x on the i-th injector sub-domain
L number of Satellite sub-domains
M number of parallel workers
NC , NSi number of injectors attached on the Central or i-th Satellite sub-domain network
T1∗ run-time of a program with one worker using the fastest (or a very fast) sequen-
tial algorithm
T1 the time an algorithm takes to run in sequential execution (M = 1)
T∞ the time an algorithm takes on an ideal machine with an infinite number of par-
allel workers (M = ∞)
TP percentage of time spent in parallel execution
TS percentage of time spent in sequential execution
CHAPTER 1

Introduction
1.1 Motivation
Dynamic simulations under the phasor approximation are routinely used throughout the world
for the purpose of checking the response of electric power systems to large disturbances.
Over the last decades, they have become indispensable to anyone involved in the planning,
design, operation, and security of power systems. Power system operators depend on fast
and accurate dynamic simulations to train their personnel, analyze large sets of scenarios, assess
the dynamic security of the network in real-time, or schedule the day ahead operation. On the
other hand, those designing future power systems depend on dynamic simulations to eval-
uate the proposed changes, whether these involve adding new transmission lines, increas-
ing renewable energy sources, implementing new control schemes, or decommissioning old
power plants.
Such simulations require solving a large set of nonlinear, stiff, hybrid, Differential-Alge-
braic Equations (DAEs) that describe the physical dynamic characteristics, interactions, and
control schemes of the system. A large interconnected transmission or a detailed transmis-
sion and distribution system may involve hundreds of thousands of such equations whose
dynamics span over very different time scales and undergo many discrete transitions im-
posed by limiters, switching devices, etc. Consequently, dynamic simulations are challenging
to perform, computationally intensive and can easily push any given computer to its limits.
In applications targeting the real-time monitoring and security of the system, for example
Dynamic Security Assessment (DSA), the speed of simulation is a critical factor. In the
remaining applications, speed is not critical but desired as it increases productivity. This is
the main reason why energy management systems often resort to faster, static simulations.
However, the operation of non-expandable grids closer to their stability limits and the
unplanned generation patterns stemming from renewable energy sources require dynamic
studies. Furthermore, under the pressure of electricity markets and with the support of active
demand response, it is likely that system security will be more and more guaranteed by
emergency controls responding to the disturbance. Thus, checking the sequence of events
that take place after the initiating disturbance is crucial; a task for which the static calculation
of the operating point in a guessed final configuration is inappropriate.
All modern computers are now parallel. Even the smallest ones, such as mobile phones,
offer at least one parallel feature, such as vector instructions, multithreaded cores, multi-
core processors, multiple processors, graphic processing units, or parallel co-processors.
The advent of parallel processing represents a revolutionary opportunity for power system
dynamic simulation software.
Unfortunately, the traditional approach to perform these simulations is based on mono-
lithic, circuit-based, single-process schemes. This is mainly due to legacy code, written in the
1980s, that still lies today at the heart of the most important commercial tools for power system
dynamic simulations. Many of these programs are serial not because it was natural to solve
the problem serially, but because the programming tools demanded it and the programmers
were trained to think that way. This approach hinders the simulation performance, decreasing
productivity and increasing the overall cost.
Among the commercial software, PowerWorld, Digsilent Power Factory, ETAP (Electrical
Transient Analyzer Program), PSLF, EuroStag, PSS/E, Simpow, and CYME are well-known
simulators. To our knowledge, none of these widely used tools offers multithreaded dynamic
simulations yet. This means that they run on only one core of one processor, and therefore
fall far short of fully utilizing the power of the new parallel computers. On the other hand,
existing commercial parallel simulators (such as Opal-RT ePHASORsim) are based on
specialized parallel computers. Finally, due to their closed source, commercial tools do not
provide full flexibility for experimentation and prototyping.
3. the analyzed dynamics have time constants of tenths to tens of seconds or equivalently
0.1 to 10 Hz.
0 = Ψ(x, V ) (1.1a)
Γẋ = Φ(x, V ) (1.1b)
x(t0 ) = x0 , V (t0 ) = V0 (1.1c)
where V is the vector of voltages through the network and x is the state vector containing the
remaining (except voltages) differential and algebraic variables of the system. Furthermore,
Γ is a diagonal matrix with:
$$(\Gamma)_{\ell\ell} = \begin{cases} 0 & \text{if the } \ell\text{-th equation is algebraic} \\ 1 & \text{if the } \ell\text{-th equation is differential} \end{cases} \qquad\qquad (1.2)$$
The algebraic Eq. 1.1a describes the network and can be rewritten as:
$$0 = D\,V - I \triangleq g(x, V) \qquad\qquad (1.3)$$
where D includes the real and imaginary parts of the bus admittance matrix and I is a
sub-vector of x containing the bus currents [Fab12].
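For illustration (this expansion is not given explicitly here, and the exact ordering of rows and columns used for D in this work may differ), one common rectangular-coordinate form of the i-th pair of network equations in (1.3) is:

$$I_{x,i} = \sum_{j}\left( G_{ij}\,V_{x,j} - B_{ij}\,V_{y,j} \right), \qquad I_{y,i} = \sum_{j}\left( B_{ij}\,V_{x,j} + G_{ij}\,V_{y,j} \right)$$

where $G_{ij}$ and $B_{ij}$ are the real and imaginary parts of the (i, j) element of the bus admittance matrix; stacking these relations over all buses yields the matrix D.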
The initial voltages V0 and currents I0 are usually obtained by performing a power-flow
computation of the static power system model or are received from a state estimation soft-
ware. Next, the remaining DAE states x0 are computed through an initialization procedure.
Equation 1.1b describes the remaining DAEs of the system including the dynamics of
generating units, their controls, dynamic loads, and other devices. Together these equations
form a complete mathematical model of the system, which can be solved numerically to
simulate the system behavior.
In power system transient stability simulations, the index-1 condition is assured in most practical
cases, except for some operating conditions close to voltage collapse [LB93]. The theory be-
hind numerical methods for general DAE systems is very complex. Fortunately, semi-explicit
index-1 systems can be directly discretized using classical numerical integration methods
for ODEs, while adapting the solution algorithm to take into account the algebraic con-
straints [BCP95]. These integration methods fall into two general categories: explicit or
implicit, the latter of which will be detailed in the sequel.
Power system dynamic simulation models involve stiff DAEs. A stiff problem is one in
which the underlying physical process contains components evolving on widely separated
time scales [BCP95], or equivalently the dynamics of some part of the process are very fast
compared to the interval over which the whole process is studied. In linear systems, stiffness
is measured by the ratio of the largest to the smallest eigenvalue. Practically, this feature of
system (1.1) requires integration methods with properties such as A-stability and stiff decay
(or L-stability) [BCP95, MBB08].
Ideally, a numerical integration method should mimic all properties of the differential problem,
for all types of problems. However, this is not possible. Thus, the integration method used
should capture at least the essential properties of a class of problems. In this section, the
discussion is concentrated on implicit integration methods because they can be effectively
used to solve stiff DAEs [BCP95].
Let us consider the following ODE:
ẏ = c (t, y) (1.4)
A general implicit algebraization formula applied to (1.4) takes the form:

$$y_{k+1} = h\,\beta_0\,c(y_{k+1}) + \beta_k \qquad\qquad (1.5)$$

where h is the integration step length, $\beta_0$ is a coefficient that depends on the actual integration
method, $c(y_{k+1})$ is the right-hand side of the differential equation calculated at the value $y_{k+1}$,
and:

$$\beta_k = y_k + \sum_{j} a_j\, c(y_{k+1-j}) \qquad\qquad (1.6)$$

gathers quantities known from previous time steps. The stability properties of such methods
are commonly studied on the scalar test equation:
ẏ = λy (1.7)
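As an added illustration of the properties discussed below, applying the Trapezoidal method and the Backward Euler Method (introduced in the sequel) to (1.7) yields the amplification factors:

$$\left.\frac{y_{k+1}}{y_k}\right|_{\mathrm{TRAP}} = \frac{1 + h\lambda/2}{1 - h\lambda/2} \;\longrightarrow\; -1, \qquad \left.\frac{y_{k+1}}{y_k}\right|_{\mathrm{BEM}} = \frac{1}{1 - h\lambda} \;\longrightarrow\; 0 \qquad \text{as } h\,\mathrm{Re}(\lambda) \to -\infty$$

so the former sustains step-to-step sign changes on very stiff components, while the latter damps them; this is the stiff decay property discussed next.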
[Figure 1.1: stability regions of (a) the Trapezoidal method and (b) the BDFs of orders 1-4]
Stiff decay is defined with respect to the test problem $\dot{y} = \lambda\,(y - c(t))$, where c(t) is a bounded function. Assuming that the system is stable, an integration method has stiff decay if, for a given $t_k > 0$:

$$|y_k - c(t_k)| \to 0 \quad \text{as} \quad h\,\mathrm{Re}(\lambda) \to -\infty$$
The practical advantage of stiff decay methods lies in their ability to skip fine-level (or
fast varying) dynamics and still maintain a good description of the solution on a coarse level
[BCP95]. Thus, the choice of the time-step size is not dictated strictly by accuracy, but
by what should be retained in the final response [Fab12]. Conversely, integration methods
without stiff decay (such as the Trapezoidal method) need to be used with small enough time
steps even if only the coarse behavior of the solution is sought, otherwise errors propagate
and numerical oscillations occur [GSD+ 03].
The simplest BDF method is the Backward Euler Method (BEM), formulated as:

$$y_{k+1} = y_k + h\,c(y_{k+1}) \qquad\qquad (1.11)$$
A very important feature of BEM is that it allows integrating over a discontinuity. In the
corresponding Eq. 1.11, if the discontinuity takes place at time tk , only the derivative at time
tk+1 is used to compute yk+1 . Therefore, this scheme can be used for the time step that
immediately follows a discontinuity.
Another well-known member of this family is the second-order BDF (BDF-2), formulated as:

$$y_{k+1} = \frac{4}{3}\,y_k - \frac{1}{3}\,y_{k-1} + \frac{2}{3}\,h\,c(y_{k+1}) \qquad\qquad (1.12)$$
As seen in Fig. 1.1b, BDFs of order larger than two are not A-stable. Moreover, all BDFs
have a region of hyper-stability (although it decreases when the order increases).
Further analysis on the properties and applications of ODE methods can be found in
dedicated references, such as [Gea71, BCP95, HW96]. More specific and recent develop-
ments on numerical integration methods used in power system applications can be found in
[SCM08, MBB08, Mil10, ES13]. The integration formula of choice in this thesis is the sec-
ond-order BDF, both for the benchmark and the proposed methods. This formula is initialized
by BEM, which is also used in case of discontinuities.
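As a minimal, self-contained illustration of this scheme (BDF-2 initialized by BEM), the Fortran sketch below integrates the stiff scalar problem ẏ = λ(y − cos t) with a step size fifty times larger than the fast time constant 1/|λ|; since the problem is linear, each implicit step is solved in closed form instead of with the Newton iterations required by the full system (1.1). Names and values are illustrative and unrelated to RAMSES.

```fortran
program bdf2_demo
   implicit none
   integer,  parameter :: dp     = kind(1.0d0)
   real(dp), parameter :: lambda = -1000.0_dp   ! stiff eigenvalue (1 ms time constant)
   real(dp), parameter :: h      = 0.05_dp      ! step size, much larger than 1/|lambda|
   integer,  parameter :: nsteps = 40
   real(dp) :: y_km1, y_k, y_kp1, t
   integer  :: k

   t   = 0.0_dp
   y_k = 1.0_dp                                 ! start on the slow solution y ~ cos(t)

   ! First step with BEM (Eq. 1.11): y_{k+1} = y_k + h*lambda*(y_{k+1} - cos(t_{k+1}))
   t     = t + h
   y_kp1 = (y_k - h*lambda*cos(t)) / (1.0_dp - h*lambda)
   y_km1 = y_k
   y_k   = y_kp1

   ! Subsequent steps with BDF-2 (Eq. 1.12)
   do k = 2, nsteps
      t     = t + h
      y_kp1 = (4.0_dp*y_k - y_km1 - 2.0_dp*h*lambda*cos(t)) / (3.0_dp - 2.0_dp*h*lambda)
      y_km1 = y_k
      y_k   = y_kp1
      print '(a, f7.3, a, f10.6, a, f10.6)', ' t = ', t, '   y = ', y_k, '   cos(t) = ', cos(t)
   end do
end program bdf2_demo
```

Despite the large step, the computed y tracks the slow component cos(t), which is precisely the stiff decay behavior exploited in this work.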
A first strategy relies on an estimate of the Local Truncation Error (LTE) to decide whether to
accept the results of the current step or to redo it with a smaller step size. Alternatively, the
estimate can be used to select the next time-step size to be as big as possible within the LTE tolerance.
The second strategy consists in varying the time-step size according to the estimated
computational effort, based on the assumption that the computational effort varies with the
step size. This can be done either by relying on the magnitude of the mismatch vector
[FV09], or on the number of Newton iterations [Fab12] (an approach originally proposed in
[Sto79]), for error estimation. In this work, the latter is followed.
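A minimal sketch of this second strategy, driven by the Newton iteration count of the step just completed, is given below; the thresholds and scaling factors are illustrative assumptions and not the actual tuning used in this work.

```fortran
! Hypothetical helper: adapt the next step size from the Newton iteration
! count of the step that was just completed.
pure function next_step_size(h, n_iter, converged, h_min, h_max) result(h_new)
   implicit none
   integer,  parameter  :: dp = kind(1.0d0)
   real(dp), intent(in) :: h, h_min, h_max
   integer,  intent(in) :: n_iter
   logical,  intent(in) :: converged
   real(dp) :: h_new

   if (.not. converged) then
      h_new = max(0.5_dp*h, h_min)   ! step rejected: retry with a smaller step
   else if (n_iter <= 3) then
      h_new = min(1.5_dp*h, h_max)   ! easy convergence: enlarge the next step
   else if (n_iter >= 8) then
      h_new = max(0.7_dp*h, h_min)   ! laborious convergence: shrink the next step
   else
      h_new = h                      ! keep the current step size
   end if
end function next_step_size
```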
Although variable time step algorithms allow increasing the step size (provided that the
error estimation remains within the desired bounds), they do so assuming a continuous,
non-discrete, system of DAEs. However, power system models are hybrid systems1 : during
the simulation they undergo discrete transitions. The occurrence of these discrete changes
should be factored into the selection of the time-step size, as discussed in the following section.
The hybrid nature of power system models [HP00] makes them more problematic to handle
as the discrete changes force discontinuities in Eqs. 1.1. In the general case, whether the
solver uses constant or variable step size, those instants where discrete variables change
do not coincide with a simulation time step. From a strict mathematical viewpoint, it is not
correct to integrate over these discontinuities; in fact, all numerical integration methods are
based on polynomial expansions which cannot reproduce discontinuities [Fab12].
One option is to identify the time t∗ of the discrete event (e.g. using zero-crossing function
analysis) and adjust the step size to coincide with the upcoming event. Next, Eqs. 1.1 are
updated and the resulting discontinuity is solved appropriately. Then, the solution proceeds
with the next time step [Cel06]. This mechanism leads to accurate simulations but imposes a limit
on the time-step size allowed and a reduction of the average step size. Moreover, it increases
the computational burden for the identification of discrete event timings (e.g. solution of zero-
crossing functions).
On the other hand, if the step size is not reduced, the discrete event and the update of
equations are shifted in time. This may cause variables to converge to wrong values, or even
not converge, if some modeling and solving precautions are not taken. An appropriate ex post
treatment of these shifted discrete events has been proposed in [FCPV11]. In brief, the time
step t → t + h is computed and the solution is checked for any occurring discrete transitions
within this interval. If it is detected that a discrete change has occurred, Eqs. 1.1 are updated
accordingly and the same time step is computed again. This procedure is repeated until no
discrete transitions occur any longer or a maximum number of jumps is reached. Then, the
simulation proceeds with a new step. This approach is followed in this work.
1 The term “hybrid system” should not be confused with hybrid transient simulation, which refers to tools that
integrate EMT and phasor-mode models together.
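The ex post treatment summarized above can be sketched as follows for one time step; the routines solve_step, detect_transitions and apply_transitions are hypothetical placeholders standing for the DAE solution and event-handling machinery, not actual RAMSES routines.

```fortran
subroutine advance_one_step(t, h, x, v, max_jumps)
   ! One time step t -> t+h with ex post treatment of discrete events.
   implicit none
   integer,  parameter     :: dp = kind(1.0d0)
   real(dp), intent(inout) :: t, x(:), v(:)
   real(dp), intent(in)    :: h
   integer,  intent(in)    :: max_jumps
   logical :: changed
   integer :: jumps

   jumps = 0
   do
      call solve_step(t, h, x, v)              ! solve the algebraized DAEs over [t, t+h]
      call detect_transitions(x, v, changed)   ! any limiter/switch transition in this interval?
      if (.not. changed .or. jumps >= max_jumps) exit
      call apply_transitions(x, v)             ! update Eqs. (1.1) accordingly ...
      jumps = jumps + 1                        ! ... and recompute the same step
   end do
   t = t + h                                   ! accept the step and proceed
end subroutine advance_one_step
```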
[Figure 1.2: classification of DAE solution approaches into partitioned (alternating) and simultaneous (direct) schemes]
• the way that algebraic and differential equations are interfaced: partitioned or simulta-
neous
Thus, the solution approaches can be divided into two main categories [DS72, Sto79, MBB08,
FMT13] as shown in Fig. 1.2.
In the partitioned approach, the differential and the algebraic equations in (1.1) are solved
independently. At each step of the numerical integration procedure the partitioned scheme
alternates between the solutions of respectively the differential and the algebraic equations.
The latter are usually solved with a Newton method (see Appendix A) while the differential
equations are usually solved by functional iterations [MBB08]. However, the convergence of
this scheme significantly limits the allowed time-step size [BCP95].
The partitioned scheme is attractive for short-term simulations. It is flexible, easy to
organize and allows a number of simplifications to be introduced in order to speed up the
solution [MBB08]. For these reasons, quite a number of dynamic simulation programs are
still based on partitioned solution methods.
However, this alternating procedure introduces a “delay” between these variables. That is,
when computing the differential variables, the algebraic are “frozen” and vice-versa. There-
fore, unless care is taken, this process can lead to numerical instabilities or decreased per-
formance [Mil10].
Unlike the partitioned approach, the simultaneous one combines the algebraized differential
and the algebraic equations of (1.1) into one set of nonlinear, algebraic equations. This
approach is used in conjunction with implicit numerical methods that require, at each time
step, the solution of a set of nonlinear equations. The solution is generally obtained through
a Newton’s method (see Appendix A), which requires iteratively computing and factorizing a
Jacobian matrix [Mil10].
Using (1.5) to algebraize Eqs. 1.1, the following set of nonlinear algebraic equations is obtained at each time step:

$$0 = g(x_{k+1}, V_{k+1})$$
$$0 = f(x_{k+1}, V_{k+1}) \triangleq \Gamma\,x_{k+1} - h\,\beta_0\,\Phi(x_{k+1}, V_{k+1}) - \Gamma\,\beta_k$$

This set is solved with a Newton method (see Appendix A); the vector of values of f and g at the
current iterate, which drives the Newton corrections, is usually referred to as “mismatch vector”
(of the f and g equations, respectively). To ensure
that all equations are solved within the specified tolerance, the infinite norm (i.e. the largest
component magnitude) of each mismatch vector should be brought below some value. Thus,
the Newton iterations are stopped if:
$$\left\| g\!\left(x_{k+1}^{(l)},\, V_{k+1}^{(l)}\right) \right\|_{\infty} < \epsilon_g \qquad\qquad (1.16a)$$

$$\left\| f\!\left(x_{k+1}^{(l)},\, V_{k+1}^{(l)}\right) \right\|_{\infty} < \epsilon_f \qquad\qquad (1.16b)$$
In power system dynamic simulations it is usual that the network variables and parameters
are set in the per-unit system [MBB08]. That is, a base power $S_{base}$ is selected for the
entire system and, using the base voltage levels (usually the nominal voltage of a bus), all the
network parameters and variables are scaled accordingly. Thus, for the network equations
g, the choice of $\epsilon_g$ is rather easy. For the remaining equations, however, it may be difficult to
choose an appropriate $\epsilon_f$ value, as they include a variety of models and controls for which the
solver does not know whether a mismatch is “negligible” or not. This issue can be resolved
by checking the normalized corrections instead. Thus, Eq. 1.16b is replaced by:
$$\left| \Delta x_{k+1,i}^{(l)} \right| < \max\!\left( \epsilon_{f\,\mathrm{abs}},\; \epsilon_{f\,\mathrm{rel}}\, \big| x_{k+1,i}^{(l)} \big| \right), \qquad i = 1, \ldots, \dim(x) \qquad\qquad (1.17)$$
where the relative correction is checked against $\epsilon_{f\,\mathrm{rel}}$. Of course, when $x_{k+1}^{(l)}$ becomes small,
the absolute correction should be checked against $\epsilon_{f\,\mathrm{abs}}$ to avoid numerical exceptions. The
price to pay is the computation of $\Delta x_{k+1}^{(l)}$, which requires making at least one iteration before
deciding on the convergence [Fab12, FCHV13].
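A simplified sketch of this stopping test, combining Eq. 1.16a for the network equations with the normalized correction test of Eq. 1.17, could look as follows (tolerances and names are illustrative):

```fortran
! Simplified convergence test: infinite norm of the network mismatch (1.16a)
! plus the normalized correction check (1.17) on the remaining variables.
logical function newton_converged(g_mismatch, dx, x, eps_g, eps_f_abs, eps_f_rel)
   implicit none
   integer,  parameter  :: dp = kind(1.0d0)
   real(dp), intent(in) :: g_mismatch(:)   ! network mismatch vector g(x,V)
   real(dp), intent(in) :: dx(:), x(:)     ! latest Newton correction and current values of x
   real(dp), intent(in) :: eps_g, eps_f_abs, eps_f_rel

   newton_converged = maxval(abs(g_mismatch)) < eps_g .and.  &
                      all( abs(dx) < max(eps_f_abs, eps_f_rel*abs(x)) )
end function newton_converged
```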
The network equations are usually expressed in rectangular form in the system's rotating
reference frame, that is, $V = (V_x, V_y)$ and $I = (I_x, I_y)$. The selection of the speed $\omega_{ref}$
at which the x-y reference axes rotate is arbitrary; hence, it can be taken for convenience
of the computations. Standard practice for short-term (e.g. transient stability) analysis is to
choose the nominal angular frequency $\omega_0 = 2\pi f_0$, where $f_0$ is the nominal system frequency.
However, this reference frame suffers from a major drawback in long-term studies. After a
disturbance that modifies the power balance, the system settles at a new angular frequency
$\omega \neq \omega_0$. Thus, when projected onto axes rotating at the speed $\omega_0$, all phasors rotate at the
angular speed $\omega - \omega_0$. Consequently, even though the system settles at a new equilibrium,
the current and voltage components ($V_x$, $V_y$, $I_x$, and $I_y$) oscillate with a period $T = 2\pi\,|\omega - \omega_0|^{-1}$.
This behavior can trigger additional solutions of the system and forces the time step to
remain small compared to T, in order to track the oscillations and avoid numerical instability.
This problem can be partially solved by adopting the Center-Of-Inertia (COI) reference
frame initially proposed in [TM72]. The underlying idea is to link the reference axes to the
motion of the system Center Of Inertia, whose angular speed is defined as:

$$\omega_{coi} = \frac{1}{M_T} \sum_{i=1}^{m} M_i\, \omega_i \qquad\qquad (1.18)$$
where $M_i$ and $\omega_i$ are respectively the inertia and rotor speed of the i-th synchronous machine
and $M_T = \sum_{i=1}^{m} M_i$ is the total inertia. Thus, if the system settles at a new equilibrium with
angular frequency ω, all machines and the x-y reference axes rotate at the same angular
speed and the current and voltage components become constant.
However, the exact COI reference frame is computationally expensive as it introduces
Eq. 1.18 which couples the rotor speeds of all synchronous machines in the network, while
the motion equation of each machine involves ωcoi . If the simulation is performed with the
simultaneous approach, it adds a dense row and column to the Jacobian of (1.15). Moreover,
this coupling impedes the decomposition of the system, as it ties together
components all over the system. To alleviate this dependency, it has been proposed in [FV09]
to use the ωcoi value of the previous time instant. This value can be computed explicitly as it
refers to past, already known speed values. The advantages of the COI reference frame are
also preserved, because a slightly delayed COI angle is as good as the exact COI in making
the current and voltage components very little dependent on the system frequency [FV09].
This reference frame is used in this work.
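A minimal sketch of Eq. 1.18 with the one-step delay of [FV09], i.e. computing the COI speed explicitly from the rotor speeds of the previous time instant, is shown below; array names are illustrative.

```fortran
! Explicit computation of the (slightly delayed) COI speed, Eq. 1.18,
! from the machine inertias and the rotor speeds of the previous time step.
pure function omega_coi(inertia, omega_prev) result(w_coi)
   implicit none
   integer,  parameter  :: dp = kind(1.0d0)
   real(dp), intent(in) :: inertia(:)      ! M_i of each synchronous machine
   real(dp), intent(in) :: omega_prev(:)   ! rotor speeds at the previous time instant
   real(dp) :: w_coi

   w_coi = sum(inertia*omega_prev) / sum(inertia)
end function omega_coi
```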
taking into account the nominal power of the TN-DN transformers. Each one of the 146 DNs
is connected to the TN through two parallel transformers equipped with Load Tap Changer
(LTC) devices. Each DN includes 100 buses, one distribution voltage regulator equipped with
LTC, three PhotoVoltaic (PV) units [WEC14], three type-2, two type-3 Wind Turbines (WTs)
[EKM+ 11], and 133 dynamically modeled loads, namely small induction machines and ex-
ponential loads. In total, the combined transmission and distribution system includes 14653
buses, 15994 branches, 23 large synchronous machines, 438 PVs, 730 WTs, and 19419 dy-
namically modeled loads. The resulting model has 143462 differential-algebraic states. The
one-line diagram is sketched in Fig. B.3.
In the second variant, six of the aggregated DN loads in the Central area are replaced by
40 detailed Active Distribution Networks (ADNs), each equipped with the Distribution Network
Voltage (DNV) controller described in [VV13]. Each DN is a replica of the same medium-volt-
age distribution system, whose one-line diagram is shown in Fig. B.4. It consists of eight
11-kV feeders all directly connected to the TN-DN transformer, involving 76 buses and 75
branches. The various DNs were scaled to match the original (aggregate) load powers,
while respecting the nominal values of the TN-DN transformers and other DN equipment.
The combined transmission and distribution model includes 3108 buses, 20 large and 520
small synchronous generators, 600 induction motors, 360 type-3 WTs [EKM+ 11], 2136 volt-
age-dependent loads, and 56 LTC-equipped transformers. The resulting model has 36504
differential-algebraic states. The one-line diagram is sketched in Fig. B.5.
partners, including TSOs, expert companies and leading research centers in power system
analysis and applied mathematics. The University of Liège was the fourth-largest member of
this project in terms of budget.
Within this framework, a test system comprising the continental European synchronous
area (see Fig. B.8) was set up by Tractebel Engineering and RTE, the French TSO, start-
ing from network data made available by the European Network of Transmission System
Operators for Electricity (ENTSO-E). The dynamic components and their detailed controllers
were added in a realistic way. This system includes 15226 buses, 21765 branches, and 3483
synchronous machines represented in detail together with their excitation systems, voltage
regulators, power system stabilizers, speed governors and turbines. Additionally, 7211 user-
defined models (equivalents of distribution systems, induction motors, impedance and dy-
namically modeled loads, etc.) and 2945 LTC devices were also included. The resulting
model has 146239 differential-algebraic states.
are expected to host a large share of the renewable energy sources and actively sup-
port the transmission grid through smart grid technologies. Despite this, in present-day DSA
simulations, it is common to represent the bulk generation and higher voltage (transmis-
sion) levels in detail, while the lower voltage (distribution) levels are equivalenced (simplified
models are used). On the contrary, when the study concentrates on a DN, the TN is often
represented by a Thévenin equivalent. One of the motivations behind this practice has been
the lack of computational performance of existing simulation software and the still modest
penetration of active DNs.
The second proposed algorithm targets to accelerate dynamic simulations of large com-
bined transmission and distribution systems. It employs a two-level system decomposition.
First, the combined system is decomposed on the boundary between the transmission and
the DNs, revealing a star-shaped sub-domain layout. This first decomposition is reflected in a
separation of the DAEs describing the system, projecting them onto the sub-domains. Next, a
second decomposition scheme is applied within each sub-domain, splitting the sub-domain
network from the power system components connected to it, similarly to the first algorithm.
This second decomposition further partitions the DAEs describing the system. Finally, the
solution of the decomposed DAE systems describing the sub-domains is performed hierarchi-
cally with the interface variables being updated using a Schur-complement-based approach
at each decomposition level.
The proposed algorithms augment the performance of the simulation in two ways. First,
the independent calculations of the sub-systems are parallelized providing computational
acceleration. Second, some acceleration techniques are employed, exploiting the locality
of the decomposed sub-systems to avoid unnecessary computations and provide numerical
acceleration.
The algorithms are first presented with a certain level of abstraction, focusing on the math-
ematical properties of DDMs, to avoid limiting the implementation to a particular computer
architecture. Next, the details concerning their implementation using the shared-memory
parallel computing model are presented. Finally, some scenarios are simulated, using the
test systems presented above, to show the accuracy and performance of the algorithms.
Both algorithms have been implemented in the academic simulation software RAMSES2 ,
developed at the University of Liège since 2010, and are used in the context of academic
research, as well as in collaborations with industry. The implementation targets common, in-
expensive, multi-core machines without the need for expensive dedicated hardware. Modern
Fortran and the OpenMP Application Programming Interface (API) are used. The imple-
mentation is general, with no hand-crafted optimizations particular to the computer system,
operating system, simulated electric power network, or disturbance.
Chapter 3 This chapter is devoted to DDMs and their applications to power systems. First,
the motivation behind the development of DDMs and the essential components that
describe these methods are introduced. Next, an effort is made to present previous
work on power system dynamic simulation using DDMs and parallel computing.
Chapter 4 In this chapter, the first proposed algorithm is presented. First, the power sys-
tem decomposition and the formulation of the sub-domain DAEs is described. Next,
the solution of the decomposed system and the treatment of the interface variables is
detailed. Then, the numerical acceleration techniques are introduced, followed by the
description of parallel implementation of the algorithm. Finally, the performance of the
algorithm is assessed using the test systems described above.
The mathematical aspects of the algorithm were published in [AFV14] (material ac-
cepted for publication in 2012), while its application to parallel power system dynamic
simulations was presented in [AFV13b]. Next, the acceleration techniques were pre-
sented in [AFV13a]. A paper focusing on the real-time performance capabilities of the
algorithm was presented in [AV14c]. Finally, the complete algorithm was presented at
the panel session "Future Trends and Directions in Dynamic Security Assessment" at
the 2014 IEEE Power & Energy Society General Meeting in Washington DC [AV14b].
Chapter 5 In this chapter, the second proposed algorithm is presented, featuring a two-level
power system decomposition. The hierarchical solution of the decomposed systems
and the treatment of the interface variables is detailed, followed by some numerical
acceleration techniques used to speedup the procedure. Next, details on the parallel
implementation of the algorithm are given. Finally, the performance of the algorithm
is assessed using the combined transmission and distribution test systems described
above, as well as the Hydro-Québec real system.
Chapter 6 In this chapter, the contribution of this thesis is summarized and some plans for
future work are suggested.
Overall, this thesis expands the following material which has been published, submitted, or
is under preparation, in various journals and conferences:
Under preparation:
[17] T. Kyriakidis, P. Aristidou, D. Sallin, T. Van Cutsem, and M. Kayal. A Linear Algebra
Enabled Mixed-Signal Computer for Power System Applications. To be submitted to
ACM Journal on Emerging Technologies in Computing Systems, 2015
Under review:
[16] P. Aristidou, S. Lebeau, and T. Van Cutsem. Fast Power System Dynamic Simulations
using a Parallel two-level Decomposition Algorithm. Submitted to IEEE Transactions on
Power Systems (under review), 2015.
Published:
[13] F. Olivier, P. Aristidou, D. Ernst, and T. Van Cutsem. Active management of low-voltage
networks for mitigating overvoltages due to photovoltaic units. IEEE Transactions on
Smart Grid (in press), 2015. Available at: http://hdl.handle.net/2268/172623
[12] P. Aristidou and T. Van Cutsem. A parallel processing approach to dynamic simula-
tions of combined transmission and distribution systems. International Journal of Elec-
trical Power & Energy Systems, vol. 72, pp. 58–65, 2015.
Available at: http://hdl.handle.net/2268/178765
[11] P. Aristidou, S. Lebeau, L. Loud, and T. Van Cutsem. Prospects of a new dynamic sim-
ulation software for real-time applications on the Hydro-Québec system. In Proceed-
ings of 2015 CIGRÉ Canada conference (accepted, final paper due June 15, 2015),
Winnipeg, September 2015.
[8] P. Aristidou and T. Van Cutsem. Dynamic Simulations of Combined Transmission and
Distribution Systems using Parallel Processing Techniques. In Proceedings of the 18th
Power System Computational Conference (PSCC), Wroclaw, August 2014.
Available at: http://orbi.ulg.ac.be/handle/2268/168618
[7] P. Aristidou and T. Van Cutsem. Algorithmic and computational advances for fast
power system dynamic simulations. In Proceedings of the 2014 IEEE PES General
Meeting, Washington DC, July 2014. Available at: http://hdl.handle.net/2268/163168
[6] P. Aristidou, F. Olivier, D. Ernst, and T. Van Cutsem. Distributed model-free control of
photovoltaic units for mitigating overvoltages in low-voltage networks. In Proceedings
of 2014 CIRED workshop, Rome, June 2014.
Available at: http://hdl.handle.net/2268/165629
[5] P. Aristidou and T. Van Cutsem. Parallel computing and localization techniques for
faster power system dynamic simulations. In Proceedings of 2014 CIGRÉ Belgium
conference, Brussels, March 2014. This paper received the Best Paper student award.
Available at: http://hdl.handle.net/2268/161322
[4] P. Aristidou, D. Fabozzi, and T. Van Cutsem. Dynamic simulation of large-scale power
systems using a parallel Schur-complement-based decomposition method. IEEE Trans-
actions on Parallel and Distributed Systems, 25(10):2561–2570, Oct 2014.
Available at: http://hdl.handle.net/2268/156230
[3] P. Aristidou and T. Van Cutsem. Dynamic simulations of combined transmission and
distribution systems using decomposition and localization. In Proceedings of 2013
IEEE PES PowerTech conference, Grenoble, June 2013. This paper received the High
Quality Paper award. Available at: http://hdl.handle.net/2268/145092
[2] P. Aristidou, D. Fabozzi, and T. Van Cutsem. Exploiting localization for faster power
system dynamic simulations. In Proceedings of 2013 IEEE PES PowerTech confer-
ence, Grenoble, June 2013. Available at: http://hdl.handle.net/2268/145093
[1] P. Aristidou, D. Fabozzi, and T. Van Cutsem. A Schur complement method for DAE
systems in power system dynamic simulations. In Domain Decomposition Methods in
Science and Engineering XXI , volume 98 of Lecture Notes in Computational Science
and Engineering, Springer International Publishing, 2014 (material accepted for publi-
cation in 2012). Available at: http://hdl.handle.net/2268/154312
CHAPTER 2

Think parallel
Power wall: unacceptable growth in power usage with clock rate. The power wall results
because power consumption (and heat generation) increases nonlinearly as the clock
rate increases. Increasing clock rates any further would exceed the power density that
can be dealt with by air cooling, and result in power-inefficient computation [MRR12].
Instructional parallelism wall: limits to available low-level parallelism (see Section 2.2.3).
It becomes increasingly difficult to find enough parallelism in a single instruction
stream to keep a high-performance single-core processor busy.
Memory wall: a growing discrepancy of processor speeds relative to memory speeds [McK04].
This, in effect, pushes cache sizes to become larger in order to mask the latency of memory.
However, larger caches help only to the extent that memory bandwidth is not the bottleneck in
performance.
Nevertheless, the number of transistors that can be put on a single chip is still expected
to grow exponentially for many years (Moore’s law). So, a simple idea is to use this extra
area of silicon to add multiple cores on the same chip, each of lower frequency and power
consumption. Thus, the processor now has the potential to do multiple times the amount of
work. Moreover, when the processor clock rate falls, the memory wall problems become less
noticeable [McK04].
On the one hand, unparallelized applications under-utilize current multi-core processors
and leave significant performance on the table. In addition, such serial applications will not
improve in performance over time. On the other hand, efficiently parallelized applications can
fully exploit multi-core processors and should be able to scale automatically to even better
performance on future processors. Over time, this will lead to large and decisive differences
in performance between sequential and parallel programs.
Knowing the “arsenal” at hand, that is the available parallel computing tools and tech-
niques, makes it possible to detect parallelization opportunities in a computational problem and
to assess whether the available parallelism is exploitable (i.e., worth the trouble). However,
parallel computing is a very active research field, with new architectures being invented and
new tools implemented every day. It is thus impossible to exhaustively list and detail all of
them here. In this chapter we first try to categorize the main parallel architectures. Then, the
theory behind the performance assessment of parallel algorithms and implementations is
outlined, followed by some common pitfalls that can hinder their efficiency. Emphasis is given
to shared-memory parallel computing techniques and equipment, as these are used later in
this work.
The computations performed by a given program may provide opportunities for parallel ex-
ecution at different levels: algorithm-level, data- and task-level, instruction-level, or bit-level.
Depending on the level considered, tasks of different granularity result.
2.2.1 Algorithm-level parallelism

This is the top-level or coarse-grain parallelism and pertains to a certain level of machine-
independent abstraction. In scientific computing problems, algorithm-level parallelism is
closely coupled to the mathematical formulation of the problem and the methods used in
its solution. It is known that the same scientific problem might be solved with a variety of
mathematical tools. However, some of these tools offer higher potential for parallelism than
others. Thus, even though a certain method might be the fastest in sequential execution,
it is sometimes useful to choose a different method that can be more efficiently parallelized.
A well-known family of solution methods offering high potential for parallelism is that of Domain
Decomposition Methods (DDMs), which will be examined later in Chapter 3.
This type of parallelism requires a good knowledge of the underlying problem and its
solution mechanics. It is thus impossible for an automatic tool to reformulate an algorithm so
as to expose algorithm-level parallelism.
2.2.2 Data-level and task-level parallelism

Data Level Parallelism (DLP) is parallelism in which the same operations are being performed
on different pieces of data concurrently. Because the operations are on different data, they
are known to be independent, which means that dependence checking is not needed. Task
Level Parallelism (TLP) focuses on faster execution by dividing calculations onto multiple
cores. TLP programs might execute the same or different code on the same or different data.
There is no clear separation between the two and a typical program exhibits both types
of parallelism. Moreover, compilers cannot easily find TLP and DLP to exploit in a program,
so the programmer must usually perform extra work to specify when this parallelism exists.
2.2.3 Instructional parallelism

Instruction Level Parallelism (ILP) typically refers to how a sequential program run on a sin-
gle core can be split into micro-instructions. Multiple micro-instructions from subsequent
instructions of the same program are then executed concurrently in a pipeline. This type
of parallelism is driven by the compiler. For the programmer, this has the advantage that
sequential programming languages can lead to a parallel execution of instructions without
his/her intervention.
However, the degree of parallelism obtained by ILP is limited, since it is not possible to
partition the execution of the instruction into a very large number of steps of equal size. This
limit has already been reached for some time for typical processors [RR13] and it becomes
increasingly difficult to find enough ILP in a single instruction stream to keep a high-
performance single-core processor busy.
• Single Instruction, Single Data (SISD): This is just a standard non-parallel processor.
• Single Instruction, Multiple Data (SIMD): A single operation (task) executes simul-
taneously on multiple elements of data. SIMD processors are also known as array
processors, since they consist of an array of functional units with a shared controller.
• Multiple Instruction, Multiple Data (MIMD): Separate instruction streams, each with
its own flow of control, operate on separate data. This characterizes the use of multiple
cores or processors, each executing its own instruction stream.
[Figure: taxonomy of parallel computers, split into MIMD and SIMD categories]
• Multiple Instruction, Single Data (MISD): This last possible combination is not par-
ticularly useful and is not used.
The advantage of Flynn’s taxonomy is that it is very well established. Every parallel program-
mer is familiar with the terms MIMD and SIMD. There are, however, some serious problems,
the biggest one being that it provides only four slots to categorize a huge variety of existing
systems. This granularity doesn’t give us enough ways to separate systems [MSM04].
• Shared Memory:
– Symmetric MultiProcessing (SMP): The memory is physically shared, and all pro-
cessors access the memory equally at equal speeds. These are sometimes called
Uniform Memory Access (UMA).
– Non-Uniform Memory Access (NUMA): The memory is physically shared, but not
distributed in a one to one relation with the processors. Access to different portions
of the memory may require significantly different times.
– Distributed Shared Memory (DSM): The memory is distributed among the proces-
sors, but the system gives the illusion that it is shared. It is also called virtual
shared memory.
• Distributed Memory:
– Fixed: The number of connections is fixed as more processors are added (e.g.,
Ethernet-connected workstations, since the Ethernet interconnection is a single
resource that all processors share).
– Linear: The number of connections grows linearly with the number of nodes (e.g.,
mesh-connected multicomputers such as the Intel Paragon).
In the following subsections a summary of the parallel programming models considered for
this work will be presented.
Graphics Processing Units (GPUs) are highly parallel vector processors with local memory
for each of the processing cores. The total amount of memory is usually smaller than tradi-
tional multi-core CPUs (across different levels of cache) but the number of processing cores
is normally larger [LDTY11]. They were originally developed for rendering real-time visual
effects in the video gaming industry and are present in all modern computers. Recently, they
have become programmable to the point where they can be used as a general purpose pro-
gramming platform. General purpose programming on the GPU (GPGPU) [FM04, HA11] is
currently getting a lot of attention in the scientific community due to the low cost, high avail-
ability and computational power. GPUs are really good for fine-grained parallelization and
have been used in many linear algebra applications [HKB12, BHSt13].
Power system applications that have GPU-accelerated linear algebra operations have
been presented, e.g. power flow studies [GNV07, Gar10, SA10, VMMO11, ADK12, GJY+ 12,
LL14]; state estimation [TDL11, KD13]; transient stability/electromechanical transients sim-
ulation [JMD10, JMZD12, BWJ+ 12, QH13]; electromagnetic transients simulation [DGF12,
Cie13, GO13, ZD14]; and optimization, e.g. Optimal Power Flow [RR14].
However, GPUs are not as good at handling the irregular computation patterns (unpre-
dictable branches, looping conditions, irregular memory access patterns, etc.) that most
engineering software deal with. Hence, in all cases, the heterogeneous computing concept
is used: the CPU assumes main control of the application and the GPU is accelerating the
burdensome linear algebra operations. The CPU-to-GPU data transfer link has relatively high
latency, introducing a significant bottleneck in the execution of the program [TOG14].
Finally, a high effort is needed to develop and maintain GPGPU code, and portability is
low as no common standard exists among GPU vendors. Although some programming
models have been implemented trying to facilitate programming on such architectures and
increase portability of the code (e.g. OpenCL), at the moment of starting this project in 2011
they were still in the early stages without any accumulated experience and documentation.
For all the above reasons, this programming model was rejected.
The Cilk (pronounced “silk”) project originated in the mid-1990s at M.I.T. Its successor,
Cilk Plus, is integrated with a C/C++ compiler and extends the language with the addition
of keywords and array section notation. It uses the fork–join pattern [MRR12] to support ir-
regular parallel programming patterns, parallel loops to support regular parallel programming
patterns, and supports explicit vectorization via array sections, pragma simd, and elemental
functions [MRR12].
TBB is a library, not a language extension, and thus can be used with any compiler
supporting ISO C++. It relies on templates and generic programming and uses C++ features
to implement its “syntax.” TBB requires the use of function objects (also known as functors)
to specify blocks of code to run in parallel. Like Cilk Plus, TBB is based on programming in
terms of tasks, not threads. This allows it to reduce overhead and to more efficiently manage
resources. As with Cilk Plus, TBB implements a common thread pool shared by all tasks and
balances load via work-stealing [MRR12].
OpenMP is a standard organized by an independent body called the OpenMP Architec-
ture Review Board. It is based on a set of compiler directives or pragmas in Fortran, C,
and C++ combined with an API for thread management. OpenMP is designed to simplify
parallel programming for application programmers working in High-Performance Computing
(HPC), including the parallelization of existing serial codes. Prior to OpenMP (first released
in 1997), computer vendors had distinct directive-based systems. OpenMP standardized
common practice established by these directive-based systems and is supported by most
compiler vendors including the GNU compilers and other open source compilers [MRR12].
The previous three models are suitable to implement the fork-join parallel pattern asso-
ciated with “divide-and-conquer” algorithms. Moreover, all three models exhibit high perfor-
mance, have increased compatibility with existing platforms and compilers, and are widely
adopted on shared-memory multi-core computers. However, out of these candidates, the
OpenMP API was chosen for our application. The main reason behind this choice is the
model’s support of modern Fortran language which is the language of the simulation soft-
ware RAMSES. More details about the selection of the programming language and the par-
allel programming model are given in Appendix C.
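As a flavour of how the fork-join pattern maps onto OpenMP in Fortran, the sketch below spreads independent per-injector computations over the threads of the pool; solve_injector is a hypothetical placeholder, and the sketch is not taken from the RAMSES code.

```fortran
subroutine solve_all_injectors(n_inj)
   implicit none
   integer, intent(in) :: n_inj
   integer :: i

   ! Each injector sub-system is independent at this stage, so the loop
   ! iterations can be shared among the threads of the pool.
   !$omp parallel do schedule(dynamic)
   do i = 1, n_inj
      call solve_injector(i)   ! hypothetical per-injector solution routine
   end do
   !$omp end parallel do
end subroutine solve_all_injectors
```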
Some typical power system applications have been investigated using this model, such
as: power flow studies [Bos95, DS11, LDTY11, NAA11, Fra12, ASC13]; state estimation
[NMT+ 06, LDTY11, DN12]; transient stability/electromechanical transients simulation [CZBT91,
BVL95, HB97, XCCG10, AF12, LDG+ 12]; and, electromagnetic transients simulation [UBP10,
UH11, UD13].
The primary purpose of parallelization, as discussed in this work, is performance. But what
is performance? Usually it is about one of the following:
The scalability of a parallel implementation can be defined as:

$$\mathrm{Scalability}_M = \frac{T_1}{T_M} \qquad\qquad (2.1)$$

where $T_1$ is the runtime of the program with one worker and $T_M$ is the runtime of the same
program, using the same algorithm, with M workers. This index shows how well the parallel
implementation scales when the number of available workers is increased.
Optimally, scalability would be linear, i.e. with double the number of processing units
the program should execute in half the runtime, and doubling it a second time should again
halve the runtime. However, very few parallel algorithms achieve optimal scalability except
embarrassingly parallel problems (where there exists no dependency or communication be-
tween parallel tasks). Most real-life parallel programs exhibit a near-linear scalability for small
numbers of processing units, which flattens out into a constant value for large numbers of
workers.
The speedup of a parallel implementation expresses the relative saving of execution time
that can be obtained by using a parallel execution on M processing units compared to the
best sequential implementation. This can be formulated as:
\text{Speedup}_M = \frac{T_1^*}{T_M} \qquad (2.2)
where T1∗ is the runtime of the program with one worker using the fastest (or a very fast)
sequential algorithm. For power system dynamic simulations, the popular VDHN scheme
(see Appendix A) has been suggested as the sequential benchmark [CB93]. Although there
is no proof that this is the fastest sequential algorithm, it is employed by many industrial and
academic software and its capabilities and performance are well-known.
Moreover, the sequential algorithm needs to solve exactly the same problem as the parallel one, with the same accuracy. For this reason, all the algorithms considered in this
work (proposed parallel and sequential VDHN) have been implemented in the same soft-
ware (RAMSES). More precisely, they solve exactly the same model equations, to the same
accuracy, using the same algebraization method (namely the second-order BDF), way of
handling the discrete events, mathematical libraries (e.g. sparse linear solver), time-stepping
strategy, etc. Keeping the aforementioned parameters the same allows for a more rigorous
evaluation of the proposed algorithm’s performance.
Finally, the efficiency of a parallel algorithm is defined as:
\text{Efficiency}_M = \frac{\text{Scalability}_M}{M} = \frac{T_1}{M\,T_M} \qquad (2.3)
It is a value, typically between zero and one, estimating how well-utilized the processors
are in solving the problem, compared to how much effort is wasted in communication and
synchronization. Ideally, efficiency should be equal to one, which corresponds to a linear
scalability, but many factors can reduce it (as already mentioned) [MRR12]. Algorithms with
linear speedup and algorithms running on a single processor have an efficiency of one, while
many difficult-to-parallelize algorithms have an efficiency such as 1/\ln M, which approaches zero as the number of processors increases.
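As a purely illustrative numerical example, a program that runs in T_1 = 100 s with one worker and in T_M = 20 s with M = 8 workers has a scalability of 100/20 = 5 and an efficiency of 5/8 ≈ 0.63; if 100 s is also the runtime of the best sequential implementation, its speedup is 5 as well.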
In the following subsections, the three main theories for predicting or assessing the performance of parallel algorithms are outlined.
Figure 2.2: Theoretical scalability as a function of the number of workers, for values of the parallel fraction P between 40% and 100%
Amdahl's law assumes that the computations of a program can be divided into a purely sequential part and a perfectly parallelizable part, so that the runtime with M workers becomes:

T_M = T_S + \frac{T_P}{M} \qquad (2.4)
where TS is the time spent in the serial part of the code and TP the time spent in the parallel
part. Of course, TP and TS have to account for 100% of T1 , that is T1 = TP + TS . Based on
(2.1) and (2.4), scalability is rewritten as:
\text{Scalability}_M = \frac{T_S + T_P}{T_S + T_P / M} \qquad (2.5)
Software profiling can give a measurement of TP and TS based on the portions of the code
scheduled for parallelization. However, if only the semantics (“blueprint”) of the algorithm are
available, then these values have to be estimated from the mathematical formulation and the
expected computation times.¹

¹ For the calculations presented in Chapters 4 and 5, the profiling software Intel VTune Amplifier was used on a version of the algorithms executed on one core to obtain the values of the parallel and sequential portions.
Using Eq. 2.5, the expected performance of a parallel algorithm can be calculated for a varying number of workers. Figure 2.2 shows these calculations for different values of P = T_P / (T_S + T_P). It can be seen why linear scalability is extremely difficult to achieve: even with 1% of the computations in sequential execution, the scalability differs significantly from linear. Moreover, if the limit M → ∞ is taken in Eq. 2.5, the resulting value shows that the maximum scalability is bounded by the serial time T_S.
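As a small illustrative sketch (not part of RAMSES, and with an assumed parallel fraction P = 95%), the curves of Fig. 2.2 can be reproduced by evaluating Eq. 2.5 directly, here in Fortran:

program amdahl_scalability
   ! Tabulate the Amdahl scalability of Eq. 2.5 for a given parallel fraction P.
   implicit none
   integer :: m
   real :: p, ts, tp
   p  = 0.95            ! assumed parallel fraction, e.g. obtained from profiling
   tp = p               ! parallel part of the normalized runtime T1 = 1
   ts = 1.0 - p         ! serial part
   do m = 1, 24
      print '(a,i3,a,f6.2)', 'workers = ', m, '  scalability = ', &
            (ts + tp) / (ts + tp / real(m))          ! Eq. 2.5
   end do
end program amdahl_scalability

With P = 95%, for instance, the predicted scalability is only about 11 with 24 workers and can never exceed 1/(1 − P) = 20.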
Unfortunately, Eq. 2.5 gives an optimistic upper bound of the algorithm’s scalability. It
assumes that the parallel computations are infinitely divisible and ignores the OverHead
Cost (OHC) associated to making the code run in parallel and managing the threads and
the communication between them. The communication might be the command for all the
threads to start, the exchange of information, or it might represent each thread notifying the
main thread that it has completed its work. To compensate for this OHC, Amdahl’s law is
modified to [Gov10]:
T_M = T_S + \frac{T_P}{M} + OHC(M) \qquad (2.6)
where the OHC ( M) is proportional to memory latency for those systems that communicate
through memory, or cache latency if all the communicating threads share a common level of
cache.
This modified formula suggests that the scalability of an application can be increased
either by increasing the percentage of computations in the parallel portion or by reducing the
synchronization and communication costs. Also, when OHC is large enough and for small
amounts of parallel work (TP ), situations can be encountered where increasing the number
of available threads will actually slow down the application. That is, if:
T_M - T_{M+1} = \underbrace{\frac{T_P}{M} - \frac{T_P}{M+1}}_{\text{incremental gain}} + \underbrace{\big(OHC(M) - OHC(M+1)\big)}_{\text{incremental OHC}} < 0 \qquad (2.7)
then adding extra workers to a parallel program can be detrimental to its performance. Unfor-
tunately, there is no way to calculate the exact value of OHC before implementing the parallel
code.
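As a purely hypothetical numerical illustration of condition (2.7): if T_P = 10 s and the overhead grows as OHC(M) = 0.1 M s, then moving from M = 10 to M = 11 workers yields an incremental gain of 10/10 − 10/11 ≈ 0.09 s but an incremental OHC of −0.1 s, so T_10 − T_11 ≈ −0.01 s < 0 and the extra worker actually slows the program down.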
Figure 2.3: Work-span: scalability is limited due to the discretized nature of the parallel tasks. A sequential task S (e.g. data reading, initialization) must be executed before the parallel portion, which consists of independent tasks P1, P2, P3 that can be performed in parallel but are not further divisible; work = T1 = S + P1 + P2 + P3 and span = T∞ = T3 = S + P3.
Gustafson-Barsis’ law notes that as the problem size grows to take advantage of more
powerful computers, the work required for the parallel part of the problem usually grows faster than the serial part. Consequently, the fraction T_S / T_P decreases and scalability improves. This
observation should be used as a guideline when deciding the algorithm-level parallelism. If a
parallel algorithm is designed correctly, it should take advantage of the future problem scaling
to increase its performance automatically.
As mentioned above, the trend in power system simulations is to model in higher detail
the electric components. Another target is to include detailed models of DNs and perform
combined simulations of TN and DN systems to obtain more consistent results among them
[Hed14]. The algorithms proposed in Chapters 4 and 5 were designed to accommodate this future model expansion in their parallel part, thus leading to higher scalability as the modeling demands increase, with the same parallel computing resources.
Both Amdahl’s and Gustafson-Barsis’ laws are correct. The difference lies in whether the
aim is to make a program run faster with the same workload or run in the same time with
a larger workload. History clearly favors programs getting more complex and solving larger
problems, so Gustafson’s observations fit the historical trend. Nevertheless, Amdahl’s law
still haunts us when the target is to make an application run faster on the same workload to
meet some latency (e.g. real-time simulations) [MRR12].
As mentioned earlier, Amdahl’s law makes the assumption that computations in the parallel
portion of the algorithm are infinitely divisible. However, this is not true for most applications.
In the work-span model, time T1 is called the work of an algorithm. It is the time that the
algorithm would take running on one core. Time T∞ is called the span of an algorithm and
is the time a parallel algorithm would take on an ideal machine with an infinite number of
processors. Alternatively, the span gives the longest chain of tasks that must be executed
one after the other [MRR12].
This analysis takes into consideration situations where the parallel computations are dis-
crete, thus increasing the workers beyond a certain number has no effect on scalability.
Figure 2.3 shows one such example: increasing the workers to more than three has no effect
on the execution time as the parallel tasks are not further divisible, thus T∞ = T3 = S + P3 .
So, the serial and the largest indivisible parallel tasks dictate the algorithm’s span.
This model gives a more realistic upper limit to the scalability:

\text{Scalability}_M = \frac{T_1}{T_M} \le \frac{\text{work}}{\text{span}} \qquad (2.8)

while the runtime with M workers is itself bounded by:

T_M \le \frac{T_1 - T_\infty}{M} + T_\infty \qquad (2.9)
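As a small worked example with purely illustrative numbers, let the tasks of Fig. 2.3 have S = 2 s, P1 = P2 = 1 s and P3 = 3 s. Then work = T_1 = 2 + 1 + 1 + 3 = 7 s and span = T_∞ = S + P3 = 5 s, so by (2.8) the scalability can never exceed 7/5 = 1.4 no matter how many workers are used, while (2.9) gives, for M = 3, T_3 ≤ (7 − 5)/3 + 5 ≈ 5.67 s.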
2.6 Shared-memory computers performance considerations

2.6.1 Synchronization
A race condition occurs when concurrent tasks perform operations on the same memory
location without proper synchronization, and one of the memory operations is a write. Code
with a race may sometimes fail unpredictably. Races are not limited to memory locations but
can also happen with I/O operations. For example, if two tasks try to print Hello at the same
time, the output might look like HeHelllloo, or might even crash the program [MRR12].
To avoid such problems, the programmer should clearly map the data dependencies of
the algorithm's parallel tasks. Once this is done, the first priority should be to eliminate as
many of these dependencies as possible to avoid introducing synchronization that leads
to increased OHC. Then, for the remaining dependencies, the appropriate synchronization
mechanisms should be used to ensure the correct execution of the program.
OpenMP offers several mechanisms to help the programmer with this task. For exam-
ple, when a parallel segment is defined, all the data variables accessed by the parallel tasks
can be categorized as shared (visible and accessible by all threads simultaneously), private (each thread has its own local copy, used as a temporary variable), and firstprivate (like private, except that the copy is initialized to the original value). The default clause sets the default data-sharing attribute for all variables.
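As a minimal illustrative sketch in Fortran (not taken from RAMSES; names and values are hypothetical), these clauses can be combined as follows:

subroutine update_rhs(n, v, rhs)
   ! Each thread uses a private temporary while all threads access the shared arrays.
   implicit none
   integer, intent(in)  :: n
   real, intent(in)     :: v(n)
   real, intent(inout)  :: rhs(n)
   integer :: i
   real    :: tmp
   !$omp parallel do default(none) shared(n, v, rhs) private(i, tmp)
   do i = 1, n
      tmp    = 2.0 * v(i)       ! private scratch variable, one copy per thread
      rhs(i) = rhs(i) + tmp     ! each iteration writes a distinct element: no data race
   end do
   !$omp end parallel do
end subroutine update_rhs

Using default(none) forces every variable appearing in the parallel region to be given an explicit data-sharing attribute, which is a simple way to catch accidental sharing.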
Other mechanisms available in OpenMP to eliminate data races are the critical sections,
which define portions of the code to be executed only by one thread at a time, and atomic,
which ensures that a specific storage location is updated atomically (by only one thread).
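For example, an atomic update can protect a single shared accumulator; the following sketch is illustrative only and not part of RAMSES:

subroutine sum_powers(n, p_inj, total)
   ! Sum the elements of p_inj into the shared variable total without a data race.
   implicit none
   integer, intent(in) :: n
   real, intent(in)    :: p_inj(n)
   real, intent(out)   :: total
   integer :: i
   total = 0.0
   !$omp parallel do shared(p_inj, total) private(i)
   do i = 1, n
      !$omp atomic
      total = total + p_inj(i)   ! only one thread at a time updates total
   end do
   !$omp end parallel do
end subroutine sum_powers

In practice, the same loop is more efficiently written with a reduction(+:total) clause, which lets each thread accumulate a private partial sum and combines them once at the end.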
Another synchronization technique in OpenMP uses locks. Locks are a low-level way to
eliminate races by enforcing limits on access to a resource in an environment where there
are many threads of execution. A lock is designed to enforce a mutual exclusion concurrency
control policy. However, the extensive use of locks can lead to strangling scalability or even
deadlocks (when at least two tasks wait for each other and each cannot resume until the
other task proceeds). Such low-level mechanisms are not used in RAMSES and thus will not
be detailed here.
Finally, it might be useful to use a data race detection program that can identify potential
problems in the code. One such program is Intel Inspector XE, which has been routinely
used to detect data racing problems in RAMSES.
Two kinds of data locality are usually distinguished [MRR12]:

• Temporal locality: A worker is likely to access the same location again in the near future.

• Spatial locality: A worker is likely to access nearby locations in the near future.
The locality model asserts that memory accesses close in both time and space are cheaper
than those that are far apart. Having good locality in a program can significantly reduce
communication OHC [MRR12].
The cost of communication is not uniform and varies depending upon the type of shared-
memory architecture and the location of the worker. Small shared-memory machines (e.g.
multi-core laptops and office desktops) have UMA architecture; thus each individual proces-
sor can access any memory location with the same speed. On the contrary, larger shared-
memory machines usually have NUMA architecture, hence some memory may be “closer to”
one or more of the processors and accessed faster by them [CJV07].
The main benefit of NUMA computers over UMA is scalability, as it is extremely difficult to
scale UMA computers beyond 8-12 cores. At that number of cores, the memory bus is under
heavy contention. NUMA is one way of reducing the number of CPUs competing for access
to a shared memory bus by having several memory buses and only having a small number
of cores on each of those buses.
The cache coherent NUMA (cc-NUMA) node presented in Fig. 2.4 is part of a 48-core
NUMA parallel computer, based on 6238 AMD Opteron Interlagos, available at the University
of Liège. The computer has four identical sockets, each hosting two NUMA nodes with six
cores (as sketched in Fig. 2.4). So, even though the system physically has four CPU sockets
with 12 cores each, there are in fact eight NUMA nodes with six cores each.
Resources within each node are tightly coupled with a high speed crossbar switch and
access to them inside a NUMA node is fast. Moreover, each core has dedicated L1 cache,
every two cores have shared L2 cache and the L3 cache is shared between all six cores.
These nodes are connected to each other with HyperTransport 3.0 links. The bandwidth
is limited to 12GB/s between the two nodes in the same socket and 6GB/s to other nodes.
Thus, the cost is minimal for lanes of a vector unit, relatively low for hardware threads on
the same core, more for those sharing an on-chip cache memory, and yet higher for those in
different sockets [MRR12].
Parallel applications executing on NUMA computers need special consideration to avoid
high OHC. First, given the large remote memory access latency, obtaining a program with
a high level of data locality is of the utmost importance. Hence, some features of the ar-
chitecture and Operating System (OS) affect the application’s performance (bind threads to
particular CPUs, arrange the placement and migration of memory pages, etc.) [CJV07].
Data accessed more frequently by a specific thread should be allocated “close” to that
thread. First Touch memory allocation policy, which is used by many OS, dictates that the
thread initializing an object gets the page associated with that item in the memory local to the
processor it is currently executing on. This policy works surprisingly well for programs where
the updates to a given data element are typically performed by the same thread throughout
the computation. Thus, if the data access pattern is the same throughout the application, the
initialization of the data should be done inside a parallel segment using the same pattern so
as to have a good data placement in memory. This data initializing procedure is followed in
RAMSES.
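A minimal sketch of such a first-touch initialization (illustrative only, not the actual RAMSES code) is the following: the array is initialized inside a parallel loop with the same static schedule that will later be used to access it, so that each memory page ends up close to the thread that uses it.

subroutine first_touch_init(n, a)
   implicit none
   integer, intent(in) :: n
   real, intent(out)   :: a(n)
   integer :: i
   !$omp parallel do schedule(static) private(i)
   do i = 1, n
      a(i) = 0.0   ! the thread that first touches a(i) gets its page allocated locally
   end do
   !$omp end parallel do
end subroutine first_touch_init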
Some further consideration is needed when large amounts of data are read from files, to avoid page migration during the initialization. This problem usually affects NUMA machines with low link speed and applications with intensive I/O procedures. In power system dynamic security assessment, the data reading is usually done once and the data are then used numerous times to assess several different contingencies on the same system, thus this feature is not critical
to their overall performance.
The second challenge on a cc-NUMA platform is the placement of threads onto the com-
puting nodes. If during the execution of the program a thread is migrated from one node to
another, all data locality achieved by proper data placement is destroyed. To avoid this we
need some method of binding a thread to the processor by which it was executed during
the initialization. In the proposed implementation, the OpenMP environment variable OMP_-
PROC_BIND is used to prevent the execution environment from migrating threads. Several
other vendor specific solutions are also available, like kmp_affinity in Intel OpenMP imple-
mentation, taskset and numactl under Linux, pbind under Solaris, bindprocessor under IBM
AIX, etc.
One of the most important tasks of parallel programming is to make sure that parallel threads
receive equal amounts of work [Cha01b, CJV07]. Imbalanced load sharing among threads
will lead to delays, since some of them will be still working while others will have finished and
be idle. This is shown in Fig. 2.3 where, in the case of three cores, two of them finish
their jobs faster. Load imbalance can be mitigated by further decomposition of the parallel
tasks and properly sharing them among the workers [MRR12]. Like packing suitcases, it is
easier to spread out many small items evenly than a few big items.
Fortunately, OpenMP offers some mechanisms to facilitate load balancing. For example,
when considering the loop-level parallelization (a construct frequently used in engineering
problems), the schedule clause allows for the proper assignment of loop iterations to threads.
However, the best load balancing strategy depends on the computer, the actual data input,
and other factors not known at programming time. In the worst case, the best strategy may
change during the execution time due to dynamic changes in the behavior of the loop or in
the resources available in the system. Even for advanced programmers, selecting the best
load balancing strategy is not an easy task and can potentially take a large amount of time.
OpenMP offers three default strategies to assign loop iterations (where each iteration is
treated as a task) to threads. With the static strategy, the scheduling is predefined and one
or more successive iterations are assigned to each thread rotationally prior to the parallel
execution. This decreases the overhead needed for scheduling but can introduce load im-
balance if the workload inside each iteration is not the same. With the dynamic strategy,
the scheduling is dynamic during the execution. This introduces a high OHC for managing
the threads but provides the best possible load balancing. Finally, with the guided strategy,
the scheduling is again dynamic but the number of successive iterations assigned to each
thread is progressively reduced. This way, scheduling OHC is reduced at the be-
ginning of the loop and good load balancing is achieved at the end. Of course, many other,
non-standard, scheduling strategies have been proposed in literature [Cha01b].
In general, when deciding the scheduling strategy, the following should be considered:

• optimizing spatial locality, by assigning tasks that access contiguous memory segments to the same thread whenever possible; and,

• achieving a good load balance among the threads while keeping the scheduling overhead as low as possible.
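As an illustrative Fortran sketch (not taken from RAMSES; the contained function is a stand-in for a task whose cost varies from iteration to iteration), the strategy is selected with the schedule clause of the parallel loop directive:

subroutine solve_all_tasks(n, state)
   implicit none
   integer, intent(in)    :: n
   real,    intent(inout) :: state(n)
   integer :: i
   ! dynamic scheduling with chunks of 4 iterations: threads grab a new chunk as soon
   ! as they finish one, balancing uneven per-iteration costs at a higher scheduling OHC
   !$omp parallel do schedule(dynamic, 4) private(i)
   do i = 1, n
      state(i) = work(state(i))
   end do
   !$omp end parallel do
contains
   pure real function work(x)   ! placeholder for a task whose cost varies per iteration
      real, intent(in) :: x
      work = x + sin(x)
   end function work
end subroutine solve_all_tasks

Replacing schedule(dynamic, 4) with schedule(static) or schedule(guided) switches to the other two standard strategies without modifying the loop body.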
The above performance considerations will be revisited in Chapters 4 and 5, after the parallel
algorithms have been defined. The selection of the proper synchronization and scheduling
methods will be justified based on the proposed parallel algorithm specifications.
2.7 Description of computers used in this work

The following computers were used for the simulations presented in this work:

1. AMD Opteron Interlagos CPU 6238 @ 2.60GHz, 16KB private L1, 2048KB shared per two cores L2 and 6144KB shared per six cores L3 cache, 64GB RAM, Debian Linux 8 (Fig. 2.4)
2. Intel Core i7 CPU 4510U @ 3.10GHz, 64KB private L1, 512KB private L2 and 4096KB
shared L3 cache, 7.7GB RAM, Microsoft Windows 8.1
3. Intel Core i7 CPU 2630QM @ 2.90GHz, 64KB private L1, 256KB private L2 and
6144KB shared L3 cache, 7.7GB RAM, Microsoft Windows 7
2.8 Summary
Programmers can no longer depend on rising clock rates (frequency scaling) to achieve
increasing performance for each new processor generation, due to the power wall. Moreover,
they cannot rely on automatic mechanisms to find parallelism in naive serial code, due to
the instructional parallelism wall. Thus, to achieve higher performance, they have to write
explicitly parallel programs [MRR12].
To achieve this, engineers need to revisit the algorithms used for each problem and try
to extract the maximum possible algorithm-level parallelism. If necessary, algorithms that
cannot be parallelized should be replaced by more suitable ones. When doing so, the future
scaling of the problem size should be considered to ensure that the algorithm’s scalability will
increase as well. Based on the granularity of the parallel algorithm and the targeted computer
architectures, the proper parallel programming model should be selected.
Finally, the programmer should study the communication and memory access patterns to
avoid unnecessary OHC that could strangle the scalability of the program.
C HAPTER 3
Domain decomposition methods and
applications to power systems
3.1 Introduction
the problem. However, they lost their appeal as larger memory and better scaling solvers
became available, only to resurface in the age of parallel computing. These methods are
inherently suited for execution on parallel architectures and many parallel implementations
have been proposed on multi-core computers, clusters, GPUs, etc.
In this chapter we overview some of the features that define any given DDM. Then, we
focus on DDMs proposed for dynamic simulations of power systems.
The first step in designing a DDM is to identify the preferred characteristics of sub-domains.
This includes choosing the number of sub-domains, the type of partitioning, and the level of
overlap. Each of these choices depends on a variety of factors, such as the size and type of the
domain, the number of parallel processors, communication cost, and the system’s underlying
dynamics.
The main target when partitioning the domain is to minimize the interfaces between the
sub-domains. This will allow for lower communication requirements and a simpler handling
of interface variables. In PDE problems, where DDMs have been mostly applied, the decom-
position is usually based on the geometrical data and the order of the discretization scheme
used [Saa03, TW05]. Conversely, in DAE/ODE problems (such as the one under considera-
tion in this work), no a priori knowledge of the coupling variables is available since there are
no regular data dependencies (such as those defined by geometric structures). In several
cases, the so-called dependency matrix (D) can be used. For a system with N equations
and N unknown variables, D is an NxN matrix with D (i, j) = 1, if the i-th equation involves
the j-th variable, and D (i, j) = 0 otherwise. However, each system model can be composed
of several sub-models which are sometimes hidden, too complex, or used as black boxes.
Hence, an automatic calculation of D is not trivial to implement [GTD08].
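As a purely illustrative example, consider three equations f_1(x_1, x_3) = 0, f_2(x_2) = 0 and f_3(x_1, x_3) = 0. The corresponding dependency matrix is

D = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}

and its block pattern immediately suggests grouping {x_1, x_3} and {x_2} into separate (here even fully decoupled) sub-domains.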
Moreover, in so far as the sub-problems obtained from a given partitioning are solved in
parallel, there are some restrictions regarding the type of partitioning needed. For example,
the DAE model of a power system component (e.g. synchronous machine, wind turbine, etc.)
has a dense dependency matrix. Thus, the component should not be split between sub-
domains as it will create a very strong connection between different sub-domains, increase
the communication cost, and decrease the efficiency of the DDM.
3.2. DDM CHARACTERISTICS 41
Figure 3.1: Partitioning of a domain into four sub-domains (SD1–SD4)
Overall, decomposition schemes for DAE/ODE systems have to rely on problem specific
techniques which require good knowledge of the underlying system, the models composing
it, and the interactions between them. A bad selection of the system partitioning can lead to
the DDM solution not converging or converging very slowly.
After the partitioning, the variables of each sub-domain can be classified into three categories:

1. interior variables which are coupled only through equations local to the sub-domain (x_i^int);

2. local interface variables which are coupled through both local and non-local (external to the sub-domain) equations (x_i^ext); and,

3. external interface variables that belong to other sub-domains and are coupled through local equations (x_j^ext).
In general, standard methods for treating PDE, ODE, or DAE problems are employed to
solve each sub-system. For example, to solve DAE systems coming from dynamic simula-
tions of power systems, the techniques presented in Section 1.2 can be used.
Next, it is decided whether the sub-problems will be solved approximately or exactly be-
fore exchanging information (updating the interface values) with other sub-domains. This
decision always leads to a compromise between numerical convergence speed and data ex-
change rate. If the interface values are frequently exchanged, global convergence is usually
better since the sub-domain solution methods always use recent values of interface variables.
However, this leads to higher data exchange rate. If the interface values are infrequently ex-
changed, there is lower data exchange rate. Nevertheless, the global convergence might
degrade since the sub-domain solution methods use older interface values.
Usually, when the sub-domains are weakly connected, it is better to solve each sub-problem accurately and to update the interface values less often. This kind of partitioning, though, might be very
difficult, or even impossible, to obtain. So, updating the interface values more often might be
required to improve the convergence rate.
Figure 3.2: Flowchart of the Schur-complement method: the local linear systems of SDi (i = 1, ..., M) are updated and their interior variables eliminated by parallel threads; the reduced system is then built and solved with the most recent interface values to obtain the interface variables.
When applying the Schur-complement method, also called iterative sub-structuring, the original physical domain is split into non-overlapping sub-domains. Next, an iterative method (e.g. Newton's method) is used to solve the sub-problems by formulating a local linear system for each sub-domain. The Schur-complement technique is a procedure to eliminate the interior variables of each sub-domain and derive a global linear system, reduced in size, involving only the interface variables. This reduced system can be solved to obtain the interface variables [Saa03].
Once the interface values are known, the sub-problems are decoupled and the remaining unknowns, interior to each sub-domain, can be computed independently. In many cases,
building and solving the reduced system involves high computational cost. Many methods
have been proposed to speed up the procedure, such as approximately solving the re-
duced system [SS99], assembling the matrix in parallel using the “local” Schur-complements
[Saa03], using Krylov solvers [GTD08], or exploiting the structure of the decomposition to
simplify the problem.
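As a generic sketch with hypothetical block notation, consider two sub-domains with interior unknowns x_1, x_2 and shared interface unknowns y; the linearized problem and its reduction then read:

\begin{bmatrix} A_1 & 0 & E_1 \\ 0 & A_2 & E_2 \\ F_1 & F_2 & G \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ y \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ c \end{bmatrix}
\quad \Longrightarrow \quad
\big( G - F_1 A_1^{-1} E_1 - F_2 A_2^{-1} E_2 \big)\, y = c - F_1 A_1^{-1} b_1 - F_2 A_2^{-1} b_2

Once the reduced system has been solved for y, each x_i = A_i^{-1}(b_i − E_i y) can be recovered independently, which is the step that is performed concurrently in Fig. 3.2.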
In general, the Schur-complement method can be described by the flowchart of Fig. 3.2
(where M is the number of sub-domains). This method can be easily parallelized as sketched
in the same figure. Operations without data dependencies, such as the formulation and
update of the local linear systems, the elimination of the interior variables and the solution of
the local systems can be performed concurrently. Nevertheless, the solution of the reduced
system introduces an unavoidable sequential bottleneck in the algorithm, which can hinder
its scalability (see Section 2.5.2).
As noted in Section 3.1, among the simplest and oldest techniques are the Schwarz Alter-
nating procedures. In general, these methods can be described by the flowchart of Fig. 3.3.
These methods work by freezing the external interface variables during the solution of
each sub-domain, thereby making the sub-problems totally decoupled. Hence, no reduced
system has to be formulated and solved. This formulation is more attractive for parallel
implementations since the solutions of the sub-domains can be performed in parallel and
information exchange happens only when updating the interface values and checking global
convergence. Other variants of this method can be found in the literature depending on how
often and in which order the interface variables are updated, for instance the Additive or
Multiplicative Schwarz procedures [Saa03].
Although these algorithms allow for a more coarse-grained¹ parallelization and have a
smaller data exchange rate than the Schur-complement ones, the speed of global conver-
gence of the system can suffer if certain aspects are not taken into consideration [BK99,
PLGMH11]. For example, even if the method converges for any chosen partitioning, the
choice of the partitioning has a great influence on the rate of the convergence: tightly cou-
pled sub-domains can initiate many iterations as the interface values change strongly and,
¹ Parallel applications are often classified according to how often their subtasks need to synchronize or com-
municate with each other. An application exhibits fine-grained parallelism if its subtasks must communicate many
times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second, and
it is embarrassingly parallel if they rarely or never have to communicate. Embarrassingly parallel applications
are considered the easiest to parallelize and in this category belongs the simulation of several independent
contingencies during a dynamic security assessment.
eventually, slow down the procedure. Choosing carefully the partitioning scheme and the
initial values, using preconditioners and overlapping sub-domains can remedy some of these
problems. Other, problem specific, considerations involve choosing the right interface vari-
able transmission conditions and discretization method [Woh01, TW05, CRTT11].
3.3 Existing approaches in power system dynamic simulations

3.3.1 Partitioning
The choice of the decomposition plays an important role in the speed of convergence, the load balancing among the parallel tasks, and the overall performance of the DDM. As discussed in
Section 3.2, the automatic partitioning of systems described by DAEs is not trivial. Several
partitioning schemes have been proposed in power system dynamic simulations.
Some partitioning methods are based on coherency analysis [Pod78, HR88, KRF92,
YRA93, JMD09] and take into consideration the dynamic behavior of the system to en-
sure that the resulting sub-domains are “weakly” coupled. However, this type of partition-
ing is strongly dependent on the post-fault power system topology and the initial operating
point. It is also computationally demanding to update the coherent groups after major system
changes.
Other algorithms try to partition the power system in order for the admittance matrix of
the resulting sub-networks to be in block border diagonal form (see Fig. 3.5). Such methods
are factorization path tree partitioning [Cha95], tearing using simulated annealing [IS90],
clustering by contour tableau [SVCC77], and seed nodes aggregating [VFK92].
Finally, some methods [MQ92, CRTT11, AF12, ASC13] build a graph representation of
the system and try to partition it with the objective of minimizing the edge cuts while having
balanced sub-domains.
Figure 3.4: Classification of DDMs in power system simulations: fine-grained methods (based on the linear system) exhibit a high rate of communication among threads and a small amount of work in parallel; coarse-grained methods (based on the topology) either show a small rate of communication among threads, a large amount of work in parallel and a high sensitivity to the partition scheme, or an average rate of communication, an average amount of work in parallel and less sensitivity to the partition scheme.
Figure 3.5: Block bordered diagonal form matrix for power system dynamic simulation
With the huge developments in parallel computing in the 1980s, researchers started investigat-
ing the use of parallel resources to accelerate the solution of sparse linear systems, present
in many engineering problems. Several parallel solvers have been developed since then,
either general-purpose or customized to address a specific matrix structure.
In power system dynamic simulations, decomposition methods have been used to
identify and exploit the particular structure of sparse matrices and efficiently parallelize the
solution procedure. Some methods, like parallel VDHN [CB93], Newton W-matrix [YXZJ02]
and parallel LU [Cha01a, CDC02], divide the independent vector and matrix operations in-
volved in the linear system solution over the available computing units. These methods solve
the exact linear system (e.g. Eq. 1.15) but in parallel. Other methods, like parallel successive
over relaxed Newton [CZBT91] and Maclaurin-Newton [CB93], use an approximate (relaxed)
Jacobian matrix with more convenient structure for parallelization. Afterwards, several meth-
ods were proposed inspired by different hardware platforms and parallel computing memory
models [TC95, SXZ05, FX06, JMD10].
While the fine-grained parallelization methods provide some speedup, they don’t exploit
the full parallelization potential of power system dynamic simulations. First, only paralleliza-
tion opportunities deriving from linear algebra operations are exploited; leaving the proce-
dures performing the discretization and algebraization of equations, the treatment of discrete
events, the convergence check, etc. in sequential execution. Furthermore, as these methods
handle the sparse matrix directly, it is really hard to perform partial system updates or exploit
the localized response of power systems to certain events (such acceleration techniques will
be presented in Chapters 4 and 5). Finally, due to the fine-grain partitioning of tasks there
is a high rate of communication among the different parallel workers, leading to increased
overhead costs. For these reasons, coarse-grained DDMs were preferred in our work.
In coarse-grain methods the decomposition is not projected to the linear system but to the
power system DAEs of Eq. 1.1. This leads to the formulation of the DAE sub-systems de-
scribing the problem in the resulting sub-domains. Then, the solutions of the decomposed
sub-systems are obtained using a Schwarz or Schur-complement method for treating the
interface variables, as described in Section 3.2.
Historically, Schwarz methods were preferred for almost all coarse-grain DDMs for power
system dynamic simulations. These schemes are easier to formulate and solve, they avoid
the sequentiality of Schur-complement methods, and their low rate of communication leads
to decreased overhead costs. However, their performance is very sensitive to the particular
partitioning and initial operating point.
The first to envisage an application to power systems was probably Kron with the diakoptics method [Kro63, Hap74, Ait87]. At the time it was first proposed (in the 1960s), parallel computing was not an option and the target was to address memory issues, but this method provided the spark for many of the parallel methods to follow.
First, the network (domain) is torn (decomposed) to create sub-networks (sub-domains). The partitioning is performed on a graph representation of the network, by cutting over connecting lines (edges of the graph). Two types of tearing exist [Hap74]: i) the resulting sub-domains form a radial network when the cut lines are removed; and, ii) the resulting sub-domains are no longer connected when the cut lines are removed. Then, the resulting
sub-problems are solved independently as if they were completely decoupled. Finally, their
solutions are combined to yield the solution of the original problem.
Diakoptics belongs to the family of parallel-in-space methods: the formulation of the problem at a single time instant is decomposed and solved. Moreover, the interface variables are updated through a Schwarz approach. The performance of this method degrades severely as the network to be partitioned becomes more meshed.
Another family of methods is the parallel-in-time [Alv79, LBTT90, LST91, LSS94, ILM98].
These introduced the idea of exploiting parallelization in time and have their origins in similar
techniques proposed for the solution of ODE problems [Nie64]. So, despite the sequential
character of the initial value problem which stems from the discretization of differential equa-
tions, these techniques propose the solution of multiple consecutive time instants in parallel.
Thus, the DAE systems describing the problem at each instant are solved concurrently. The
interface variables exchanged refer to the transfer of the values needed by the discretization
scheme from one time instant to the next, as shown in Fig. 3.6. For example, with a two-
step discretization scheme information from the two previous time instants would be always
needed.
Figure 3.6: Parallel-in-time solution: processors P1–P4 solve consecutive time instants, exchanging the values required by the discretization scheme
Later on, the Waveform Relaxation (WR) [ISCP87, CIW89, CI90] method proposed to de-
compose the system in space and solve each sub-domain for several time instants before
exchanging the interface values. Hence, these values consist of waveforms from neighboring
sub-systems (i.e. a sequence of interface values over a number of consecutive time instants).
After each solution, waveforms are exchanged between neighboring sub-systems, and this
process is repeated until global convergence. Thus, these schemes follow a Schwarz ap-
proach to treat the interface variables.
Consider the DAE initial value problem of (1.1) solved over the time window [t0 , t0 + TW ]:
Γẋ = Φ(x, V )
0 = Ψ(x, V )
x(t0 ) = x0 , V (t0 ) = V0
Decomposing the power system into M sub-domains yields M sets of equations (3.1), each one describing the problem inside its partition.
The variables with superscript k + 1 in the sub-sets are unknowns and have to be calcu-
lated over the window [t0 , t0 + TW ], while the variables with superscript k are known from the
previous WR iteration and make up the interface variables exchanged between the systems
at the end of each iteration.
For the solution of the system (3.1) two algorithms were proposed [CI90]: Gauss-Jacobi
WR Algorithm and Gauss-Seidel WR Algorithm (see Fig. 3.7). The former uses boundary
conditions only from the previous iteration while the latter uses the most recent boundary
conditions available. This way, the first one is inherently suitable for parallel implementation,
while the latter for sequential. The convergence of these algorithms strongly depends on
the partitioning and on the window time interval TW [PLGMH11]. Their properties have been
thoroughly investigated in [CIW89, BDP96, BK99, GHN99, JW01, EKM08, CRTT11] and
some techniques to accelerate this method in [PLGMH11].
If the time interval [t0, t0 + TW] of the WR is restricted to only one time step, the Instantaneous Relaxation (IR) method [JMD09] is obtained. However, as a Schwarz method, the convergence properties of IR heavily depend on the proper selection of the partition. Fur-
thermore, in the proposed IR method, each sub-system is solved until convergence but the
global convergence of the system is not checked before proceeding to the next time instant.
This is made possible due to the small time steps used for the simulation (~1 ms), but could
lead to inaccuracies. An extension of this method was proposed in [JMZD12], combining both
coarse-grained and fine-grained parallelization in a nested way to increase performance.
Finally, methods such as parallel-in-time-and-space [LBTC90], Parareal-Waveform Re-
laxation [CM11], and Two-Stage Parallel Waveform Relaxation [LJ15] propose the decompo-
sition of the system in space and in time, to exploit a higher level of parallelization and further
speed-up the simulations.
The DDMs proposed in this thesis are coarse-grained and, unlike most of the algorithms proposed until now, make use of the Schur-complement approach to treat the interface variables. This approach makes the algorithm performance less dependent on the selected partition and allows a high convergence rate to be achieved. These methods are detailed in Chapters 4 and 5.
3.4 Summary
DDMs have been extensively used in engineering for the solution of complex problems de-
scribed by PDEs, ODEs, and DAEs. The various proposed methods might differ in the type of
domain partitioning, the procedure to solve the sub-problems formulated over the partitions,
and the way of handling the interface variables. However, the common factor is that they try
to exploit parallel computational resources and accelerate the simulation procedure.
A classification of DDMs applied for power system dynamic simulations has been pro-
posed in this chapter followed by a brief description of these methods. Even though several
parallel DDMs have been proposed over the years, to our knowledge, industry-grade software does not employ such methods. We believe that the main reason for this absence is the
lack of robustness in the solvers and the need for the user to go through a highly complex
procedure of parameter and partition selection. Most of the proposed DDMs heavily depend
on the selected partition scheme and the latter can easily lead to convergence problems or
inaccuracy.
C HAPTER 4
Parallel Schur-complement-based
decomposition method
4.1 Introduction
Figure 4.1: Decomposition of the power system into the network sub-domain and the injector/twoport sub-domains
more accurate and robust. Finally, the theoretical basis is set to analyze the convergence of
the proposed algorithm and how it is affected by the localization techniques.
The first step in applying a DDM is to select the domain partitioning. For the proposed al-
gorithm, the electric network is first separated to create one sub-domain by itself. Then,
each component connected to the network is separated to form the remaining sub-domains.
The components considered in this study refer to devices that either produce or consume
power in normal operating conditions and can be attached to a single bus (e.g. synchronous
machines, motors, wind-turbines, etc.) or on two buses (e.g. HVDC lines, AC/DC convert-
ers, FACTS, etc.). Hereon, the former components will simply be called injectors and the latter twoports. In the following material, twoports will be explicitly referred to only when their treatment differs from that of the injectors; otherwise, only injectors will be mentioned.
Components with three or more connecting buses have not been considered in this work;
however, such components can be treated as a combination of twoports. For instance, a
component connected to three buses can be treated as three twoports. Nevertheless, such
components are rarely used in phasor mode power system dynamic simulations and are not
present in the test systems considered in Section 1.3.
The proposed decomposition can be visualized in Fig. 4.1. The scheme chosen reveals
a star-shaped, non-overlapping, partition layout. At the center of the star, the network sub-domain has interfaces with many smaller sub-domains, while the latter interface only with the network sub-domain and not with each other. As will be seen later on, this type of partitioning
facilitates and simplifies the use of the Schur-complement approach to treat the interface
variables. Based on this partitioning, the problem described by Eq. 1.1 is decomposed as
follows.
The network sub-domain is described by the algebraic equations:
0 = \Psi(x^{ext}, V), \qquad x^{ext}(t_0) = x_0^{ext}, \quad V(t_0) = V_0 \qquad (4.1)
while the sub-problem of each injector can be described by a DAE IVP (i = 1, ..., N):

\Gamma_i \, \dot{x}_i = \Phi_i(x_i, V), \qquad x_i(t_0) = x_{i0} \qquad (4.2)
where x_i and Γ_i are the projections of x and Γ, defined in (1.1), on the i-th sub-domain. Furthermore, the variables of each injector x_i are separated into interior (x_i^int) and local interface (x_i^ext) variables, and the network sub-domain variables V are separated into interior (V^int) and local interface (V^ext) variables. The external interface variables of the network are the rectangular components of the injector currents (see Section 1.2.6), while the external interface variables of each injector are the rectangular voltage components of the connection bus.
An important benefit of this decomposition is the modeling modularity added to the sim-
ulation software and the separation of the injector modeling procedure from the solver. The
predefined, standardized interface between the network and the injectors permits the easy addition or modification of an injector model.
In the network equations, the external interface variables are the rectangular components of the injector currents (I_i), and C_i is a trivial matrix with zeros and ones whose purpose is to extract these interface variables from x_i.
Then, the injector DAE sub-systems (4.2) are algebraized using a differentiation formula (in this work the second-order BDF) to get the corresponding non-linear algebraized systems:

0 = f_i(x_i, V), \qquad i = 1, \ldots, N \qquad (4.4)
Next, at each discrete time instant tn , each of the N + 1 sub-systems is solved using a
Newton method. Thus, at the k-th Newton iteration the following linear systems are solved:
D\, \Delta V^k - \sum_{i=1}^{N} C_i\, \Delta x_i^k = -g(x^{k-1}, V^{k-1}) \qquad (4.5)

A_i\, \Delta x_i^k + B_i\, \Delta V^k = -f_i(x_i^{k-1}, V^{k-1}), \qquad i = 1, \ldots, N \qquad (4.6)
which can be detailed as (the iteration superscripts have been ignored for better legibility):
\underbrace{\begin{bmatrix} D_1 & D_2 \\ D_3 & D_4 \end{bmatrix}}_{D}
\underbrace{\begin{bmatrix} \Delta V^{int} \\ \Delta V^{ext} \end{bmatrix}}_{\Delta V}
- \underbrace{\begin{bmatrix} 0 \\ \displaystyle\sum_{i=1}^{N} \tilde{C}_i \, \Delta x_i^{ext} \end{bmatrix}}_{\sum_{i=1}^{N} C_i \Delta x_i}
= - \underbrace{\begin{bmatrix} g^{int}(V^{int}, V^{ext}) \\ g^{ext}(V^{int}, V^{ext}, x^{ext}) \end{bmatrix}}_{g} \qquad (4.7)

\underbrace{\begin{bmatrix} A_{1i} & A_{2i} \\ A_{3i} & A_{4i} \end{bmatrix}}_{A_i}
\underbrace{\begin{bmatrix} \Delta x_i^{int} \\ \Delta x_i^{ext} \end{bmatrix}}_{\Delta x_i}
+ \underbrace{\begin{bmatrix} 0 \\ \tilde{B}_i \, \Delta V^{ext} \end{bmatrix}}_{B_i \Delta V}
= - \underbrace{\begin{bmatrix} f_i^{int}(x_i^{int}, x_i^{ext}) \\ f_i^{ext}(x_i^{int}, x_i^{ext}, V^{ext}) \end{bmatrix}}_{f_i} \qquad (4.8)
where A_{1i} (resp. D_1) accounts for the coupling between the sub-domain's interior variables; A_{4i} (resp. D_4) expresses the coupling between the local interface variables; A_{2i} and A_{3i} (resp. D_2 and D_3) represent the coupling between the local interface and the interior variables; and B̃_i (resp. C̃_j) describes the coupling between the local interface variables and the external interface variables of the adjacent sub-domains.
The iterative solution of the above Newton equations gives the values V (tn ) and x(tn ).
However, it can be seen that these solution steps are coupled through the interface variables
V ext , xext and cannot be solved independently. Thus, a Schur-complement approach is
used at each iteration to build and solve a reduced system to obtain the interface variables.
Once these are computed, the treatment of the N injector sub-domains is decoupled and
can be performed independently.
The interior variables of each injector sub-domain are eliminated from (4.8) using the local Schur complement:

S_i \, \Delta x_i^{ext} + \tilde{B}_i \, \Delta V^{ext} = -\tilde{f}_i, \qquad S_i = A_{4i} - A_{3i} A_{1i}^{-1} A_{2i}, \qquad \tilde{f}_i = f_i^{ext} - A_{3i} A_{1i}^{-1} f_i^{int} \qquad (4.9)

The interior network variables V^{int} could be eliminated in the same way, which would shrink the reduced system but destroy its sparsity; keeping them enlarges the reduced system by the size of V^{int} but retains sparsity. In this work, this second option was chosen as it allows the use of fast sparse linear solvers for the solution of the reduced system. So, the reduced system to be solved, with the previous considerations, takes on the form:
\begin{bmatrix}
S_1 & 0 & 0 & \cdots & 0 & \tilde{B}_1 \\
0 & S_2 & 0 & \cdots & 0 & \tilde{B}_2 \\
0 & 0 & S_3 & \cdots & 0 & \tilde{B}_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & D_1 & D_2 \\
-\tilde{C}_1 & -\tilde{C}_2 & -\tilde{C}_3 & \cdots & D_3 & D_4
\end{bmatrix}
\begin{bmatrix}
\Delta x_1^{ext} \\ \Delta x_2^{ext} \\ \Delta x_3^{ext} \\ \vdots \\ \Delta V^{int} \\ \Delta V^{ext}
\end{bmatrix}
= -
\begin{bmatrix}
\tilde{f}_1 \\ \tilde{f}_2 \\ \tilde{f}_3 \\ \vdots \\ g^{int} \\ g^{ext}
\end{bmatrix} \qquad (4.10)
Due to the star layout of the decomposed system, the resulting global Schur-complement matrix is in the so-called block bordered diagonal form. By exploiting this structure, which is characteristic of the star-shaped decomposition, the interface variables of the injector sub-domains can be eliminated so that only the variables associated with the network sub-domain remain. This results in the simplified, sparse, reduced system:
\underbrace{\begin{bmatrix} D_1 & D_2 \\ D_3 & D_4 + \displaystyle\sum_{i=1}^{N} \tilde{C}_i S_i^{-1} \tilde{B}_i \end{bmatrix}}_{\tilde{D}}
\underbrace{\begin{bmatrix} \Delta V^{int} \\ \Delta V^{ext} \end{bmatrix}}_{\Delta V}
= - \underbrace{\begin{bmatrix} g^{int} \\ g^{ext} + \displaystyle\sum_{i=1}^{N} \tilde{C}_i S_i^{-1} \tilde{f}_i \end{bmatrix}}_{\tilde{g}} \qquad (4.11)
The reduced system (4.11) can be solved efficiently using a sparse linear solver to acquire
∆V . Then, the network sub-domain interface variables are backward substituted in (4.8) and
the latter are used to compute the variables ∆xi of each injector sub-domain independently.
The nonzero structure of the elimination terms C̃i Si−1 B̃i depends on the number of buses
the injector is attached to. If it is an injector attached to a single bus, the elimination con-
tributes four elements; the interfacing variables are the two injector current components
(i x , iy ) and the two bus voltage components (v x , vy ). This term modifies only four, already
non-zero, elements of sub-matrix D4 , thus retaining its original sparsity pattern. This is
shown in Fig. 4.2a for an injector attached to the k-th bus.
On the contrary, if the injector is attached to two buses (twoport), the elimination term
C̃i Si−1 B̃i contributes with 16 non-zero elements: eight of them modifying already non-zero
elements of sub-matrix D4 and the other eight creating a fill-in connecting the two buses.
Of course, if the two buses were already connected, for instance with a line or transformer,
the aforementioned elements are already non-zero and no fill-in terms are created by the
Schur-complement terms. This is shown in Fig. 4.2b for a twoport attached to the k-th and
l-th buses.
Even though the twoports might introduce fill-in terms to D̃, the computational burden of
solving (4.11) is not significantly affected by them. First, the number of twoports in com-
mon power system models is limited compared to their overall size, thus the percentage of
fill-ins is also small. Moreover, only sub-matrix D4 is affected by the fill-ins leaving the re-
maining sub-matrices unaffected. Finally, the sparsity pattern of matrix D̃ remains the same
throughout the entire simulation, thus the sparse matrix structure analysis (optimal ordering)
is performed only once.
Similarly, the mismatch correction term C̃_i S_i^{-1} f̃_i contributes only two elements to g^ext for an injector attached to one bus, or four elements for a twoport.
For the sake of generality and to create a link to publications [AFV13a, AV13, AV14b],
Eq. 4.11 can be rewritten as:
\Big( D + \sum_{i=1}^{N} C_i A_i^{-1} B_i \Big) \Delta V^k = - g(x^{k-1}, V^{k-1}) - \sum_{i=1}^{N} C_i A_i^{-1} f_i(x_i^{k-1}, V^{k-1}) \qquad (4.12)
where:

C_i A_i^{-1} B_i =
\begin{bmatrix} 0 & 0 \\ 0 & \tilde{C}_i \end{bmatrix}
\begin{bmatrix} A_{1i} & A_{2i} \\ A_{3i} & A_{4i} \end{bmatrix}^{-1}
\begin{bmatrix} 0 & 0 \\ 0 & \tilde{B}_i \end{bmatrix}
=
\begin{bmatrix} 0 & 0 \\ 0 & \tilde{C}_i \big( A_{4i} - A_{3i} A_{1i}^{-1} A_{2i} \big)^{-1} \tilde{B}_i \end{bmatrix}
=
\begin{bmatrix} 0 & 0 \\ 0 & \tilde{C}_i S_i^{-1} \tilde{B}_i \end{bmatrix}
and:
\begin{bmatrix} A_{1i} & A_{2i} \\ A_{3i} & A_{4i} \end{bmatrix}^{-1}
=
\begin{bmatrix}
\big( A_{1i} - A_{2i} A_{4i}^{-1} A_{3i} \big)^{-1} & -A_{1i}^{-1} A_{2i} \big( A_{4i} - A_{3i} A_{1i}^{-1} A_{2i} \big)^{-1} \\
-\big( A_{4i} - A_{3i} A_{1i}^{-1} A_{2i} \big)^{-1} A_{3i} A_{1i}^{-1} & \big( A_{4i} - A_{3i} A_{1i}^{-1} A_{2i} \big)^{-1}
\end{bmatrix}
While both Eqs 4.11 and 4.12 are equivalent, the latter doesn’t make any assumption
on the ordering of the interior and interface variables. Hence, the elimination of the interior
variables shown in (4.9) is implied through the proper selection of matrices Ci and Bi .
Figure 4.3: Flowchart of the proposed parallel algorithm, with its main steps grouped into BLOCKs A, B, C and D
First, this technique is used within the solution of one discretized time instant to stop the computations of injectors (resp. of the reduced system) that have already been solved to the desired tolerance.
That is, after one decomposed solution of (4.5) and (4.6), the convergence of each injector
and the reduced system is checked individually. If the convergence criterion is satisfied,
then the specific sub-systems are flagged as converged. For the remaining iterations of the
current time instant, the sub-system is not solved, although its mismatch, computed with (4.4)
or (4.11), is monitored to guarantee that it remains converged. This technique decreases the
computational effort within one discretized time instant without affecting the accuracy of the
solution (see Sections 4.7 and 4.9). Thus, BLOCKS B and C in Fig. 4.3 are replaced by the
block in Fig. 4.4.
From a mathematical point of view, the state corrections of converged injectors are set to
zero (∆xi = 0), thus the RHS fi of (4.6) and the sensitivity to voltage deviations Bi are set
to zero. At the same time, for a converged reduced system the voltage corrections are set to
zero (∆V = 0), thus in Eq. 4.11 the RHS g̃ is set to zero.
Taking advantage of the fact that each sub-domain is solved by a separate quasi-Newton
method (in this work, the VDHN presented in Section 1.2.5.2), the sub-system update criteria
are decoupled and their local matrices (such as Ai , Bi , D), as well as their Schur-comple-
ment terms, are updated asynchronously. In this way, sub-domains which converge fast keep
the same local system matrices for many iterations and even time-steps, while sub-domains
which converge slower update their matrices more frequently. Thus, BLOCK A in Fig. 4.3 is
replaced by the block in Fig. 4.5.
Traditional update criteria of VDHN methods are reused to trigger the sub-system ma-
trices computation. That is, if the sub-system has not converged after five iterations of the
algorithm presented in Section 4.5, the local matrices are updated. Moreover, an update
of the matrices is triggered if a change of the sub-system equations, caused by a discrete
event, is detected. Of course, after a severe event in the system (such as a short-circuit, the
tripping of a generator, etc.) or when the time step used for the discretization of Eqs. 4.2 is
changed, an update of all the matrices and the Schur-complement terms is forced to avoid
convergence problems.
4.6.3 Latency
Localization is also exploited over several time steps by detecting, during the simulation, the
injectors marginally participating in the system dynamics (latent) and replacing their dynamic
models (4.2) with much simpler and faster to compute sensitivity-based models. At the same
time, the full dynamic model is used if an injector exhibits significant dynamic activity (active).
The two models are shown in Fig. 4.6.
The sensitivity-based model is derived from the linearized Eqs. 4.6 when ignoring the internal dynamics, that is f_i(x_i^{k-1}, V^{k-1}) ≃ 0, and solving for the state variation ∆x_i:

\Delta I_i = E_i\, \Delta x_i = -E_i A_i^{-1} B_i\, \Delta V = G_i\, \Delta V \qquad (4.13)
where Ei (similarly to Ci ) is a trivial matrix with zeros and ones whose purpose is to extract
the injector current variations from ∆xi and Gi is the sensitivity matrix relating the current
with the voltage variation.
Selecting an arbitrary instant t*, the linear relation (4.13) can be rewritten as:

I_i(t) = I_i(t^*) + G_i \big( V(t) - V(t^*) \big) \qquad (4.14)
The linear model (4.14) is a valid estimate of the full dynamic model (4.2) only when the
injector shows low dynamic activity (thus, the RHS of Eq. 4.6 can be neglected) and only for
small deviations around the linearization point (thus, Gi can be considered constant). How-
ever, as it will be shown later on, this technique can introduce some error into the simulated
response.
It is important to note that the Schur-complement term contributed by the linear model (4.14) to matrix D̃ of Eq. 4.11 is the same as the one of model (4.6). This means that switching from one model to the other does not require recomputing and factorizing the Schur-complement matrix D̃.
The essence of the algorithm lies in its ability to detect the switching of injectors from active to
latent, and conversely. During the dynamic simulation the state vector values x(t j ) and V (t j )
are known for t1 , ..., t j , ..., tn , with tn the last computed discrete time. The injector switching
criteria have to be robust and based only on currently available information. Furthermore, as
the algorithm aims for higher simulation performance, the criteria computations need to be
fast and use as little memory as possible.
Since the injectors interact with the network and between them through the current and
voltage changes (see Eq. 4.3), power flow variations can be used as an indication of dynamic
activity. Therefore, the variation of the per-phase apparent power S_i = \sqrt{P_i^2 + Q_i^2} flowing in each injector was naturally selected as the monitoring variable representative of the injector's dynamic activity. Alternatively, more detailed information could be extracted from the active
(Pi ) and reactive (Qi ) powers but at the cost of doubling the computing effort and memory
usage.
Simply stated, an injector is declared latent when its apparent power Si has “not changed
significantly for some time” or, in other words, exhibits small variability. The Si values are
available as time-series samples. Thus, traditional methods for analyzing time series data
can be employed to characterize the variability of Si over a pre-specified, moving, time win-
dow (TL ). This procedure is shown in Fig. 4.7.
The choice of using a moving time window and not the entire history aims at disregarding
the oldest “behavior” of an injector and involving only recently observed dynamics. However,
if the time window is very small, smooth variations (i.e. with low rates of change) may not be
detected.
The main characteristics extracted from the time-series are the sample average value
(Si,av ), variance (Si,var ), and standard deviation (Si,std ). In particular, the standard deviation is
the measure of volatility that shows how much variation or dispersion exists from the average.
A small standard deviation indicates that the data points tend to be very close to the average,
whereas high standard deviation indicates that the data points are spread out over a large
range of values. Consequently, a standard deviation of S_i smaller than a selected tolerance ε_L is an indication that the i-th injector exhibits low dynamic activity and can be considered latent.
After an injector is classified as latent at time t∗ , the apparent power is no longer dictated
by the dynamic model (4.2) but by the linear model (4.14), affected only by the deviation of
the voltage. Therefore, the standard deviation is not reliable to switch the injector back to
active mode since slow voltage changes can gradually “drift” the injector’s operating point
away from the reference without the standard deviation ever increasing. To avoid this, the
absolute deviation of the apparent power from its reference value Si (t∗ ) is used. If the abso-
lute variation is larger than the same tolerance ε_L, meaning that the model has moved away
from the linearization point, the injector returns to active mode.
Large power systems may involve thousands of injectors, making it inefficient to keep all
injector apparent powers in memory over the moving time window. For example, a small time
window TL = 5 s contains 250 points when the system is simulated with a time step of 20 ms. Thus,
in the Hydro-Québec system of Section 1.3, which has 4601 injectors, this translates to more
than one million points that need to be kept in memory and updated at every time step.
Furthermore, calculating the exact average and standard deviation values of the time-win-
dow samples at each time-step leads to complex bookkeeping and time consuming compu-
tations. To avoid this, an approximation is considered, which originates from real-time digital
signal processing where computing and memory resources are scarce. When a new sample
Si (tn ) is available, a moving average value and standard deviation can be computed with a
repeated application of the Exponential Moving Average (EMA) [Mul01]. This is a weighted
moving average with the weights decreasing exponentially. Each sample is valued some
percent smaller than the next more recent sample. With this constraint the moving average
can be computed very efficiently.
There are several definitions of the EMA [Eck13]. Since the time step $h_n = t_n - t_{n-1}$
between the samples can vary, the weight of each sample changes according to the fraction
of the time step over the observation time window $T_L$. Thus, the EMA operator is defined
recursively as [Mul01]:

$$S_{i,av}(t_n) = \lambda_1 S_{i,av}(t_{n-1}) + (1-\lambda_1)\, S_i(t_n) + (\lambda_1-\lambda_2)\left[ S_i(t_n) - S_i(t_{n-1}) \right] \quad (4.15)$$

where the last two values of $S_i$ are used, $\alpha = h_n / T_L$ and $\lambda_1 = e^{-\alpha}$. The variable $\lambda_2$ depends
on the series type and the interpolation method selected:

$$\lambda_2 = \begin{cases} \dfrac{1-\lambda_1}{\alpha} & \text{for linear interpolation} \\ 1 & \text{for taking the preceding series value} \\ \sqrt{\lambda_1} & \text{for taking the nearest series value} \\ \lambda_1 & \text{for taking the subsequent series value} \\ \lambda_1 & \text{for equally spaced discrete time series (no interpolation)} \end{cases} \quad (4.16)$$

In this work, it is assumed that the value of $S_i$ varies linearly between any two successive
time steps, as shown in Fig. 4.8, thus $\lambda_2 = \dfrac{1-\lambda_1}{\alpha}$.

The exponential moving variance can be obtained as [HWYC09]:

$$S_{i,var}(t_n) = \lambda_1 S_{i,var}(t_{n-1}) + (1-\lambda_1)\, \Delta S_i^2(t_n) \quad (4.17)$$

or, using the same reasoning as for formula (4.15) [Mul01]:

$$S_{i,var}(t_n) = \lambda_1 S_{i,var}(t_{n-1}) + (1-\lambda_1)\, \Delta S_i^2(t_n) + (\lambda_1-\lambda_2)\left[ \Delta S_i^2(t_n) - \Delta S_i^2(t_{n-1}) \right] \quad (4.18)$$
Figure 4.9: Si and Si,av of a synchronous machine after a fault
where $\Delta S_i(t_n) = S_i(t_n) - S_{i,av}(t_n)$. Both formulas were tested in RAMSES and no significant
differences were found; Eq. 4.18 is used in the investigations that follow.
Finally, the standard deviation can be obtained as:

$$S_{i,std}(t_n) = \sqrt{S_{i,var}(t_n)} \quad (4.19)$$
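Because formulas (4.15)-(4.19) are recursive, each injector only needs to store a few scalars between time steps instead of the whole window of samples. A minimal sketch of such an update in C, assuming linear interpolation between samples (the names and data structure are illustrative and are not taken from RAMSES), could look as follows:

```c
#include <math.h>

/* Per-injector bookkeeping for the exponential moving average and variance.
 * The fields should be initialized with the first available sample.        */
typedef struct {
    double S_prev;   /* S_i(t_{n-1}), previous apparent power sample        */
    double avg;      /* S_i,av(t_{n-1}), exponential moving average         */
    double var;      /* S_i,var(t_{n-1}), exponential moving variance       */
    double dS2_prev; /* Delta S_i^2(t_{n-1}), used by Eq. (4.18)            */
} ema_state;

/* One update step following Eqs. (4.15), (4.16) (linear interpolation),
 * (4.18) and (4.19). h is the (possibly varying) time step and TL the
 * observation window. Returns the standard deviation S_i,std(t_n).         */
double ema_update(ema_state *st, double S_new, double h, double TL)
{
    double alpha   = h / TL;
    double lambda1 = exp(-alpha);
    double lambda2 = (1.0 - lambda1) / alpha;   /* linear interpolation     */

    /* Eq. (4.15): moving average using the last two samples                */
    st->avg = lambda1 * st->avg
            + (1.0 - lambda1) * S_new
            + (lambda1 - lambda2) * (S_new - st->S_prev);

    /* Eq. (4.18): moving variance of the deviation from the average        */
    double dS2 = (S_new - st->avg) * (S_new - st->avg);
    st->var = lambda1 * st->var
            + (1.0 - lambda1) * dS2
            + (lambda1 - lambda2) * (dS2 - st->dS2_prev);

    st->S_prev   = S_new;
    st->dS2_prev = dS2;

    return sqrt(st->var);                        /* Eq. (4.19)              */
}
```

Calling such an update once per time step with the new apparent power sample keeps Si,av, Si,var and Si,std available at a constant, small cost per injector.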
This value can be used to assess the volatility and thus the dynamic activity of an injector. For
example, Fig. 4.9 shows the long-term evolution of the apparent power and corresponding
exponential moving average of a synchronous machine, after the occurrence of a three-
phase short-circuit near the generator, which was cleared after five cycles (100 ms).
The observation time window of TL = 5 s used for Fig. 4.9 was found to be the smallest TL
providing smooth averaging. Using smaller time windows with the same time-steps increases
the weight of the new value compared to the older ones. During short-term dynamics, it is
imperative to use the full injector model that can capture the fast transients. However, a
small TL can cause injectors to switch to latent during this period and introduce errors in the
simulated response. In the cases studied later on, a time window of TL = 10 s is used, as
it provides a compromise between the speed of response of Si,std and the consideration of
recent dynamic activity.
The decision for switching between active and latent mode is taken after solving the system
equations for each time step tn . Then, the selected models are used for the computation of
the states at tn+1 . During the DDM iterations, the state of each injector (latent or active) does
not change, as switching could perturb the Newton iterations and cause divergence. The
complete procedure is summarized by the switching algorithm (Algorithm 4.1).
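As a rough illustration of this decision logic (and not a reproduction of Algorithm 4.1 or of the RAMSES implementation), the per-injector test taken after each converged time step could be sketched as:

```c
#include <math.h>

/* Latency status of one injector. */
typedef struct {
    int    latent;   /* 1 if the injector currently uses the linear model (4.14) */
    double S_ref;    /* apparent power S_i(t*) stored at the switching instant   */
    double S_std;    /* S_i,std(t_n), updated with Eqs. (4.15)-(4.19)            */
    double S_now;    /* latest apparent power sample S_i(t_n)                    */
} latency_state;

/* Decide, after time step t_n has converged, which model to use at t_{n+1}.
 * eps_L is the latency tolerance (in MVA). The status is frozen during the
 * Newton iterations of a time step and only revised here.                  */
void latency_decide(latency_state *inj, double eps_L)
{
    if (!inj->latent) {
        /* Active -> latent: small variability of S_i over the window T_L.  */
        if (inj->S_std < eps_L) {
            inj->latent = 1;
            inj->S_ref  = inj->S_now;   /* linearization reference S_i(t*)  */
        }
    } else {
        /* Latent -> active: S_i has drifted away from the reference value. */
        if (fabs(inj->S_now - inj->S_ref) > eps_L)
            inj->latent = 0;
    }
}
```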
An example of the switching procedure is shown in Fig. 4.10. In the upper plot, the Si,std
is shown along with the latency threshold ε_L, here set to 0.1 MVA. At time t = 107 s, the
threshold is crossed and the injector is switched to latent. In the lower plot, it can be seen
that the apparent power of the latent injector remains within the deadband centered on the
linearization point, until t = 125 s when it is switched back to active mode.
If latency is used during the simulation in combination with the previous technique, the
BLOCK C of Fig. 4.3 is replaced by the block of Fig. 4.11.
The parameter ε_L controls the approximation introduced into the simulation (ε_L = 0 re-
sults in a fully accurate simulation). If the power system includes injectors of both very small
and very large powers, then ε_L must remain small to keep the error bounded. On the other
hand, if the system involves similarly sized injectors, then ε_L can be increased without intro-
ducing large errors.
Figure 4.12 shows the apparent power of a synchronous machine with latency tolerances ε_L =
0 MVA and ε_L = 0.1 MVA applied to the entire system. The portions with gray background
denote the intervals during which the pictured generator is latent. The corresponding relative
error on the simulated evolution is shown in Fig. 4.13. It can be seen that the inaccuracy
introduced by latency is negligible, and this is the case for most dynamic simulations.
The injector shown in Fig. 4.12 switches to latent for the first time at t ≈ 130 s. However, other
injectors in the system become latent earlier. This is the reason for the error shown in Fig. 4.13
prior to t ≈ 130 s.
From a mathematical point of view, the effect of latency on the convergence of the simu-
lation will be examined in the next section, while a more practical analysis of the error in-
troduced, the simulation performance, and the proper selection of the parameter ε_L will be given
in Section 4.9.
Figure 4.11: BLOCK C with skipping converged injectors and latency techniques

4.7 Effects of localization techniques on convergence
By using the Schur-complement approach detailed in Section 4.4, an exact solution of Eqs. 4.5-
4.6 is performed at each iteration of the parallel algorithm of Fig. 4.3 (BLOCKS A-D). If these
equations are grouped into a unique linear system, the following equivalent integrated system is
obtained:

$$\begin{bmatrix} A_1 & 0 & 0 & \cdots & B_1 \\ 0 & A_2 & 0 & \cdots & B_2 \\ 0 & 0 & A_3 & \cdots & B_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -C_1 & -C_2 & -C_3 & \cdots & D \end{bmatrix} \begin{bmatrix} \Delta x_1 \\ \Delta x_2 \\ \Delta x_3 \\ \vdots \\ \Delta V \end{bmatrix} = - \begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ g \end{bmatrix} \quad (4.20)$$

Figure 4.12: Apparent power of synchronous machine with and without latency

Figure 4.13: Relative error on the simulated apparent power (ε_L = 0.1 MVA, T_L = 5 s)
This system is the same as the linear system (1.15), solved by a classical simultaneous
solution presented in Section 1.2.5.2, using the same discretization scheme as the DDM
and performing some row and column permutations. That is, the DDM-based solution pre-
sented in this chapter is mathematically equivalent to solving together the set of non-linear,
discretized Eqs. 4.3 and 4.4 with a quasi-Newton method.
This equivalence between the DDM-based and the simultaneous approach allows using
the extensive theory behind quasi-Newton schemes to assess the proposed algorithm’s con-
vergence. Under some well-studied requirements [Bro70, BDM73, DM74, GT82, Mor99], the
iterative Newton schemes converge to the solution at a super-linear rate.
However, the localization techniques modify the equivalent system (4.20). First, the ma-
trices of each injector and of the network are not updated synchronously but according to the
local convergence of each sub-system (see Section 4.6.2). Thus, modifying the system into:
$$\begin{bmatrix} A_1^{k_1} & 0 & 0 & \cdots & B_1^{k_1} \\ 0 & A_2^{k_2} & 0 & \cdots & B_2^{k_2} \\ 0 & 0 & A_3^{k_3} & \cdots & B_3^{k_3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -C_1^{k_{N+1}} & -C_2^{k_{N+1}} & -C_3^{k_{N+1}} & \cdots & D^{k_{N+1}} \end{bmatrix} \begin{bmatrix} \Delta x_1^k \\ \Delta x_2^k \\ \Delta x_3^k \\ \vdots \\ \Delta V^k \end{bmatrix} = - \begin{bmatrix} f_1^k \\ f_2^k \\ f_3^k \\ \vdots \\ g^k \end{bmatrix} \quad (4.21)$$
where $k_i \le k$ ($i = 1, ..., N+1$) is the iteration at which the i-th injector or the network matrices
were last updated. It should be noted that the matrices are often unchanged over several
time instants, not only iterations. Thus, this modified system can be treated as a quasi-Newton
method with a special Jacobian update scheme. The error introduced to the Jacobian by
the asynchronous update of the sub-domain matrices is small; while it can affect the
convergence rate of the iterations, it does not affect the final solution.
Next, by also considering the skip-converged and latency techniques (see Sections 4.6.1
and 4.6.3) the system (4.21) is modified to:
$$\begin{bmatrix} A_1^{k_1} & 0 & 0 & \cdots & B_1^{k_1} \\ 0 & A_2^{k_2} & 0 & \cdots & B_2^{k_2} \\ 0 & 0 & A_3^{k_3} & \cdots & B_3^{k_3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -C_1^{k_{N+1}} & -C_2^{k_{N+1}} & -C_3^{k_{N+1}} & \cdots & D^{k_{N+1}} \end{bmatrix} \begin{bmatrix} \Delta x_1^k \\ \Delta x_2^k \\ \Delta x_3^k \\ \vdots \\ \Delta V^k \end{bmatrix} = - \begin{bmatrix} f_1^k \\ f_2^k \\ f_3^k \\ \vdots \\ g^k \end{bmatrix} + \begin{bmatrix} r_1^k \\ r_2^k \\ r_3^k \\ \vdots \\ r_{N+1}^k \end{bmatrix} \quad (4.22)$$
That is, skipping the solution of a converged injector or the reduced system is equivalent to
setting to zero the off-diagonal blocks and the mismatches of (4.22). In addition, when an
injector is considered latent, it is equivalent to setting ri = fi . Thus, both techniques consist
in setting:
$$r_i^k = \begin{cases} f_i^k & \text{if latent or converged} \\ 0 & \text{otherwise} \end{cases} \qquad B_i^{k_i} = \begin{cases} 0 & \text{if converged} \\ B_i^{k_i} & \text{otherwise} \end{cases}$$

$$r_{N+1}^k = \begin{cases} g^k & \text{if converged} \\ 0 & \text{otherwise} \end{cases} \qquad C_i^{k_{N+1}} = \begin{cases} 0 & \text{if converged} \\ C_i^{k_{N+1}} & \text{otherwise} \end{cases}$$
These changes lead to an inexact Newton scheme, whose convergence properties and
assumptions can be examined as shown in Appendix A.
Comparing the skip-converged and latency techniques, it can be seen that the former
applies stricter criteria and does not affect the final solution of the algorithm. More specif-
ically, in RAMSES, if the i-th injector has converged and is not solved anymore, its mismatch
is still computed at each iteration and $r_i^k$ is updated. If at any iteration $r_i^k$ increases compared
to $r_i^{k-1}$, the injector is solved again. Thus, relating to Eq. A.11, if skip-converged is used
alone, the following condition can be easily checked to ensure convergence:

$$\frac{\| r^k \|}{\| F(y^k) \|} < \eta < 1 \qquad \forall k \ge 0 \quad (4.23)$$
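A sketch of how such a check could be organized in code is given below; the function and variable names are hypothetical and the residual bookkeeping is simplified compared to the actual implementation:

```c
/* Decide whether the i-th injector can be skipped at iteration k, and
 * accumulate the squared norms needed to verify the forcing condition
 * (4.23). r_norm2 and F_norm2 accumulate ||r^k||^2 and ||F(y^k)||^2.       */
int skip_injector(double mismatch_k, double mismatch_km1, int was_converged,
                  double *r_norm2, double *F_norm2)
{
    *F_norm2 += mismatch_k * mismatch_k;

    /* A converged injector is skipped only while its mismatch does not
     * increase; otherwise it is solved again at this iteration.            */
    int skip = was_converged && (mismatch_k <= mismatch_km1);
    if (skip)
        *r_norm2 += mismatch_k * mismatch_k;   /* contribution to r^k       */

    return skip;
}

/* After looping over all injectors, the inexact Newton convergence test of
 * Eq. (4.23) amounts to checking sqrt(r_norm2 / F_norm2) < eta < 1.        */
```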
On the contrary, the latency technique does not rely on numerical criteria but rather on
observations of the component dynamic response. The methods detailed in Section 4.6.3.2
to switch between active and latent injectors are approximate. Although a small standard
deviation of the apparent power output of an injector is an indication of low dynamic activity,
it does not guarantee that the equivalent model used will provide exactly the same results
as the full dynamic model. Especially when a large latency tolerance is used, the error
introduced in the system (4.22) could affect the accuracy of the solution.
Assuming, for illustration, computation costs of $T_A = 9$, $T_B = 1$, $T_C = 8$, and $T_D = 9$ for the
BLOCKS of Fig. 4.3, with BLOCK B treated sequentially and BLOCKS A, C, and D in parallel,
the scalability on four threads can be computed as:

$$\text{Scalability}_4 = \frac{T_S + T_P}{T_S + \frac{T_P}{4}} = \frac{T_B + T_A + T_C + T_D}{T_B + \frac{T_A}{4} + \frac{T_C}{4} + \frac{T_D}{4}} = \frac{1+9+8+9}{1+\frac{9}{4}+\frac{8}{4}+\frac{9}{4}} = 3.6 \quad (4.24)$$

where $T_S + T_P = T_1 = 27$ is the sequential execution time ($M = 1$) and $T_S + \frac{T_P}{4} = 7.5$ is the
expected execution time on four threads. The maximum scalability can be computed with:

$$\text{Scalability}_\infty = \lim_{M \to \infty} \frac{T_S + T_P}{T_S + \frac{T_P}{M}} = \frac{27}{1} = 27 \quad (4.25)$$
noted with triangles. Finally, some injectors have already converged in this iteration, thus
they are noted in rhombus and are skipped. The fictional computation costs assumed are
given in Fig. 4.15.
Therefore, using the work-span model to recompute scalability gives:
$$\text{Scalability}_4 = \frac{T_1}{T_4} = \frac{20.2}{7.2} \approx 2.8 \le \frac{\text{work}}{\text{span}} = \frac{T_1}{T_\infty} = \frac{20.2}{4} = 5.05 = \text{Scalability}_\infty \quad (4.27)$$
where it can be seen that both the execution on one thread (T1 ) and on four threads (T4 ) are
accelerated. Again, for this calculation the assumptions of ideal load balancing and no OHC
were used.
Nevertheless, while the localization techniques decrease the overall simulation time (T1 =
20.2 < 27 and T4 = 7.2 < 9), the scalability of the algorithm is also decreased. The reason
behind this is that the work (T1 ) decreases immediately even with one reduced task (skip-
converged, latent, etc.), while the span is harder to decrease. Taking the upper limit, T∞
will only decrease if all tasks are reduced. That is, if only a single task per parallel BLOCK
remains unreduced, T∞ will be the same as if no task was reduced (in our case T∞ = 4).
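The following sketch reproduces this work-span reasoning, assuming (purely as an illustration) that BLOCK B costs TB = 1 and that BLOCKS A, C, and D consist of 9, 8, and 9 unit-cost tasks, respectively; a simple greedy list scheduling then yields T1 = 27, T4 = 9, and T∞ = 4, i.e. the values mentioned above:

```c
#include <stdio.h>

#define MAX_THREADS 64

/* Greedy list scheduling: assign each task to the least-loaded thread and
 * return the resulting makespan of one parallel BLOCK.                     */
static double block_makespan(const double *cost, int ntasks, int M)
{
    double load[MAX_THREADS] = {0.0};
    for (int i = 0; i < ntasks; i++) {
        int best = 0;
        for (int m = 1; m < M; m++)
            if (load[m] < load[best]) best = m;
        load[best] += cost[i];
    }
    double makespan = 0.0;
    for (int m = 0; m < M; m++)
        if (load[m] > makespan) makespan = load[m];
    return makespan;
}

int main(void)
{
    /* Fictional unit-cost tasks: BLOCK B is sequential with T_B = 1, while
     * BLOCKS A, C and D contain 9, 8 and 9 tasks treated in parallel.      */
    double A[9], C[8], D[9];
    for (int i = 0; i < 9; i++) A[i] = D[i] = 1.0;
    for (int i = 0; i < 8; i++) C[i] = 1.0;

    int M = 4;
    double TS = 1.0;
    double T1 = TS + 9 + 8 + 9;                                  /* work    */
    double TM = TS + block_makespan(A, 9, M) + block_makespan(C, 8, M)
                   + block_makespan(D, 9, M);
    double Tinf = TS + 1.0 + 1.0 + 1.0;          /* longest task per BLOCK  */

    printf("T1 = %.1f  T%d = %.1f  Tinf = %.1f\n", T1, M, TM, Tinf);
    printf("scalability = T1/TM = %.2f, bounded by work/span = %.2f\n",
           T1 / TM, T1 / Tinf);
    return 0;
}
```

Reducing the cost of some tasks (skip-converged, latency) immediately lowers T1, but TM and T∞ only decrease once the longest task of each chunk or BLOCK is reduced, which is the trade-off discussed next.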
This observation introduces a trade-off between scalability and using localization tech-
niques to accelerate the simulation. However, as long as TM is decreasing, the speedup,
which is calculated against T1* (the run time of the program with one worker using the fastest,
or a very fast, sequential algorithm), is increasing. This complex behavior will be further
examined in Section 4.9 using time measurements from the test systems.
Figure 4.16: Variant of Fig. 4.14 with three different types of injectors
In the first balancing of Fig. 4.16, the tasks are distributed among threads without considering
their individual costs, leading to:

$$\text{Scalability}_4 = \frac{T_1}{T_4} = \frac{47}{17} \approx 2.8 \le \frac{\text{work}}{\text{span}} = \frac{T_1}{T_\infty} = \frac{47}{9} \approx 5.2 = \text{Scalability}_\infty \quad (4.28)$$
In the second balancing, the tasks are split among threads so as to decrease T4 , that is the
balancing mechanism is aware of the individual cost of each task. This leads to an improved
scalability:

$$\text{Scalability}_4 = \frac{T_1}{T_4} = \frac{47}{15} \approx 3.1 \le \frac{\text{work}}{\text{span}} = \frac{T_1}{T_\infty} = \frac{47}{9} \approx 5.2 = \text{Scalability}_\infty \quad (4.29)$$
Of course the maximum scalability is exactly the same as it is dictated by the most compu-
tationally costly task of each BLOCK, and it is not affected by different balancing
strategies.
In the test systems considered in Section 4.9, the size of injector models varies from 2 to
60 variables. This means that the computational tasks will also vary significantly. If the tasks
are not split over the threads properly, scalability will suffer. However, building a task-size-
aware balancing strategy is difficult and would need a lot of bookkeeping and rearranging of
tasks from one parallel segment to the next. In addition, the localization techniques modify
the cost of each task from iteration to iteration, depending on the dynamic behavior of the
system (which is unknown beforehand), thus making the task size less predictable.
In such situations, the dynamic scheduling strategy of OpenMP can offer the desired
load balancing (see Section 2.6.3). With this strategy, a set of consecutive tasks,
called a chunk, is given to each thread. When a thread has completed its assigned tasks, a
new chunk of tasks is given to it. For example, if the dynamic strategy with a single task per
chunk is used in Fig. 4.16, the best result of “Balancing 2” is obtained. Although the small
chunk selected guarantees a close to optimal balancing, it also means that the threads need
to frequently return and ask for more work, which translates to increased OHC. Moreover,
spatial locality (see Section 2.6.2) suffers as the tasks assigned to each thread are “far” from
each other, thus their data in memory are probably “far” as well.
These problems can be addressed by defining the chunk to be larger than one. This way,
the threads do not return as often to ask for more work, and by treating several consecutive
tasks it is more likely that their data are also consecutive in memory (this is the case in RAM-
SES). Larger chunks decrease the OHC but can cause imbalance among threads as some
chunks might contain more computationally expensive tasks than others. Usually, a compro-
mise is made with a chunk small enough to achieve good load balancing but adequately large
to decrease the OHC and exploit spatial locality. In the simulations of Section 4.9, a default
chunk size equal to $\max\left(\frac{N}{4M},\,1\right)$ was found to be satisfactory through a trial-and-error
procedure.
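As an illustration of this scheduling choice, the loop over injector sub-domains could be parallelized as sketched below (the function names are placeholders, not the RAMSES routines):

```c
#include <omp.h>

/* Placeholder for treating one injector sub-domain in BLOCK C (solving
 * Eq. 4.6, or evaluating the linear model 4.14 if the injector is latent). */
static void solve_injector(int i)
{
    (void)i;  /* actual model solution omitted in this sketch */
}

/* Treat the N injector sub-domains in parallel. The chunk size follows the
 * heuristic max(N/(4M), 1): large enough to limit scheduling OHC and to
 * exploit spatial locality, small enough for good load balancing.          */
void block_C(int N)
{
    int M = omp_get_max_threads();
    int chunk = N / (4 * M);
    if (chunk < 1) chunk = 1;

    /* schedule(dynamic, chunk): idle threads fetch the next chunk of
     * consecutive injectors at run time. On NUMA machines, replacing this
     * with schedule(static, chunk) keeps the same injectors on the same
     * thread in every parallel segment, improving temporal locality.       */
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = 0; i < N; i++)
        solve_injector(i);
}
```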
While the dynamic scheduling strategy of OpenMP can address load balancing and spa-
tial locality, temporal locality (see Section 2.6.2) cannot be easily achieved. The reason is
that the tasks treated by each thread, and thus the data accessed, are decided at run-time
and can change from one parallel segment of the code to the next. When executing on UMA
architecture computers, where access time to a memory location is independent of which
processor makes the request, the lack of temporal locality is not crucial.
On the other hand, when executing on NUMA architecture computers, the lack of tempo-
ral locality can introduce a high OHC. In this case, the static scheduling strategy of OpenMP
can be more effective. With this strategy, each thread always handles the same sub-domains
and accesses the same data at each parallel segment, thus increasing temporal locality. Of
course, chunks are also employed to increase spatial locality and to randomize the distribu-
tion of sub-domains to threads, thus decreasing the possibility that all injectors of the same
type are assigned to the same thread.
Summarizing, in the results of Section 4.9, two different balancing strategies have been
used depending on the computing machine used, as shown in Table 4.1.
In both cases, the first-touch memory allocation strategy has been used and each compu-
tational thread is bound to a physical core, as explained in Section 2.6.2. These mechanisms
minimize the transfer of data among different cores, which would otherwise lead to increased OHC.
The OHC considered in this work is associated with making the code run in parallel, managing
the threads, and handling the communication between them. In the previous sub-sections, it was
shown that the OHC can be increased by dynamic load balancing (which increases the effort
needed to manage the threads) and the lack of locality (which initiates data transfers and
increases the communication).
However, even if these causes are ignored, there is still OHC relating to the creation and
management of the thread pool (i.e. the group of threads used to compute the tasks) at the
fork points and the synchronization at the join points
in the code. Some implementations of the OpenMP library (like the one provided by Intel and
used in this work) allow keeping the thread pool alive in the background, during the entire
simulation, to decrease the cost. Nevertheless, the OHC is non-negligible and is dependent
on the number of threads, the operating system, and the computer used.
As long as the parallel work of each thread is still divisible (that is, it consists of more
than one task) and we assume “good” load balancing, Eq. 2.5 of Amdahl’s law can be used
to calculate the scalability without OHC. This can then be compared with the real scalability
calculated with Eq. 2.1; the difference between them is the OHC, given by:

$$OHC(M) = \frac{T_S + T_P}{T_S + \frac{T_P}{M}} - \frac{T_1}{T_M} \quad (4.30)$$
where the values of TS and TP can be acquired through a profiling of the algorithm in sequen-
tial execution (M = 1).
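Equation 4.30 translates directly into a small post-processing step on the profiled and measured timings, sketched here with purely illustrative numbers:

```c
#include <stdio.h>

/* Overhead cost of Eq. (4.30): difference between the scalability predicted
 * by Amdahl's law (Eq. 2.5) from the profiled TS and TP, and the effective
 * scalability T1/TM measured with M threads (Eq. 2.1).                     */
double ohc(double TS, double TP, double T1, double TM, int M)
{
    double amdahl    = (TS + TP) / (TS + TP / M);
    double effective = T1 / TM;
    return amdahl - effective;
}

int main(void)
{
    /* Illustrative values only: 13% of sequential work profiled at M = 1.  */
    double TS = 0.13, TP = 0.87;   /* fractions of the sequential run time  */
    double T1 = 1.0,  T4 = 0.40;   /* measured normalized run times         */
    printf("OHC(4) = %.2f\n", ohc(TS, TP, T1, T4, 4));
    return 0;
}
```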
Finally, when executing on NUMA computers, such as Machine 1 (see Table 4.1), the
OHC varies depending on the location of the computational threads within the computer
sockets. For example, Machine 1 consists of four identical sockets, each hosting two NUMA
nodes with six cores (as sketched in Fig. 2.4). Thus, when using six cores assigned to one
NUMA node, the OHC is small as the communication is faster. When using 6 − 12 threads
assigned to a single socket, the OHC is slightly larger. Finally, when threads are assigned
to different sockets, the OHC increases depending on the communication speed between
sockets.
4.8.4 Profiling
In this work, two types of profiling are used: numerical and time. Both profilings consider the
entire simulation and not individual iterations of the algorithm. For example, the operations
sketched in Fig. 4.16 relate to only one out of the thousands of iterations needed to perform
a whole dynamic simulation, while Fig. 4.17 shows several such iterations over three succes-
sive time steps (tn, tn+1, tn+2). Considering each individual iteration in the profiling would be
impractical; thus, the sum of the same operations over the different iterations is used.
For the numerical profiling, the total number of numerical operations relating to the exe-
cution of the algorithm is considered: for instance, the total number of sub-domain matrix
factorizations performed in BLOCK A, of solutions of the reduced system (4.12), etc. The
number of operations does not change when parallelizing; it changes, however, when
localization techniques are used. A selection of numerical profilings is shown in Appendix D.
For the time profiling, the time spent in each block of Fig. 4.3 in all iterations is presented
as a percentage of the total time. The profiling is performed using a sequential execution of
the algorithm (M = 1). In addition, there are some other operations shown in the “Remaining
BLOCK ” of Fig. 4.3. These operations relate to updating latency criteria, bookkeeping, treat-
ing the discrete events, computing the actions of Discrete ConTroLlers (DCTL), etc. (more
details about these operations are given in Appendix C). Some of them are performed in
parallel and others not, thus time profiling these “Remaining” operations is also needed to
correctly calculate the sequential and parallel times Ts and TP . A selection of time profilings
is presented alongside the results in the next section.
In this section some experimental results are given, using the test systems summarized in
Section 1.3. First, the contingency considered in each test system will be described. Next,
several executions of the same simulation will be performed using a combination of simulation
parameters. These are shown in Table 4.2.
In addition, the non-decomposed approach with the use of an integrated VDHN scheme,
described in Section 1.2.5.2, will be used as a benchmark. That is, it will be used to as-
sess the accuracy of the dynamic response and to provide T1∗ in Eq. 2.2, for the speedup
calculation. For simplicity, this benchmark algorithm will be referred to as integrated hereon.
Both the proposed algorithm and the integrated are implemented in RAMSES. The same
models, convergence tolerance, algebraization method (second-order BDF), and way of han-
dling the discrete events are used. For the solution of the sparse systems (the integrated
Jacobian or the reduced system of Eq. 4.12), the sparse linear solver HSL MA41 [HSL14]
is used. For the solution of the much smaller, dense injector linear systems (4.6), Intel MKL
LAPACK library is used. The matrix update criteria are as follows: for the integrated and
Config. I, all the matrices are updated every five iterations until convergence; for Configs. II
and III, the matrices of each sub-domain are updated every five iterations unless the sub-domain
has already converged. Finally, the convergence checks defined in Eqs. 1.16a and 1.17 are used,
with $\epsilon_g = \epsilon_f^{rel} = \epsilon_f^{abs} = 10^{-4}$. Keeping the aforementioned parameters and solvers of the
simulation constant for both algorithms permits a (more) rigorous evaluation of the proposed
algorithm performance.
The main investigations will be performed on Machine 1 (see Section 2.7), but a perfor-
mance comparison with Machines 2 and 3 will be shown in Section 4.9.4.4.
Figure 4.18: Nordic (operating point A): Voltage evolution at bus 1041
This is the smallest test system considered in this work with 750 DAEs. When the aforemen-
tioned decomposition is applied, the network sub-domain is formed including 77 buses with
154 variables and 43 injector sub-domains with the remaining variables. While the proposed
algorithm is designed to tackle large-scale systems, it will be shown that even small systems
such as this one can benefit from the algorithm, albeit to a smaller degree. Moreover, this small
system has been frequently used for the validation of the proposed simulation methods and
localization techniques as its small size makes it easier to detect problems and verify the
dynamic response of the system.
For this system, two different operating points were considered: an N-1 insecure point A
and a secure point B. More details on these operating points can be found in [VP13]. The
disturbance of concern is a three-phase solid fault on line 4032 − 4044, near bus 4032, lasting
five cycles (i.e. 100 ms) and cleared by opening the line, which remains open. Next, the
system is simulated over an interval of 240 s with a time-step size of one cycle. It evolves in
the long term under the effect of 22 automatic Load Tap Changers (LTCs) trying to restore the
distribution voltages as well as the synchronous generator OvereXcitation Limiters (OXLs).
The evolution of the transmission system voltage at bus 1041 is shown in Fig. 4.18, the rotor
speed of generator g15 in Fig. 4.19, and the apparent power output of the same generator
in Fig. 4.20. The curves have been computed using the Integrated method as well as the
Figure 4.19: Nordic (operating point A): Evolution of rotor speed of generator g15

Figure 4.20: Nordic (operating point A): Apparent power output of generator g15
configurations listed in Table 4.2. In response to the initial disturbance, the system undergoes
electromechanical oscillations that die out in 20 seconds. Then, the system settles at a short-
term equilibrium, until the LTCs start acting at t = 35 s. Subsequently, the voltages evolve
under the effect of LTCs and OXLs. The system is long-term voltage unstable and eventually
collapses less than three minutes after the initiating line outage. The dynamic behavior is
thoroughly examined in [VP13].
Table 4.3 shows the simulation time, the speedup (computed using Eq. 2.2), and the
maximum inaccuracy over all the bus voltages, compared to the integrated method. From
the sequential execution timings, it can be seen that Config. I performs better than the in-
tegrated, even though the two perform the same number of iterations, matrix updates, etc.
Understanding this performance difference is challenging. First, the decomposed algorithm
Table 4.3: Nordic (operating point A): Execution times and inaccuracy in simulation
Figure 4.21: Nordic (operating point A): Speedup computed with Eq. 2.2
has extra OHC related to bookkeeping, which penalizes its performance. Second, the inte-
grated method solves a linear system of size 750 (shown in Eq. 4.20) at every iteration using
the sparse solver, while for the same iteration the decomposed algorithm solves a linear
system of size 154 (two times the number of buses) and 43 smaller, dense linear systems
using the LAPACK LU procedures. While the numbers of matrix updates, factorizations,
and system solutions are the same, the two approaches have different execution times, with
the latter being faster in most of the systems tested. Thus, the performance difference be-
tween the two simulations (integrated and Config. I) is a combination of these two factors
and varies from system to system and even from contingency to contingency. However, the
speedup of Config. I over the integrated ranges between 0.9 and 1.3 in all the test systems
and contingencies checked.
Configurations II and III do not offer higher performance. This is due to the nature of
the test case: a collapsing scenario, where the entire system is driven to instability, does not
Figure 4.22: Nordic (operating point A): Effective scalability computed with Eq. 2.1
allow localization to be exploited. In addition, Config. III has the extra burden of computing the
EMA and standard deviation of each injector for the switching Algorithm 4.1. The sequential
performance of all systems will be further discussed in Section 4.9.4.1.
The speedup and scalability for a varying number of computational threads (up to 24)
are shown in Figs. 4.21 and 4.22, respectively. It can be seen that the efficiency of the
parallelization (see Section 2.5.1) is maximum at M = 4 while the best results are acquired
at M = 10 (these are the ones shown at the parallel execution column of Table 4.3). When
using more than ten computational threads, the OHC of creating and managing the extra
threads overtakes the incremental gain, thus the speedup declines. In general, for such
reduced in size systems, small UMA computers of up to four cores provide the best output in
terms of efficiency of parallelism.
Finally, it can be seen from Figs. 4.18-4.20 that all six curves are indiscernible. As ex-
plained in Section 4.7, it is expected that Configs. I and II give exactly the same output as the
Integrated. On the other hand, Config. III is expected to introduce some inaccuracy to the
simulation, especially for larger latency tolerances (ε_L). However, as this is a collapsing sce-
nario with large dynamic activity, none of the injectors becomes latent and the system response
is exactly the same as with the other configurations.
Similarly, the evolution of the transmission system voltage at bus 1041 is shown in Fig. 4.23
and the rotor speed of generator g15A in Fig. 4.24. Contrary to Operating point A, the
response at Operating point B is long-term stable. After the electromechanical oscillations
have died out, the system evolves in the long term under the effect of LTC devices acting to
restore distribution voltages. Thus, the decision about the stability of the system can only be
made after the simulation of the entire time horizon.
Figure 4.23: Nordic (operating point B): Voltage evolution at bus 1041
Figure 4.24: Nordic (operating point B): Evolution of rotor speed of generator g15A
Table 4.4 shows the simulation time, the speedup and the maximum inaccuracy over all
the bus voltages compared to the integrated method. From the sequential execution timings,
it can be seen that Config. I performs better than the integrated. For this (non-collapsing)
scenario, the localization techniques also offer significant speedup, leading to up to 1.7 times
faster execution compared to the integrated.
Figure 4.25 shows the absolute error on the voltage of transmission bus 1041, i.e. the
absolute difference between the value computed with the Integrated and with Config. III.
By simulating several scenarios with different latency tolerance values, it was found that an
Table 4.4: Nordic (operating point B): Execution times and inaccuracy in simulation
Figure 4.25: Nordic (operating point B): Absolute voltage error on bus 1041
ε_L ≤ 0.2 MVA gives a good speedup while keeping the inaccuracy introduced at a minimum.
Figure 4.26 shows the apparent power output of generator g15A computed with Config. III
for three different values of ε_L. In Config. III, the power plant is identified as latent (i.e. its
full model is replaced by the sensitivity model) in the time intervals shown in gray and active
in the rest. The vertical black lines show the transitions between modes. Larger values of ε_L
lead to earlier switching of injectors to latent. However, when latency is used, the increased
speedup comes with some inaccuracy introduced to the simulation results.
From the parallel execution timings of Table 4.4, it can be seen that Configs. I and II gain
when executed in parallel. On the other hand, the simulations with Config. III do not gain
further between the sequential and parallel execution. This is due to the increased number
of injectors becoming latent thus decreasing the amount of parallel work (see Section 4.8.1).
Figure 4.26: Nordic (operating point B): Apparent power output of generator g15A
This is the medium test system considered in this work with 35559 DAEs. The system is
decomposed into the network sub-domain, including 2565 buses, and the N = 4601 injectors.
The disturbance consists of a short circuit near bus 702, lasting six cycles (at 60 Hz), that is 100 ms,
cleared by opening a 735-kV line. Then, the system is simulated over an interval of 240 s
with a time-step of one cycle. It evolves in the long term under the effect of 1111 LTCs, 25
Automatic Shunt Reactor Tripping (ASRT) devices, as well as synchronous generator OXLs.
Figures 4.27 and 4.28 show the voltage evolution at the bus nearest to the fault using the
Integrated method and the DDM-based with the three parameter configurations of Table 4.2.
Similarly, Figs. 4.29 and 4.30 show the rotor speed of a generator and the system center of
inertia speed, respectively. From all three figures, it can be seen that the Integrated method
and the DDM-based with Configs. I and II give exactly the same response, as discussed in
Section 4.7.
However, when latency is used (Config. III), it can be seen that the system response
is modified and some inaccuracy is observed. More specifically, with ε_L = 0.1 MVA, the
difference is almost indiscernible. With ε_L = 0.2 MVA, the operation of the ASRT device
(which depends on monitoring the voltage magnitude) is delayed from t = 94 s to t = 131 s.
It must be noted that, although the difference seems large at first glance, it is considered
acceptable by the Hydro-Québec (HQ) engineers. In fact, the voltage monitored by the ASRT
device evolves marginally close to the triggering threshold, and a small difference in system
trajectory is enough to postpone its action. However, the final transmission voltage reaches a
value very close to the one obtained without approximation (bringing the final voltage profile
above some threshold values is the overall objective of the ASRT devices). The delayed
response is observed on the frequency due to load sensitivity to voltage.
An even higher latency tolerance of ε_L = 0.5 MVA leads to the ASRT device not being
triggered. Although this case is less satisfactory, a discrepancy of one in the number of
tripped reactors is still considered acceptable by the HQ engineers. The reason for the
ASRT not being triggered is the one explained above.
Figure 4.31 shows the apparent power of the generator of a hydro power plant close to the
fault location. It can be seen that the plant becomes latent at t = 159 s, returns to active mode
once for a short period of time at t = 185 s, and becomes latent again at t = 203 s. Figure 4.32
shows the absolute voltage error of transmission bus 702. It can be seen that the error peak
Table 4.6: Time profiling of the sequential execution (% of execution time)

                             Config. I   Config. II   Config. III (ε_L = 0.1 MVA, T_L = 10 s)
BLOCK A                        13.61        8.66         6.53
BLOCK B                         4.14        2.2          4.66
BLOCK C                        67.47       68.25        52.06
BLOCK D                         3.11        4.88         8.34
Remaining parallel              4.31        3.95         6.21
Remaining sequential            7.36       11.06        22.2
TP (100 - TS)                  88.5        86.74        73.14
TS (BLOCK B + Rem. Seq.)       11.5        13.26        26.86
is at t = 94 s due to the shifted ASRT action. By simulating several scenarios with different
latency tolerance values, it was found that an ε_L ≤ 0.2 MVA gives a good speedup while
keeping the inaccuracy introduced at an acceptable level.
Table 4.5 shows the simulation time, speedup and maximum inaccuracy over all the bus
voltages compared to the integrated method. From the sequential execution timings, it can be
seen that all configurations offer some speedup. Configuration III offers the highest speedup
in sequential execution at the cost of introducing the already mentioned error.
From the parallel execution timings of Table 4.5, it can be seen that all configurations offer
significant speedup when parallelized: up to 8.8 times without any inaccuracy and up to 11.1
times when latency is used and ε_L ≤ 0.2 MVA. A more detailed view is offered in Fig. 4.33,
where the speedup is shown as a function of the number of cores used.
Table 4.6 shows the time profiling of the sequential execution (M = 1), with the percent-
age of time spent in each block of the algorithm in Fig. 4.3. As expected, using Config. III
leads to lower percentage of parallel work as the DAE models of injectors are replaced by
simple sensitivity models in BLOCKS A and C. Figure 4.34 shows the scalability (computed
with Eq. 2.1) against the theoretic scalability (computed with Eq. 2.5 and TP and TS of Ta-
ble 4.6). The difference between them is due to the OHC of the implementation. In addition,
Fig. 4.35 shows the parallelization efficiency using Eq. 2.3. It can be seen that Configs. I and
II are parallelized more efficiently than Config. III. This is due to the higher percentage of
parallel work in the former and the more difficult load balancing in the latter (as the amount of
work performed by each task is unpredictable and leads to imbalances between the chunks
used). The values given in Table 4.6 should be compared carefully as they relate to different
execution times (as seen in Table 4.5).
Figure 4.36 shows the number of active injectors during the simulation with Config. III.
It can be seen that in the short-term all injectors remain active, thus the main speedup in
this period comes from the parallelization of the algorithm. In the long-term, and as the
electromechanical oscillations fade, the injectors start switching to latent and decrease the
computational burden of treating the injectors as well as the percentage of work in the parallel
sections. Hence, in this part of the simulation, the main source of acceleration is latency.
Consequently, the two sources of acceleration complement each other as they perform better
at different parts of the simulation. This is the reason why, even though Config. III has lower
scalability, it is still the fastest in parallel execution (see Fig. 4.33).
Finally, Fig. 4.37 shows the real-time performance of the algorithm with Config. II. When
the wall time curve is above the real-time line, then the simulation is lagging; otherwise, the
simulation is faster than real-time and can be used for more demanding applications, like
look-ahead simulations, training simulators or hardware/software in the loop. On this power
system, the algorithm performs faster than real-time when executed on 24 or more cores.
The real-time performance of all systems will be further discussed in Section 4.9.4.3.
by opening two double-circuit lines. The system is simulated over a period of 240 s with a
time-step size of one cycle (20 ms).
Figures 4.38 and 4.39 show the voltage evolution of transmission bus F0322411 and the
rotor speed of synchronous generator FRHY1689, respectively. This test case is stable in
the long-term. After the electromechanical oscillations have died out, the system evolves
under the effect of LTCs as well as OXLs. Thus, the decision about the stability of the system
can only be made after the simulation of the whole time horizon. It should be noted that
the discrepancy shown in Fig. 4.39 is insignificant (see scale) and is comparable to the
Table 4.8: Time profiling of the sequential execution (% of execution time)

                             Config. I   Config. II   Config. III (ε_L = 0.1 MVA, T_L = 10 s)
BLOCK A                        12.12        8.63         6.8
BLOCK B                         6.33        6.81        13.24
BLOCK C                        66.76       66.85        46.47
BLOCK D                         4.4         6.09        11.34
Remaining parallel              3.68        4.46         6.46
Remaining sequential            6.71        7.16        15.69
TP (100 - TS)                  86.96       86.03        71.07
TS (BLOCK B + Rem. Seq.)       13.04       13.97        28.93
significant speedup when parallelized: up to 7.1 times without any inaccuracy and up to 8.8
times when latency is used. A more detailed view is offered in Fig. 4.41, where the speedup
is shown as a function of the number of cores used for the simulation.
Table 4.8 shows the time profiling of the sequential execution, with the percentage of time
spent in each block of the algorithm in Fig. 4.3. As noted previously, using Config. III leads
to lower percentage of parallel work in BLOCKS A and C. Figure 4.42 shows the effective
scalability (computed with Eq. 2.1) against the theoretic scalability (computed with Eq. 2.5
and the timings TP and TS of Table 4.8). In addition, Fig. 4.43 shows the parallelization
efficiency using Eq. 2.3. It can be seen that Configs. I and II are parallelized more efficiently
than Config. III due to the higher percentage of parallel work.
Finally, Fig. 4.44 shows the real-time performance of the algorithm with Config. II. Con-
trary to the previous test system, real-time performance is only achieved after approximately
17 s of simulation time. This means that some applications demanding “hard” real-time (at ev-
ery time instant) are not possible for this 15000-bus system. However, applications with “soft”
real-time demands (allowing some limited overrun) are still possible. Figure 4.45 shows the
overrun of the simulations, that is, how much they lag behind real time. With M = 24,
the overrun is limited to 4 s. The real-time performance of all systems will be further dis-
cussed in Section 4.9.4.3.
4.9.4 Discussion
This subsection presents a discussion concerning the sequential, parallel, and real-time per-
formance of the proposed algorithm. Moreover, the test cases above are simulated with the
UMA Machine 3 to show the performance on a standard office laptop.
In sequential execution (M = 1), the main sources of speedup for the proposed algorithm are
the localization techniques presented in Section 4.6. Tables 4.3, 4.4, 4.5, and 4.7 show this
performance in the sequential execution column.
The proposed DDM has some extra OHC compared to the integrated, related to the book-
keeping and management of the sub-domains, the Schur-complement computation, etc. In
addition, the memory management and the way of solving the linear systems differ be-
tween the two algorithms. In the integrated, the entire Jacobian (see Eq. 4.20) is kept in
memory and treated by the sparse solver. On the contrary, in the DDM, the injector matrices
are kept separately in Ai , Bi , and Ci and systems (4.6) are factorized and solved with the
LAPACK dense linear solvers (DGETRF, DGETRS). These small matrices are more likely to
fit in the L1 cache memory at once, without the need to load them in segments.
Configuration I does not use any localization techniques and follows the same Jacobian
matrix updates as the integrated; it is in fact equivalent to solving (4.20). It can be seen that
in the Nordic and HQ systems, the sequential execution of Config. I offers some speedup
compared to the integrated. That means that the OHC of the proposed DDM is more than
compensated by the separate management and solution of the injectors. On the other hand,
in the Pegase test case, Config. I is actually 10% slower than the integrated, meaning that
the OHC of the DDM is higher.
On the contrary, Configs. II and III are always faster than the integrated as they exploit
the localization techniques described in Section 4.6. These techniques strongly rely on the
decomposed nature of the algorithm and would be very hard, if not impossible, to implement
in the integrated method. First, all states are solved in the integrated system (4.20) no matter
if some injectors have converged or not. Second, partially updating the factorized Jacobian of
(4.20) is very complex and a specialized sparse linear solver would be necessary to achieve
this. Finally, replacing the injector DAE models with linear equivalents would require rebuilding
and refactorizing the integrated Jacobian matrix every time.
It has been shown that the proposed DDM, with the use of localization techniques, can
provide significant acceleration even in sequential execution. This is important when exe-
cuting on legacy single-core machines, or to increase the throughput of DSA schemes by
simulating several contingencies concurrently using one core for each.
One of the main advantages of DDMs is their parallelization potential. Tables 4.3, 4.4, 4.5,
and 4.7 show the maximum speedup achieved by each configuration, while Figs. 4.21, 4.33,
and 4.41 show the speedup as a function of the number of cores.
First, it can be seen that all the configurations gain from the parallelization. The high-
est speedup is achieved by the ones using the localization techniques. Thus, Configs. II
and III already start from a large speedup in sequential execution (as seen in the previous
subsection) and reach even higher speedups in parallel.
However, the scalability of the configurations is in reverse order. That is, the configura-
tions that do not use localization techniques are parallelized more efficiently (see Figs. 4.35
and 4.43) and reach higher scalability values (see Figs. 4.34 and 4.42). The reason for this is
explained in Section 4.8.1, and can be seen in practice in Tables 4.6 and 4.8. In these tables,
the same test case was executed with Configs. I, II, and III; they show that the percentage of
time spent in the parallel segments (TP) gets lower when acceleration techniques are used.
Furthermore, Figs. 4.34 and 4.42 show that the Pegase system, even though much larger
in size, achieves the same scalability as HQ. That is because scalability is not proportional to
the size of the system but depends on TP . Tables 4.6 and 4.8 show that this value is almost
the same for both systems.
In practice, the proposed DDM exploits parallelization for the treatment of the injector sub-
domains. Thus, the higher the percentage of operations involving the injectors, the better the
scalability. This has two contributing factors:
1. The number of network states (which is equal to the size of the sequentially treated
reduced system of Eq. 4.12) compared to the number of injector states (which are
treated in parallel). The larger the ratio of injector states to network states, the better
the expected scalability. HQ has a ratio of 6.9 and Pegase of 4.8. Thus, one would
expect HQ to have better scalability than Pegase.

Hence, while the ratio of injector states to network states provides an insight into the expected
performance of the algorithm, scalability will eventually depend on the value of TP, which is
not known beforehand and depends on the types of injectors, the use of localization
techniques, the contingency simulated, the load-balancing efficiency, etc.
Finally, Figs. 4.34 and 4.42 show the theoretic and the effective scalability computed with
Eqs. 2.5 and 2.1, respectively. It can be seen that the effective scalability is always smaller
than the theoretic one and the difference between them (given by Eq. 4.30) increases for
higher numbers of cores. This phenomenon is explained in Sections 2.5.2 and 4.8.3. Al-
though it is possible in parallel implementations to achieve higher effective scalability than the
theoretic one due to differences in memory management (for example, the program has access
to bigger total cache memory when using more cores) [Gov10], this was not observed in any
of the simulations performed with RAMSES.
Fast dynamic simulations of large-scale systems can be used for operator training and testing
global control schemes implemented in Supervisory Control and Data Acquisition (SCADA)
systems. In brief, measurements (such as the open/closed status of a switch, power flows,
voltages, currents, etc.) are transferred from Remote Terminal Units (RTUs) to the SCADA cen-
ter through a communication system. This information is then visualized for the operators,
who take decisions on corrective actions to be communicated back to the RTUs. In rare
applications, some remedial actions are computed automatically by closed-loop procedures.
In modern SCADA systems the refresh rate (TR ) of these measurements is 2 − 5 seconds
[GSAR09].
The simulator in these situations takes on the role of the real power system along with the
RTU measurements and the communication system. It needs to provide the simulated “mea-
surements” to the SCADA system with the same refresh rate as the real system. Thus, the
concept of “real-time” for these applications translates to time deadlines and some overruns
are acceptable.
For the Nordic test system, the execution is always faster than real-time no matter the
number of threads or the use of localization techniques. This is the case also for the HQ
system when 24 cores are used in Config. II (see Fig. 4.37). Thus, for these systems, all
possible refresh times (TR) can be met. The Pegase test system, however, exhibits some
overruns. These are shown in Fig. 4.45, where the maximum overrun is 4 s with 24 cores in
Config. II. This means that the simulator can still be used for time deadlines with TR ≥ 4 s.
Overall, a model of 8000 buses and 11000 injectors (totaling 75000 DAEs) was found to
be the limit of the proposed implementation, on Machine 1, without overruns. This limit was
found automatically by taking the Nordic system and gradually replacing its distribution loads
with a series of the DN systems shown in Fig. B.1. Then, three different contingencies in the TN
were simulated on the modified system while checking for overruns. The procedure stopped
when overruns were detected even when using all the available cores of Machine 1.
For the previous simulations, Machine 1 was used to allow “scanning” through a varying num-
ber of cores and show the performance of the algorithm as a function of their number. However, this
algorithm can provide significant speedup even on smaller UMA standard office machines.
Thus, the previous test cases were executed on Machines 2 and 3 to show the performance
of the algorithm. The former has a dual-core while the latter a quad-core processor. The
dynamic response and the inaccuracy introduced by latency (Config. III) are not presented
as they are identical to the ones shown previously.
Figure 4.46: HQ: Overrun of the algorithm on UMA machines with Config. II
Table 4.9 shows the execution times and speedup of simulating the test case of Sec-
tion 4.9.2, using the two laptop computers. Machine 2 (resp. 3) with Config. II achieves a
speedup of 1.8 (resp. 1.7) in sequential and 2.3 (resp. 3) in parallel execution, with a scala-
bility of 1.3 (resp. 1.8).
Figure 4.46 shows the overrun of the simulation on both machines in parallel execution.
It can be seen that on Machine 2 there is an overrun of 3.5 s, while on Machine 3 the overrun is
negligible; thus, real-time simulations are possible for this real 2500-bus (modeled with 35000
DAEs) system. Configuration III with ε_L = 0.2 MVA offers a speedup of 2.6 (resp. 6.5) and
simulates this long-term test case in 94 s (resp. 31 s).
Similarly, Table 4.10 shows the performance information of the UMA machines for the
scenario of Section 4.9.3. With the fully accurate Config. II, Machine 2 (resp. 3) achieves
a speedup of 1.3 (resp. 1.4) in sequential and 1.7 (resp. 2.4) in parallel execution, with a
scalability of 1.3 (resp. 1.7). When the latency technique with ε_L = 0.2 MVA is used, a
speedup of 4.3 (resp. 5.5) is achieved and the test case is simulated in 137 s (resp. 94 s).
Overall, in this section it can be seen that the proposed algorithm provides significant
speedup even on normal laptop computers. It allows performing fast and accurate power
system dynamic studies without the need for expensive equipment.
4.10 Summary
In this chapter, a parallel DDM-based algorithm has been proposed for the dynamic simula-
tion of power systems. The algorithm partitions the network from the injectors attached to it,
providing a star-shaped decomposition. The sub-domains formulated are treated indepen-
dently and in parallel while their interface variables are updated using a Schur-complement
approach. Furthermore, three localization techniques were presented to accelerate the sim-
ulation both in sequential and parallel execution.
First, the mathematical formulation of the proposed algorithm and of the localization techniques has been detailed. Next, a comparison of the proposed algorithm with the simultaneous approach was presented to investigate its convergence properties and how these are affected by the use of localization techniques. Then, the parallelization procedure was presented using the semantics of Chapter 2, analyzing how the localization techniques and the OHC can affect the performance metrics. Subsequently, three test cases were presented, involving a small, a medium, and a large-scale test system, respectively. The accuracy and performance of the algorithm in all three cases were assessed against a simultaneous approach implemented in the same software. Finally, a comparative overall assessment of the algorithm's sequential, parallel, and real-time behavior was presented.
Chapter 5

Parallel two-level Schur-complement-based decomposition method
5.1 Introduction
The most noticeable developments foreseen in power systems involve Distribution Networks (DNs). Future DNs are expected to host a large share of renewable energy sources. The resulting challenge in power system dynamic simulations is to accurately model DNs and their participation in the bulk system dynamic behavior. This becomes essential as DNs are called upon to actively support the Transmission Network (TN), with an increasing number of Distributed Generators (DGs) and flexible loads participating in ancillary services through Smart Grid technologies.
In present-day dynamic security assessment of large-scale power systems, it is common
to represent the bulk generation and higher voltage (transmission) levels accurately, while
the lower voltage (distribution) levels are equivalenced. On the other hand, when studying
the response of DNs, the TN is often represented by a Thévenin equivalent. The prime
motivation behind this practice has been the lack of computational resources. Indeed, fully
representing the entire power system network was historically impossible given the available
computing equipment (memory capacity, processing speed, etc.) [KRF92]. Even with current
computational resources, handling the entire, detailed model with hundreds of thousands of
Differential and Algebraic Equations (DAEs) is extremely challenging [KRF92, GWA11].
As modern DNs are evolving with power electronic interfaces, DGs, active loads, and
control schemes, more detailed and complex dynamic equivalent models would be needed to
encompass the dynamics of DNs and their impact on the global system dynamics. A dynamic
equivalent of a power system is a low-order dynamic model of the system, which is usually
obtained by the reduction of a given full model [MMR10]. Some equivalencing approaches
reported in the literature are modal methods, synchrony (or coherency) methods [MMR10],
and measurement- or simulation-based methods [ANL+12]. Nevertheless, equivalent models inevitably suffer from a number of drawbacks:
• The identity of the replaced system is lost. Faults that happen inside the DNs them-
selves cannot be simulated and individual voltages at internal buses, currents, con-
trollers, etc. cannot be observed anymore. This makes it difficult to simulate controls
or protections that rely on these values (e.g. fault ride through tripping of DGs).
• Most equivalent models target a specific type of dynamics (short or long-term, elec-
tromechanical oscillations, voltage recovery, etc.) and fail when used for other types.
Hence, different types of simulations require different models. This adds the burden of maintaining and updating power system models whenever system changes take place.
• In most cases, the use or not of these equivalent models is decided off-line, when it is
still unknown whether and how the contingency simulated will affect the DNs.
In this chapter, a parallel two-level Schur-complement-based DDM is proposed for the dy-
namic simulation of combined Transmission and Distribution (T&D) systems. First, the algo-
rithm decomposes the combined system on the boundary between the TN and the DNs. This
leads to the creation of several sub-domains, each defined by its own network and injectors.
Then, a second decomposition scheme is applied within each sub-domain, splitting the net-
work from the injectors, in a similar way to the single-level algorithm of Chapter 4. Finally,
the solution of the sub-domain DAE systems is performed hierarchically with the interface
variables being updated using a Schur-complement approach at each decomposition level.
It will be shown that this algorithm can also be applied to sub-transmission networks, as long
as some part of them is radial.
The proposed algorithm improves the performance of the simulation in two ways. First,
the independent calculations of the sub-systems (on both decomposition levels) are par-
allelized providing computational acceleration. Second, the three localization techniques
described in Section 4.6 are employed on both decomposition levels to avoid unnecessary
computations and provide numerical acceleration.
The algorithm is first presented with a certain level of abstraction, focusing on its mathe-
matical formulation. Next, the details concerning its implementation using the shared mem-
ory parallel computing model are presented. Finally, some results are shown using the ex-
panded Nordic and Hydro-Québec systems presented in Section 1.3.
[Figure 5.1: sketch of the decomposed power system: the Central sub-domain, the lines or transformers connecting it to the Satellite sub-domains, and the injectors attached to each sub-domain]
Although this algorithm is not computationally efficient (it has a complexity of O(R!), where R is the number of buses), it needs to be executed only once for each network. There
is no need to revise the decomposition when topological changes are applied to the net-
work, unless the changes destroy the star-shaped decomposition (for example a new line or
transformer is put in service between two Satellite sub-domains).
Either from the electrical topology (voltage levels and distribution transformers) or with
the use of Algorithm 5.1, the power system sketched in Fig. 5.1 is partitioned into the Central
and L Satellite sub-domains, along with their injectors. This decomposition is reflected in the system of DAEs (1.1) as follows.
First, the DAE system describing the Central sub-domain with its injectors becomes:

$$0 = g_C(x_C, V_C, V_{St})$$
$$\Gamma_{Cj}\,\dot{x}_{Cj} = \Phi_{Cj}(x_{Cj}, V_C), \qquad j = 1, \dots, N_C \tag{5.4}$$

where $N_C$ is the number of injectors attached to the Central sub-domain network, and $x_{Cj}$ and $\Gamma_{Cj}$ are the projections of $x_C$ and $\Gamma_C$ (defined in Eq. 5.1) on the j-th injector. Thus, $x_C = [x_{C1} \dots x_{CN_C}]^T$ and $\Gamma_C = \mathrm{diag}[\Gamma_{C1}, \dots, \Gamma_{CN_C}]$.

The currents injected into the Central sub-domain network by its injectors are given by

$$I_C = \sum_{j=1}^{N_C} C_{Cj}\, x_{Cj}$$

where $C_{Cj}$ is a trivial matrix with zeros and ones whose purpose is to extract the injector current components from $x_{Cj}$. Similarly, for the $S_i$ sub-domain:

$$I_{Si} = \sum_{j=1}^{N_{Si}} C_{Sij}\, x_{Sij} \tag{5.7}$$
As in the decomposition of Chapter 4, the network and injector systems are coupled
through the currents injected into the network and the voltages of the buses to which the injectors are attached.
At each discrete time instant, the nonlinear, discretized injector equations are solved
simultaneously with the linear network equations using a Newton method. Thus, at the k-th
Newton iteration, the following systems are solved:
$$D_C^k\,\Delta V_C^k - \underbrace{\sum_{j=1}^{N_C} C_{Cj}\,\Delta x_{Cj}^k}_{\Delta I_C^k} - \sum_{i=1}^{L} E_{Si}^k\,\Delta V_{Si}^k = -\underbrace{g_C(x_C^{k-1}, V_C^{k-1}, V_{St}^{k-1})}_{g_C^k} \tag{5.9}$$

$$A_{Cj}^k\,\Delta x_{Cj}^k + B_{Cj}^k\,\Delta V_C^k = -\underbrace{f_{Cj}(x_{Cj}^{k-1}, V_C^{k-1})}_{f_{Cj}^k}, \qquad j = 1, \dots, N_C \tag{5.10}$$
Figure 5.2: Hierarchical solution of two-level decomposed algorithm (four steps)
and (i = 1, . . . , L):

$$D_{Si}^k\,\Delta V_{Si}^k - \underbrace{\sum_{j=1}^{N_{Si}} C_{Sij}\,\Delta x_{Sij}^k}_{\Delta I_{Si}^k} + F_{Si}^k\,\Delta V_C^k = -\underbrace{g_{Si}(x_{Si}^{k-1}, V_{Si}^{k-1}, V_{Cti}^{k-1})}_{g_{Si}^k} \tag{5.11}$$

$$A_{Sij}^k\,\Delta x_{Sij}^k + B_{Sij}^k\,\Delta V_{Si}^k = -\underbrace{f_{Sij}(x_{Sij}^{k-1}, V_{Si}^{k-1})}_{f_{Sij}^k}, \qquad j = 1, \dots, N_{Si} \tag{5.12}$$
where $A_{Cj}^k$ (resp. $A_{Sij}^k$) is the Jacobian matrix of the j-th injector with respect to its own states and $B_{Cj}^k$ (resp. $B_{Sij}^k$) with respect to the voltages of its sub-domain. Finally, $E_{Si}^k$ is the Jacobian of the Central sub-domain with respect to the voltages of the $S_i$ sub-domain and $F_{Si}^k$ is the Jacobian of the $S_i$ sub-domain with respect to the voltage of the Central sub-domain.
The decomposed system results in $L + 1 + N_C + \sum_{i=1}^{L} N_{Si}$ linear systems (5.9)-(5.12) to be solved at each Newton iteration to compute the vectors $x(t_n)$ and $V(t_n)$. In the proposed algorithm, the solution is performed in a hierarchical manner as sketched in Fig. 5.2, using
a Schur-complement approach at each iteration to treat the interface variables between sub-
domains. This procedure is summarized below.
First, the sub-domain reduced systems are formulated by eliminating the injector states
(xCj or xSij ) from the sub-domain network equations. This leads to reduced systems that
involve only the sub-domain voltage states (VC and VSi ). This procedure is the same as
described in Section 4.4 of the single-level algorithm. Twoports are treated as described in
the previous chapter with the limitation that they cannot be connected between two different
Satellite sub-domains, as this destroys the star-shaped partition layout.
Second, the global reduced system is obtained by eliminating the Satellite sub-domain
voltage states (VSt ) from the Central reduced system. This leads to a global reduced system
that involves only the voltage states of the Central sub-domain (VC ).
Then, the latter is solved and the computed Central sub-domain voltages (VC ) are back-
substituted into the sub-domain reduced systems. This decouples the solution of these sys-
tems which now involve only their own sub-domain voltage states (VSi ). Thus, they can be
solved independently.
Similarly, the sub-domain voltage states (VC and VSi ) are back-substituted into the injector
equations, thus decoupling their solution as they now involve only their local states (xCj or
xSij ). Hence, their solution can be also performed independently.
These steps are detailed in the following sub-sections.
The sub-domain reduced systems are formulated by eliminating the injector states ($x_{Cj}$ or $x_{Sij}$) from the sub-domain network equation systems (5.9) and (5.11). This leads to reduced systems that involve only the sub-domain voltage states ($V_C$ and $V_{Si}$):

$$\left(D_C^k + \sum_{j=1}^{N_C} C_{Cj}\,(A_{Cj}^k)^{-1} B_{Cj}^k\right)\Delta V_C^k - \sum_{i=1}^{L} E_{Si}^k\,\Delta V_{Si}^k = -g_C^k - \sum_{j=1}^{N_C} C_{Cj}\,(A_{Cj}^k)^{-1} f_{Cj}^k$$
$$\Longleftrightarrow\quad \widetilde{D}_C^k\,\Delta V_C^k - \sum_{i=1}^{L} E_{Si}^k\,\Delta V_{Si}^k = -\widetilde{g}_C^k \tag{5.13}$$
and (i = 1, . . . , L):

$$\left(D_{Si}^k + \sum_{j=1}^{N_{Si}} C_{Sij}\,(A_{Sij}^k)^{-1} B_{Sij}^k\right)\Delta V_{Si}^k + F_{Si}^k\,\Delta V_C^k = -g_{Si}^k - \sum_{j=1}^{N_{Si}} C_{Sij}\,(A_{Sij}^k)^{-1} f_{Sij}^k$$
$$\Longleftrightarrow\quad \widetilde{D}_{Si}^k\,\Delta V_{Si}^k + F_{Si}^k\,\Delta V_C^k = -\widetilde{g}_{Si}^k \tag{5.14}$$
As discussed in the previous chapter, the nonzero structure of the correction terms $C_{Cj}\,(A_{Cj}^k)^{-1} B_{Cj}^k$ and $C_{Sij}\,(A_{Sij}^k)^{-1} B_{Sij}^k$ depends on whether the component is an injector (attached to one bus) or a twoport (attached to two buses). Hence, the reduced system matrices $\widetilde{D}_C^k$ (resp. $\widetilde{D}_{Si}^k$) exhibit the sparsity pattern of $D_C^k$ (resp. $D_{Si}^k$) with some fill-in terms introduced by twoports.
Likewise, the global reduced system is formulated by eliminating the Satellite sub-domain voltage states ($V_{Si}$) from the Central reduced system equations (5.13). This leads to a global reduced system that involves only the voltage states of the Central sub-domain ($V_C$):

$$\left(\widetilde{D}_C^k + \sum_{i=1}^{L} E_{Si}^k\,(\widetilde{D}_{Si}^k)^{-1} F_{Si}^k\right)\Delta V_C^k = -\widetilde{g}_C^k - \sum_{i=1}^{L} E_{Si}^k\,(\widetilde{D}_{Si}^k)^{-1}\,\widetilde{g}_{Si}^k \tag{5.15}$$

The global reduced system matrix $\bar{D}_C^k$ maintains the sparsity pattern of $\widetilde{D}_C^k$, as the elimination procedure is similar to that of an injector (attached to one bus), described previously. Moreover, the computations of $E_{Si}^k\,(\widetilde{D}_{Si}^k)^{-1} F_{Si}^k$ and $E_{Si}^k\,(\widetilde{D}_{Si}^k)^{-1}\,\widetilde{g}_{Si}^k$ are efficient as the matrices $E_{Si}^k$ and $F_{Si}^k$ are extremely sparse, given that each Satellite sub-domain is attached to only one bus in the Central sub-domain.
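To make the four solution steps concrete, the following minimal sketch performs one iteration of the hierarchical scheme of Eqs. (5.9)-(5.15) in dense NumPy on a toy system with randomly generated, hypothetical block matrices. It is only an illustration of the procedure, not the RAMSES implementation (which relies on sparse solvers and OpenMP parallelism); all dimensions and values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(n):
    """Well-conditioned random matrix standing in for a Jacobian block."""
    return np.eye(n) + 0.1 * rng.standard_normal((n, n))

# Hypothetical toy dimensions: nC Central network variables, L Satellites with
# nS variables each, nx states per injector, NC / NS injectors per sub-domain.
nC, nS, nx, L, NC, NS = 8, 6, 4, 3, 2, 2

# Central sub-domain blocks of Eqs. (5.9)-(5.10)
DC  = block(nC);                      gC  = rng.standard_normal(nC)
ACj = [block(nx) for _ in range(NC)]
BCj = [rng.standard_normal((nx, nC)) for _ in range(NC)]
CCj = [rng.standard_normal((nC, nx)) for _ in range(NC)]
fCj = [rng.standard_normal(nx) for _ in range(NC)]

# Satellite sub-domain blocks of Eqs. (5.11)-(5.12) and coupling terms E, F
DS = [block(nS) for _ in range(L)];   gS = [rng.standard_normal(nS) for _ in range(L)]
ES = [rng.standard_normal((nC, nS)) for _ in range(L)]
FS = [rng.standard_normal((nS, nC)) for _ in range(L)]
AS = [[block(nx) for _ in range(NS)] for _ in range(L)]
BS = [[rng.standard_normal((nx, nS)) for _ in range(NS)] for _ in range(L)]
CS = [[rng.standard_normal((nS, nx)) for _ in range(NS)] for _ in range(L)]
fS = [[rng.standard_normal(nx) for _ in range(NS)] for _ in range(L)]

# Step 1: sub-domain reduced systems (Eqs. 5.13-5.14), injector elimination
Dt_C = DC + sum(CCj[j] @ np.linalg.solve(ACj[j], BCj[j]) for j in range(NC))
gt_C = gC + sum(CCj[j] @ np.linalg.solve(ACj[j], fCj[j]) for j in range(NC))
Dt_S, gt_S = [], []
for i in range(L):
    Dt_S.append(DS[i] + sum(CS[i][j] @ np.linalg.solve(AS[i][j], BS[i][j]) for j in range(NS)))
    gt_S.append(gS[i] + sum(CS[i][j] @ np.linalg.solve(AS[i][j], fS[i][j]) for j in range(NS)))

# Step 2: global reduced system (Eq. 5.15), Satellite voltage elimination
Dbar_C = Dt_C + sum(ES[i] @ np.linalg.solve(Dt_S[i], FS[i]) for i in range(L))
gbar_C = gt_C + sum(ES[i] @ np.linalg.solve(Dt_S[i], gt_S[i]) for i in range(L))
dVC = np.linalg.solve(Dbar_C, -gbar_C)

# Step 3: back-substitution into the Satellite reduced systems (Eq. 5.14)
dVS = [np.linalg.solve(Dt_S[i], -gt_S[i] - FS[i] @ dVC) for i in range(L)]

# Step 4: back-substitution into the injector equations (Eqs. 5.10 and 5.12)
dxC = [np.linalg.solve(ACj[j], -fCj[j] - BCj[j] @ dVC) for j in range(NC)]
dxS = [[np.linalg.solve(AS[i][j], -fS[i][j] - BS[i][j] @ dVS[i]) for j in range(NS)]
       for i in range(L)]

# Check: the hierarchical solution also satisfies the Central network Eq. (5.9)
res = DC @ dVC - sum(CCj[j] @ dxC[j] for j in range(NC)) \
      - sum(ES[i] @ dVS[i] for i in range(L)) + gC
print("residual of Eq. (5.9):", np.linalg.norm(res))
```

The printed residual is at round-off level, illustrating that the hierarchical elimination and back-substitution reproduce the solution of the original coupled linear system.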
$\epsilon_g = 0.001$, the latter generator will be solved with a smaller relative accuracy. To deal with this problem, a smaller $\epsilon_g$ has to be selected for all injectors to ensure the accurate solution of the smaller ones. In this way, however, the larger generator will be solved with a higher accuracy than needed, leading to a larger computational burden due to more iterations.

Decomposing a T&D system as detailed in the previous sections of this chapter allows different $S_{base}$ values to be selected in the various sub-domains. Thus, a smaller $S_{base}$ can be used in the Satellite sub-domains (DNs) and a larger one in the Central sub-domain (TN). In the same example as above, if $S_{baseC} = 100$ MVA is used for the TN and $S_{baseS} = 1$ MVA for the DNs, the current output of both injectors will be around 1 pu, thus avoiding the previous problems.
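As a small numerical illustration of this choice of base powers (the injector ratings below are hypothetical, only their orders of magnitude matter):

```python
# Hypothetical injector outputs: a large TN generator and a small DN DG unit.
S_gen_MVA, S_dg_MVA = 100.0, 1.0

def per_unit(S_MVA, S_base_MVA):
    """Apparent power converted to per unit on the chosen base."""
    return S_MVA / S_base_MVA

# Single system-wide base of 100 MVA: the DG output is only 0.01 pu, so a common
# convergence tolerance (e.g. 0.001 pu) checks the DG with poor relative accuracy.
print(per_unit(S_gen_MVA, 100.0), per_unit(S_dg_MVA, 100.0))   # 1.0 and 0.01

# Per-sub-domain bases (Central: 100 MVA, Satellite: 1 MVA): both injectors are
# solved around 1 pu, so the same tolerance has a comparable meaning for both.
print(per_unit(S_gen_MVA, 100.0), per_unit(S_dg_MVA, 1.0))     # 1.0 and 1.0
```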
The idea is similar to the one presented in Section 4.6.1, but applied at both decomposition
levels. It is used within one discretized time instant solution to stop computations of injector or
sub-domain reduced systems whose DAE models have already been solved with the desired
tolerance. That is, after one decomposed solution, the convergence of each injector and
sub-domain reduced system is checked individually (BLOCK F ). If the convergence criterion
is satisfied, then the specific sub-system is flagged as converged and for the remaining iter-
ations of the current time instant it is not solved. Nevertheless, its mismatch vector ($f_{Cj}^k$, $f_{Sij}^k$, $\widetilde{g}_{Si}^k$, or $\bar{g}_C^k$) is monitored to guarantee that it remains converged. This technique decreases
the computational effort within one discretized time instant without affecting the accuracy of
the solution.
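A minimal sketch of the bookkeeping behind this technique is given below (hypothetical data structures and a toy scalar example, not the actual implementation): converged sub-systems are skipped, but their mismatch keeps being monitored so they can be re-activated if it grows again.

```python
import numpy as np

def newton_loop_with_skip(subsystems, solve_one, mismatch, tol, max_iter=20):
    """Sweep over sub-systems, skipping those already converged while still
    monitoring their mismatch vector; returns the number of sweeps performed."""
    converged = {s: False for s in subsystems}
    for k in range(max_iter):
        for s in subsystems:
            if converged[s]:
                # Skip the solution but keep checking the mismatch.
                if np.linalg.norm(mismatch(s)) > tol:
                    converged[s] = False        # it drifted: solve it again
                continue
            solve_one(s)                        # one local Newton update
            converged[s] = np.linalg.norm(mismatch(s)) < tol
        if all(converged.values()):
            return k + 1
    return max_iter

# Toy usage: each "sub-system" is a scalar Newton solve of x**2 = target.
state = {"a": {"target": 2.0, "x": 1.0}, "b": {"target": 9.0, "x": 2.0}}

def solve_one(name):
    s = state[name]
    s["x"] -= (s["x"] ** 2 - s["target"]) / (2.0 * s["x"])   # one Newton step

def mismatch(name):
    s = state[name]
    return np.array([s["x"] ** 2 - s["target"]])

print(newton_loop_with_skip(list(state), solve_one, mismatch, tol=1e-10))
```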
Figure 5.5: Latency applied at both decomposition levels (S denotes apparent power)

Next, taking advantage of the fact that each sub-domain is solved using the VDHN presented in Section 1.2.5.2, the sub-system update criteria are decoupled and their matrices (such as $D_C$, $D_{Si}$, $A_{Cj}$, $A_{Sij}$, $B_{Cj}$, $B_{Sij}$, etc.), as well as their Schur-complement contributions and the sub-domain reduced systems, are updated asynchronously. In this way, sub-domains at both decomposition levels which converge fast keep the same matrices for many iterations and even time steps, while sub-domains which converge more slowly update their matrices more frequently. Thus, BLOCKS A and B of Fig. 5.3 are replaced by those in Fig. 5.4.
The same update criteria as in Chapter 4 are used. That is, if a sub-system has not
converged after five iterations of the algorithm presented in Fig. 5.3, its matrices and Schur-
complement terms are updated. Moreover, an update of the matrices is triggered when a
change in the equations of the sub-system is detected. Of course, after a severe event in
the system (such as a short-circuit, the tripping of a branch or a generator, etc.) or when
the time step size used for the discretization is changed, an update of all the matrices, the
Schur-complement terms and the reduced systems is forced to avoid convergence problems.
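The asynchronous update rule can be sketched as follows; this is a simplified, hypothetical piece of bookkeeping (class and attribute names are illustrative, not taken from the implementation).

```python
MAX_STALL_ITERS = 5   # update after five non-converged iterations (Section 5.5.2)

class SubSystemMatrices:
    """Tracks how long a sub-system's Jacobian blocks (and its Schur-complement
    contribution) have been kept, and decides when they must be refreshed."""

    def __init__(self):
        self.iters_since_update = 0
        self.equations_changed = False      # e.g. a limiter or controller switched

    def needs_update(self, converged, severe_event, step_size_changed):
        if severe_event or step_size_changed or self.equations_changed:
            return True                     # forced update (short-circuit, trip, new h)
        if not converged:
            self.iters_since_update += 1    # count stalled (non-converged) iterations
            return self.iters_since_update >= MAX_STALL_ITERS
        return False

    def mark_updated(self):
        self.iters_since_update = 0
        self.equations_changed = False

# Example: a sub-system that keeps failing to converge triggers an update at
# the fifth iteration.
m = SubSystemMatrices()
print([m.needs_update(False, False, False) for _ in range(5)])   # [..., True]
```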
5.5.3 Latency
The use of latency to decrease the computational burden has been investigated in Sec-
tion 4.6.3. This technique can be extended and applied at both levels of the hierarchical
scheme detailed in this chapter. Figure 5.5 shows the i-th Satellite sub-domain with its injec-
tors. First, as in the single-level algorithm, the standard deviation of the apparent power of
each injector (SCj or SSij ) can be used to declare it latent and replace its dynamic model with
the sensitivity-based one of Eq. 4.14. Second, the standard deviation of the apparent power
SSi exchanged between the Central and the i-th Satellite sub-domain can be used to declare
the entire sub-domain (including its injectors) as latent.
The use of latency on the injectors has been thoroughly analyzed in the previous chapter.
For the Satellite sub-domains, the same metrics (4.15)-(4.19) and switching procedure of Al-
gorithm 4.1 can be used based on SSi . The sensitivity model used for equivalencing the Satel-
lite sub-domain is developed similarly to (4.14), but starting from the Satellite sub-domain reduced system (5.14). Hence, ignoring the internal dynamics, that is $\widetilde{g}_{Si}(x_{Si}^{k-1}, V_{Si}^{k-1}, V_{Cti}^{k-1}) \simeq 0$, and solving for the sub-domain voltage variation $\Delta V_{Si}$:

$$\Delta V_{Si}^k \simeq -(\widetilde{D}_{Si}^k)^{-1} F_{Si}^k\,\Delta V_C^k \tag{5.16}$$

The terminal voltage variation of the i-th Satellite sub-domain is then obtained as

$$\Delta V_{Sti}^k = H_{Si}\,\Delta V_{Si}^k \simeq -\underbrace{H_{Si}\,(\widetilde{D}_{Si}^k)^{-1} F_{Si}^k}_{G_{Si}}\,\Delta V_C^k \tag{5.17}$$

where $H_{Si}$ is a trivial matrix with zeros and ones whose purpose is to extract the terminal voltage variation from $\Delta V_{Si}^k$, and $G_{Si}$ is the sensitivity matrix relating the Satellite with the Central sub-domain voltage variation.

Selecting an arbitrary switching instant $t^*$, the linear relation (4.13) can be rewritten as:

$$V_{Sti}(t_n) = V_{Sti}(t^*) - G_{Si}(t^*)\left[V_C(t_n) - V_C(t^*)\right] \tag{5.18}$$
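A minimal sketch of how the frozen sensitivity model (5.17)-(5.18) can be evaluated for a latent Satellite sub-domain is shown below; the matrices and voltages are small hypothetical examples, not values from the test systems.

```python
import numpy as np

def sensitivity_matrix(H_Si, Dt_Si, F_Si):
    """G_Si = H_Si * inv(D~_Si) * F_Si of Eq. (5.17), computed at the switching instant."""
    return H_Si @ np.linalg.solve(Dt_Si, F_Si)

def latent_terminal_voltage(V_St_star, G_star, V_C_star, V_C_now):
    """Frozen linear model of Eq. (5.18): V_St(t) = V_St* - G*(V_C(t) - V_C*)."""
    return V_St_star - G_star @ (V_C_now - V_C_star)

# Hypothetical 2x2 example (real and imaginary parts of the boundary bus voltages)
H   = np.eye(2)                               # extracts the terminal voltage components
DtS = np.array([[10.0, 1.0], [1.0, 12.0]])    # reduced Satellite matrix at t*
FS  = np.array([[-0.5, 0.0], [0.0, -0.6]])    # coupling Jacobian at t*
G   = sensitivity_matrix(H, DtS, FS)

V_St_ref = np.array([0.98, 0.02])             # Satellite boundary voltage at t*
V_C_ref  = np.array([1.00, 0.00])             # Central boundary voltage at t*
print(latent_terminal_voltage(V_St_ref, G, V_C_ref, np.array([0.99, 0.01])))
```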
Gathering the linearized equations (5.9)-(5.12) of all sub-domains in a single system gives:

$$\underbrace{\begin{bmatrix}
A_{S11}^k & 0 & \cdots & B_{S11}^k & \cdots & 0 & 0 & \cdots & 0\\
0 & A_{S12}^k & \cdots & B_{S12}^k & \cdots & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots & & \vdots & \vdots & & \vdots\\
-C_{S11} & -C_{S12} & \cdots & D_{S1}^k & \cdots & 0 & 0 & \cdots & F_{S1}^k\\
\vdots & \vdots & & \vdots & \ddots & \vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 0 & \cdots & A_{C1}^k & 0 & \cdots & B_{C1}^k\\
0 & 0 & \cdots & 0 & \cdots & 0 & A_{C2}^k & \cdots & B_{C2}^k\\
\vdots & \vdots & & \vdots & & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & -E_{S1}^k & \cdots & -C_{C1} & -C_{C2} & \cdots & D_C^k
\end{bmatrix}}_{J^k}
\underbrace{\begin{bmatrix}
\Delta x_{S11}\\ \Delta x_{S12}\\ \vdots\\ \Delta V_{S1}\\ \vdots\\ \Delta x_{C1}\\ \Delta x_{C2}\\ \vdots\\ \Delta V_C
\end{bmatrix}}_{\Delta y^k}
= -\underbrace{\begin{bmatrix}
f_{S11}\\ f_{S12}\\ \vdots\\ g_{S1}\\ \vdots\\ f_{C1}\\ f_{C2}\\ \vdots\\ g_C
\end{bmatrix}}_{F^k} \tag{5.19}$$
This system is equivalent to (1.15), presented in Section 1.2.5.2, when using the same
discretization scheme and performing some row and column permutations. That is, the DDM-
based solution presented above is mathematically equivalent to solving the set of nonlinear,
discretized equations (5.9)-(5.12) with a quasi-Newton method. Thus, the observations of
Section 4.7 about the convergence of the algorithm hold true in this case as well.
Similarly to the previous algorithm, the localization techniques modify the equivalent sys-
tem. First, the matrices of each injector and of the network are not updated synchronously
but according to the local convergence of each sub-system (see Section 5.5.2). Thus, the
system is modified into:
$$\begin{bmatrix}
A_{S11}^{k_{S11}} & 0 & \cdots & B_{S11}^{k_{S11}} & \cdots & 0 & 0 & \cdots & 0\\
0 & A_{S12}^{k_{S12}} & \cdots & B_{S12}^{k_{S12}} & \cdots & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots & & \vdots & \vdots & & \vdots\\
-C_{S11} & -C_{S12} & \cdots & D_{S1}^{k_{S1}} & \cdots & 0 & 0 & \cdots & F_{S1}^{k_{S1}}\\
\vdots & \vdots & & \vdots & \ddots & \vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 0 & \cdots & A_{C1}^{k_{C1}} & 0 & \cdots & B_{C1}^{k_{C1}}\\
0 & 0 & \cdots & 0 & \cdots & 0 & A_{C2}^{k_{C2}} & \cdots & B_{C2}^{k_{C2}}\\
\vdots & \vdots & & \vdots & & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & -E_{S1}^{k_C} & \cdots & -C_{C1} & -C_{C2} & \cdots & D_C^{k_C}
\end{bmatrix}
\begin{bmatrix}
\Delta x_{S11}\\ \Delta x_{S12}\\ \vdots\\ \Delta V_{S1}\\ \vdots\\ \Delta x_{C1}\\ \Delta x_{C2}\\ \vdots\\ \Delta V_C
\end{bmatrix}
= -\begin{bmatrix}
f_{S11}\\ f_{S12}\\ \vdots\\ g_{S1}\\ \vdots\\ f_{C1}\\ f_{C2}\\ \vdots\\ g_C
\end{bmatrix} \tag{5.20}$$
where $k_{Sij} \le k$ ($i = 1, \dots, L$, $j = 1, \dots, N_{Si}$) and $k_{Cj} \le k$ ($j = 1, \dots, N_C$) are the iterations at which the j-th injector of the corresponding sub-domain was last updated. Similarly, $k_{Si} \le k$ ($i = 1, \dots, L$) and $k_C \le k$ are the iterations at which the i-th Satellite or the Central sub-domain network matrices were last updated. These matrices are often kept constant over several iterations or even time steps. Therefore, this modified system can be treated as a quasi-Newton method with a special Jacobian update scheme. The error introduced to the Jacobian by the asynchronous update of the sub-domain matrices is small; while it can affect the convergence rate of the iterations, it does not affect the final solution.
Next, by also considering the skip-converged and latency techniques (see Sections 5.5.1 and 5.5.3), the system is modified to:

$$\begin{bmatrix}
A_{S11}^{k_{S11}} & 0 & \cdots & \bar{B}_{S11}^{k_{S11}} & \cdots & 0 & 0 & \cdots & 0\\
0 & A_{S12}^{k_{S12}} & \cdots & \bar{B}_{S12}^{k_{S12}} & \cdots & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots & & \vdots & \vdots & & \vdots\\
-\bar{C}_{S11} & -\bar{C}_{S12} & \cdots & D_{S1}^{k_{S1}} & \cdots & 0 & 0 & \cdots & \bar{F}_{S1}^{k_{S1}}\\
\vdots & \vdots & & \vdots & \ddots & \vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 0 & \cdots & A_{C1}^{k_{C1}} & 0 & \cdots & \bar{B}_{C1}^{k_{C1}}\\
0 & 0 & \cdots & 0 & \cdots & 0 & A_{C2}^{k_{C2}} & \cdots & \bar{B}_{C2}^{k_{C2}}\\
\vdots & \vdots & & \vdots & & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & -E_{S1}^{k_C} & \cdots & -C_{C1} & -C_{C2} & \cdots & D_C^{k_C}
\end{bmatrix}
\begin{bmatrix}
\Delta x_{S11}\\ \Delta x_{S12}\\ \vdots\\ \Delta V_{S1}\\ \vdots\\ \Delta x_{C1}\\ \Delta x_{C2}\\ \vdots\\ \Delta V_C
\end{bmatrix}
= -\begin{bmatrix}
f_{S11}\\ f_{S12}\\ \vdots\\ g_{S1}\\ \vdots\\ f_{C1}\\ f_{C2}\\ \vdots\\ g_C
\end{bmatrix}
+ \begin{bmatrix}
r_{S11}\\ r_{S12}\\ \vdots\\ r_{S1}\\ \vdots\\ r_{C1}\\ r_{C2}\\ \vdots\\ r_C
\end{bmatrix} \tag{5.21}$$
where:

$$r_{Sij},\ r_{Cj} = \begin{cases} f_{Sij},\ f_{Cj} & \text{if the } j\text{-th injector of a sub-domain is latent or converged}\\ 0 & \text{otherwise} \end{cases}$$

$$\bar{B}_{Sij},\ \bar{B}_{Cj} = \begin{cases} 0 & \text{if the } j\text{-th injector of a sub-domain is converged}\\ B_{Sij},\ B_{Cj} & \text{otherwise} \end{cases}$$

$$r_{Si} = \begin{cases} \widetilde{g}_{Si} & \text{if the } i\text{-th Satellite sub-domain is latent or converged}\\ 0 & \text{otherwise} \end{cases}$$

$$\bar{F}_{Si} = \begin{cases} 0 & \text{if the } i\text{-th Satellite sub-domain is converged}\\ F_{Si} & \text{otherwise} \end{cases}$$

$$\bar{C}_{Sij} = \begin{cases} 0 & \text{if the } i\text{-th Satellite sub-domain is converged}\\ C_{Sij} & \text{otherwise} \end{cases}$$
Hence, the modified scheme can be viewed as an Inexact Newton method, for which the condition

$$\frac{\|r^k\|}{\|F(y^k)\|} < \eta < 1, \qquad \forall k \ge 0 \tag{5.22}$$

can be easily checked to ensure the correctness of the decomposed algorithm (see Appendix A).
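This check is cheap to perform at run time; a minimal sketch with hypothetical residual and mismatch vectors is:

```python
import numpy as np

def inexact_newton_ok(r_k, F_yk, eta=0.9):
    """Forcing-term condition of Eq. (5.22): ||r_k|| < eta * ||F(y_k)|| with eta < 1."""
    return np.linalg.norm(r_k) < eta * np.linalg.norm(F_yk)

# Hypothetical residual left by skipped/latent sub-systems vs. the full mismatch
print(inexact_newton_ok(r_k=np.array([1e-5, -2e-5]), F_yk=np.array([1e-3, 5e-4])))
```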
Figure 5.6: Fork-join pattern of algorithm in Fig. 5.3 with localization techniques
On the contrary, latency can introduce some inaccuracies in the simulation response
which will be further analyzed from simulation results in Section 5.8.
The same discretization scheme (second-order BDF) and way of handling the discrete events are used for both the integrated and the proposed methods. For the solution of the sparse systems (the integrated Jacobian or the reduced systems of Eqs. 5.14 and 5.15), the sparse linear solver HSL MA41 [HSL14] was used. For the solution of the much smaller, dense injector linear systems (5.10) and (5.12), the Intel MKL LAPACK library was used. The matrix update criteria are as follows: for the integrated method and Config. I, all the matrices are updated every five iterations until convergence; for Configs. II and III, the matrices of each sub-domain are updated every five iterations unless convergence has already taken place. Finally, the convergence checks defined in Eqs. 1.16a and 1.17 are used, with $\epsilon_g = \epsilon_{f,rel} = \epsilon_{f,abs} = 10^{-4}$. Keeping the aforementioned parameters and solvers constant for both algorithms permits a (more) rigorous evaluation of the proposed algorithm's performance.
The main investigations have been performed on Machine 1 (see Section 2.7), but a
performance comparison with Machines 2 and 3 is given in Section 5.8.4.3. To better under-
stand the behavior of the algorithm, some time profilings of the simulations are given in the
following sections, while a numerical profiling is presented in Appendix D.
[Figure: Nordic variant 1, Scenario 1: voltage evolution (pu) versus time (s) for Scenarios 1a and 1b with the integrated method and Configs. I and II]
Figure 5.9: Nordic variant 1, Scenario 1: Total active power generated by DGs in all DNs
machines, 438 PVs, 730 WTs, and 19419 dynamically modeled loads. The resulting un-decomposed model has 143462 differential-algebraic states. A more detailed system description can be found in Section 1.3 and its one-line diagram in Fig. B.3.
The first decomposition is performed on the boundary between the TN and the DNs, thus creating the Central sub-domain with the TN and L = 146 Satellite sub-domains with the DNs. Next, each sub-domain is decomposed into its network and injectors. More specifically, $N_{S1} = N_{S2} = \dots = N_{S146} = 141$ and $N_C = 24$.
Two scenarios, respectively short and long-term, are considered for this system. Further-
more, each scenario is simulated twice. In the first simulation (referred to as Scenarios 1a
and 2a), the DGs comply with the Low Voltage and Fault Ride Through (LVFRT) requirements
sketched in Fig. 5.7, taken from [Sch08]. In the second simulation (referred to as Scenarios 1b and 2b), the DGs remain connected to the system even if their terminal voltages fall below the limit shown in Fig. 5.7.

[Figure 5.7: LVFRT curve of terminal voltage (pu) versus time (s); DG units may disconnect when their terminal voltage falls below the curve]
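The LVFRT tripping criterion amounts to comparing each DG terminal voltage with a time-dependent envelope after the fault. A minimal sketch is given below; the breakpoints of the envelope are placeholders, not the actual values of Fig. 5.7 / [Sch08].

```python
import numpy as np

# Hypothetical LVFRT envelope: (time after fault in s, minimum admissible voltage in pu).
LVFRT_CURVE = [(0.0, 0.0), (0.15, 0.0), (0.7, 0.85), (3.0, 0.9)]

def lvfrt_limit(t_after_fault):
    """Piecewise-linear lower voltage limit at a given time after the fault."""
    times, volts = zip(*LVFRT_CURVE)
    return np.interp(t_after_fault, times, volts)

def dg_may_trip(t_after_fault, v_terminal):
    """A DG unit may disconnect when its terminal voltage drops below the envelope."""
    return v_terminal < lvfrt_limit(t_after_fault)

print(dg_may_trip(0.5, 0.50), dg_may_trip(2.0, 0.95))   # True, False
```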
The nominal powers of the TN generators are hundreds of MVA, while those of the DN DGs are just a few MVA. Thus, as described in Section 5.3.4, different base powers are considered for the Central ($S_{baseC} = 100$ MVA) and Satellite ($S_{baseS} = 2$ MVA) sub-domains. For the integrated method, the smallest one is used for the entire system to ensure the accurate simulation of the DN DGs. Moreover, different latency tolerances are used for injectors connected to the TN ($\epsilon_{LC}$) and to a DN ($\epsilon_{LS}$), as listed in Table 5.1.
Figure 5.11: Nordic variant 1, Scenario 1: Speedup and scalability computed with Config. II
The system remains stable in both scenarios. However, the final values in Scenario 1b are higher than in 1a.
This can be explained from Fig. 5.9, which shows the total active power generated by DGs in
all DNs. The disconnection of the DGs in accordance with the LVFRT curves leads to losing
approximately 150 MW of distributed generation. This power deficit is covered by importing
more power from the TN, thus leading to depressed TN voltages.
Furthermore, Fig. 5.10 shows the voltage at buses E22 and R09 (see Fig. B.1) in two
different DNs connected to TN buses 1041 and 4043, respectively. DGs are connected to
buses E22 and R09 and the voltage evolution is compared to the LVFRT curve of Fig. 5.7 to
decide whether they will remain connected or not. It can be seen that the DGs in the DN
1041a disconnect at t ≈ 1.2 s, while those in DN 4043a remain connected. This type of protection scheme (LVFRT) relies on the voltage evolution at DN buses, thus illustrating the
necessity for detailed combined dynamic simulations when the DG penetration level becomes
significant.
The speedup and scalability for both Scenarios 1a and 1b are shown in Fig. 5.11. Initially,
the two-level DDM-based algorithm executed on a single core (M = 1) performs similarly
to the integrated. When using more computational cores, the proposed algorithm offers a
speedup of up to 22 times and the system is simulated in approximately 16 s. The summary
of the speedups achieved is given in Tables 5.2 and 5.3.
As regards the scalability of the algorithm, Fig. 5.11 shows that the DDM-based parallel
algorithm executes up to 15 times faster in parallel compared to its own sequential execution.
Scenario 1a scales slightly better than 1b due to the higher dynamic activity caused by the
LVFRT tripping, which disturbs the system and initiates more matrix updates. From the same
figure, it can be seen that the parallel algorithm is more efficient in the range of up to 24 cores.
Figure 5.12: Nordic variant 1, Scenario 2a: Voltage evolution at TN bus 4044
Figure 5.13: Nordic variant 1, Scenario 2a: Voltage evolution at DN bus 1041a − E22
Figure 5.14: Nordic variant 1, Scenario 2: Total active power generation by DGs (Config. II)
Table 5.4: Nordic variant 1, Scenario 2a: Execution times and inaccuracy of simulation
[Figure 5.15: Nordic variant 1, Scenario 2a: number of active injectors versus time, Config. IIIa]
[Figure 5.16: Nordic variant 1, Scenario 2a: number of active DNs versus time, Config. IIIa]
Figure 5.17: Nordic variant 1, Scenario 2a: Speedup computed with Eq. 2.2
As explained in Section 4.9.1.1, collapsing scenarios do not exhibit low enough dynamic activity
for latency to provide large acceleration.
The performance benefits obtained by latency in this scenario are due to the long-term
nature of the voltage collapse allowing the injectors and DNs to become latent between suc-
cessive LTC actions. Figures 5.15 and 5.16 show respectively the number of active injectors
and DNs throughout the simulation for Config. IIIa. It can be seen that very few DNs become
latent, and towards the end of the simulation all injectors and DNs become active.
From the parallel execution timings in Table 5.4, a speedup of 18.3 times can be obtained
while retaining full accuracy of the simulation. A more detailed view is offered in Fig. 5.17,
where the speedup is shown as a function of the number of cores used.
Table 5.5: Nordic variant 1, Scenario 2a: Time profiling of sequential execution (M = 1)

                                    % of execution time
                            Config. I    Config. II    Config. IIIa
  BLOCK A                      20           6.49           7.38
  BLOCK B                       4.6         1.58           4.01
  BLOCK C                       0.03        0.03           0.04
  BLOCK D                       3.96        4.32           4.84
  BLOCK E                      59.73       71.61          63.87
  BLOCK F                       3.24        6              7.18
  Remaining parallel            2.34        3.32           5.08
  Remaining sequential          6.1         6.65           7.6
  TP (100 - TS)                93.87       93.32          92.36
  TS (BLOCK C + Rem. seq.)      6.13        6.68           7.64
[Figure 5.18: Nordic variant 1, Scenario 2a: theoretic and effective scalability of Configs. II and IIIa versus number of cores]
Furthermore, it can be seen that in parallel execution, Config. IIIa offers no further gain compared to Config. II. Even more, with M ≥ 40 cores, Config. II becomes faster than IIIa.
Table 5.5 shows the time profiling of the sequential execution (M = 1), with the percent-
age of time spent in each block of the algorithm in Fig. 5.3. As expected, using Config. III
leads to lower percentage of parallel work. Using the timings of TP and TS , Fig. 5.18 shows
the effective scalability (computed with Eq. 2.1) against the theoretic scalability (computed
with Eq. 2.5). The difference between them is due to the OHC of the implementation. From
Figs. 5.17 and 5.18 it can be seen that the efficiency of the parallelization is very good until
M = 16, while for a higher number of threads the incremental gain is smaller. Nevertheless,
the best results are acquired with M = 44 and are the ones shown in the parallel execution
column of Table 5.4.
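For reference, the theoretic scalability can be estimated directly from the profiled sequential fraction $T_S$ of Table 5.5. The sketch below assumes that Eq. 2.5 is the usual Amdahl-type bound $1/(T_S + T_P/M)$, with $T_S$ and $T_P$ expressed as fractions of the sequential execution time; this reading of Eq. 2.5 is an assumption, since the equation itself is defined in Chapter 2.

```python
def theoretic_scalability(ts_percent, cores):
    """Amdahl-type bound (assumed form of Eq. 2.5): S(M) = 1 / (T_S + (1 - T_S) / M)."""
    ts = ts_percent / 100.0
    return 1.0 / (ts + (1.0 - ts) / cores)

# Sequential fractions T_S from Table 5.5 (Config. II: 6.68 %, Config. IIIa: 7.64 %)
for ts in (6.68, 7.64):
    print([round(theoretic_scalability(ts, m), 1) for m in (6, 12, 24, 44)])
```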
Contrary to Scenario 2a, Scenario 2b is long-term stable. In this (very optimistic) sce-
nario, the DGs remain connected to the DNs throughout the simulation, thus supporting the
system and avoiding the voltage collapse. Figures 5.19 and 5.20 show the voltage evolution
126 CHAPTER 5. PARALLEL TWO-LEVEL SCHUR-COMPLEMENT-BASED DDM
at a TN and DN bus, respectively. The jumps in Fig. 5.20 correspond to LTC moves in the
DN of concern (the shown voltage is not the one controlled by the LTC though).
Figure 5.19: Nordic variant 1, Scenario 2b: Voltage evolution at TN bus 4044
Figure 5.20: Nordic variant 1, Scenario 2b: Voltage evolution at DN bus 1041a − E22
Figure 5.21: Nordic variant 1, Scenario 2b: Absolute voltage error on voltage at TN bus 4044
Figure 5.22: Nordic variant 1, Scenario 2b: Absolute voltage error on voltage at DN bus
1041a − E22
Table 5.6: Nordic variant 1, Scenario 2b: Execution times and inaccuracy of simulation
Their corresponding voltage errors are shown in Figs. 5.21 and 5.22. It can be seen
that the main source of inaccuracy is the shift of LTC actions in the system. However, even
more than with the ASRT events in the HQ system (see Section 4.9.2), such delays are not
considered significant as long as the final voltages are acceptable.
Table 5.6 shows the simulation time, speedup, and maximum inaccuracy over all the bus
voltages compared to the integrated method. From the sequential execution timings, it can
be seen that Config. I is slower than the integrated as in Scenario 2a. Configuration II offers
some small speedup, while Config. III provides a higher speedup in sequential execution
while also introducing some error.
Figure 5.23 shows the number of active injectors and DNs throughout the simulation for
Config. IIIa. It can be seen that in the short-term, until the post-fault electromechanical
oscillations die out, the injectors and DNs remain active. On the other hand, in the long-term
many injectors and DNs exhibit low dynamic activity, and towards the end of the simulation
almost all of them become latent. This verifies the observation that in the short-term, when
Figure 5.23: Nordic variant 1, Scenario 2b: Number of active injectors and DNs
there is high dynamic activity in the system with frequent matrix updates, parallelization is
the main source of speedup. In the long-term, when dynamics with larger time constants dominate, latency is the main source of speedup.
From the parallel execution timings of Table 5.6, it can be seen that all configurations offer
significant speedup when parallelized: up to 15.2 times without any inaccuracy and up to 28.9 times when latency is used with $\epsilon_L \le 0.2$ MVA.
Table 5.7 shows the time profiling of the sequential execution (M = 1), with the percent-
age of time spent in each block of the algorithm in Fig. 5.3. As expected, using Config. III
leads to an overall lower percentage of parallel work. Figure 5.24 shows the speedup as a
function of the number of cores used, while Fig. 5.25 shows the theoretic (using the timings
of TP and TS ) and the effective scalability. From these figures, it can be seen that the effi-
ciency of the parallelization is very good until M = 24, while for a higher number of threads
the incremental gain is small.
Table 5.7: Nordic variant 1, Scenario 2b: Time profiling of sequential execution (M = 1)

                                    % of execution time
                            Config. I    Config. II    Config. IIIa
  BLOCK A                      12.35        5.38           8.39
  BLOCK B                       2.28        3.41           3.18
  BLOCK C                       0.03        0.02           0.05
  BLOCK D                       6.2         4.96           4.59
  BLOCK E                      69.21       72.7           62.78
  BLOCK F                       3.63        4.75           6.98
  Remaining parallel            3.52        7.78           8
  Remaining sequential          2.78        3              6.03
  TP (100 - TS)                97.19       96.98          93.92
  TS (BLOCK C + Rem. seq.)      2.81        3.02           6.08
Figure 5.24: Nordic variant 1, Scenario 2b: Speedup computed with Eq. 2.2
[Figure 5.25: Nordic variant 1, Scenario 2b: theoretic and effective scalability versus number of cores]
[Figure 5.26: Nordic variant 1, Scenario 2b: wall time versus simulation time for the integrated method (1 core) and Config. II on 6, 12, 24, 36, and 44 cores, compared to real time]
[Figure 5.27: Nordic variant 1, Scenario 2b: overrun (s) versus simulation time (s) for Config. II on 6, 12, 24, and 36 cores]
Finally, Fig. 5.26 shows the real-time performance of the algorithm with Config. II. On
this 15000-bus T&D power system, the two-level DDM-based algorithm performs faster than
real-time when executed on 44 cores. With fewer cores, some overrun is observed as shown
in Fig. 5.27.
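The overrun plotted in these figures is the lag of wall-clock time behind simulated time. A minimal sketch of such a monitor is given below, with a hypothetical do-nothing step function standing in for the solver.

```python
import time

def run_with_overrun_monitor(step, t_end, h):
    """Advance the simulation step by step and record how far wall-clock time
    lags behind simulated time (the 'overrun'); a negative lag is clipped to zero."""
    overruns, t_sim = [], 0.0
    wall_start = time.perf_counter()
    while t_sim < t_end:
        step(t_sim, h)                      # one discretized time step of size h
        t_sim += h
        wall_elapsed = time.perf_counter() - wall_start
        overruns.append(max(0.0, wall_elapsed - t_sim))
    return overruns

# Toy usage: a "solver" that sleeps 1 ms per 20 ms time step (faster than real time)
overrun = run_with_overrun_monitor(lambda t, h: time.sleep(0.001), t_end=1.0, h=0.02)
print(max(overrun))
```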
This is the second combined T&D system, also based on the Nordic system, this time ex-
panded with 40 DNs in the Central area. In total, the system includes 3108 buses, 20 large and 520 small synchronous generators, 600 induction motors, 360 type-3 WTs, 2136 voltage-dependent loads, and 56 LTC-equipped transformers. The resulting DAE model has 36504 differential-algebraic states. A more detailed system description can be found in Section 1.3 and its one-line diagram in Fig. B.5.

Figure 5.28: Nordic variant 2 (Config. II): Voltage evolution at three TN buses
First, the system is decomposed on the boundary between the TN and the DNs, thus creating the Central sub-domain with the TN and L = 40 Satellite sub-domains with the DNs. Next, each sub-domain is decomposed into its network and injectors. More specifically, $N_{S1} = N_{S2} = \dots = N_{S40} = 90$ and $N_C = 36$.
A long-term scenario is considered for this system involving the outage of transmission
line 4061 − 4062 in the South area (see Fig. B.5). The system is simulated for 300 s with
a time-step size of one cycle (20 ms). Although large, this disturbance leads to a stable
response. Furthermore, this contingency is simulated twice. In Scenario A, the DNs are pas-
sive and the reactive power set-points of the DGs remain constant throughout the simulation.
On the contrary, in Scenario B, each DN is equipped with a Distribution Network Voltage
(DNV) controller as detailed in [VV13]. The latter is centralized at the DN level. It is based
on Model Predictive Control (MPC) to keep the voltages of some DN buses within desired
limits, by changing the DG reactive powers. Each DN is equipped with such a controller, but
the various controller instances do not exchange any information. The main purpose of this
study was to investigate the contribution of DNV controllers to the bulk transmission system
Figure 5.29: Nordic variant 2 (Config. II): Voltage evolution at two DN buses
dynamic behavior.
As with the previous system, different base powers are considered for the Central ($S_{baseC} = 100$ MVA) and Satellite ($S_{baseS} = 2$ MVA) sub-domains. For the integrated method, the smallest one is used for the entire system to ensure the accurate solution of the DN DGs. Moreover, only the fully accurate Configs. I and II have been considered in this study.
In Scenario A (DNV controllers not in service), DG units operate with constant reactive
power set-points and do not take part in voltage control. This leaves only the traditional
voltage control by LTCs. The long-term evolution of the system, until it returns to steady
state, is shown in Figs. 5.28 and 5.29. It is driven by the LTCs, essentially, in response to
the voltage drops initiated by the line tripping. Overall, 112 tap changes take place until a steady-state
equilibrium is reached.
Figure 5.28 shows the TN voltage evolution at three representative buses of the Central
area. The voltage at bus 1041 is the most impacted but remains above 0.985 pu. All DN
voltages are successfully restored in their dead-bands by the LTCs, which corresponds to a
stable evolution. For instance, Fig. 5.29 shows the voltage evolution at two DN buses: 01a,
controlled by an LTC with a [1.02 1.03] pu dead-band, and 01a − 1171, located further away
in the same DN.
In Scenario B, the same disturbance is considered but the DNs are equipped with the DNV controllers, aimed at keeping the DN voltages between $V_{min}$ = 0.98 pu and $V_{max}$ = 1.03 pu.
This interval encompasses all LTC deadbands, so that there is no conflict between LTC and
DNV controllers. Each DNV controller gathers measurements from its DN, solves an MPC
problem, and every 8 − 12 s (randomly selected action interval) adjusts the reactive power set-points of the DGs.
Figure 5.30: Nordic variant 2 (Config. II): Total reactive power transfer from TN to DNs
[Figure 5.31: Nordic variant 2: speedup of Configs. I and II for Scenarios A and B versus number of cores]
[Figure 5.33: Nordic variant 2, Scenario B: wall time versus simulation time for the integrated method and Config. II on 1, 6, and 12 cores, compared to real time]
[Figure 5.34: HQ: graph of the full network; node colors indicate nominal voltage levels (≤ 16 kV, ≤ 26.4 kV, ≤ 69 kV, ≤ 170 kV, ≤ 270 kV, ≤ 370 kV, > 370 kV)]
Finally, Fig. 5.33 shows the real time performance of the algorithm for Scenario B. In
this test-case, the faster than real-time performance would allow implementing a controller-
in-the-loop structure. That is, the simulator can assume the role of the real system and
communicate the DN measurements (with some simulated measurement noise) to the DNV
controller implementations. Then, the DG set-points are calculated and communicated back
to the simulator. The simulation should be slowed down to match real time in order to test
the performance of these controllers.
In this section, the test-case studied in Section 4.9.2 is revisited with the use of the two-
level DDM. Even though this is a TN, its particular structure with radially connected sub-
transmission systems1 makes it possible to exploit the two-level decomposition. An example is shown
in Fig. B.7. Algorithm 5.1 can be used to identify a star-shaped decomposition of the system
network. Figure 5.34 shows the full network graph, while Fig. 5.35 shows the graph of the Central sub-domain obtained with Algorithm 5.1. The graph analysis and the partitioning were performed using NetworkX, a Python software package for the creation, manipulation, and
study of the structure, dynamics, and functions of complex networks [HSS08].
1 sometimes referred to as distribution in Canada
Figure 5.35: HQ: Graph of the Central sub-domain (387 buses) with Satellite sub-domains
including more than one bus
Figure 5.36: HQ: Graph of the Central sub-domain (963 buses) with Satellite sub-domains
including more than five buses
Figure 5.37: HQ: Voltage evolution at bus 702 with the two DDMs (Config. II)
The graph nodes represent the network buses, and their nominal voltages are distin-
guished by different colors. The graph edges represent network lines (colored black) and
transformers (colored red). Thus, the network is split into the Central sub-domain and 450
Satellite sub-domains. With this splitting, the Central sub-domain consists of 387 buses while
the remaining 2178 buses are located in the Satellite sub-domains. However, this decomposi-
tion leads to many Satellite sub-domains consisting of one or two buses. Such small Satellite
sub-domains do not offer sufficient workload and lead to increased imbalance (especially when the OpenMP static scheduling option is used) and OHC. Hence, a decomposition with larger Satellite sub-domains has been preferred. Figure 5.36 shows the Central sub-domain when only Satellite sub-domains with at least five buses are kept. This choice leads to 80 Satellite sub-domains with a total of 1602 buses, each with a substantial workload.
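A sketch of this kind of graph-based partitioning with NetworkX is given below. It is a simplified stand-in for Algorithm 5.1 (whose details are not reproduced here), applied to a hypothetical toy graph: the meshed core is kept as the Central sub-domain, and components attached to it by a single bus are accepted as Satellite sub-domains only if they contain at least min_size buses.

```python
import networkx as nx

def star_decomposition(g, min_size=5):
    """Split a network graph into a meshed Central part and radial Satellite
    sub-domains attached to it by a single bus (simplified stand-in for Algorithm 5.1).
    Satellites smaller than min_size buses are merged back into the Central part."""
    central = set(nx.k_core(g, k=2).nodes)        # meshed part: strip radial branches
    satellites = []
    for comp in nx.connected_components(g.subgraph(set(g) - central)):
        attach = {u for n in comp for u in g.neighbors(n)} & central
        if len(attach) == 1 and len(comp) >= min_size:
            satellites.append(set(comp))          # one point of connection, big enough
        else:
            central |= set(comp)                  # too small or multi-attached: keep central
    return central, satellites

# Toy example: a meshed 4-bus ring with two radial feeders of different lengths
g = nx.cycle_graph(4)
g.add_edges_from([(0, 10), (10, 11), (11, 12), (12, 13), (13, 14)])   # 5-bus feeder
g.add_edges_from([(2, 20), (20, 21)])                                  # 2-bus feeder
central, sats = star_decomposition(g, min_size=5)
print(sorted(central), [sorted(s) for s in sats])
```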
Since in this system the Satellite sub-domains are not DNs but part of the decomposed TN, the same base power and latency tolerance are used for all sub-domains. That is, $S_{baseC} = S_{baseS} = 100$ MVA and $\epsilon_{LS} = \epsilon_{LC}$ in Config. III. Furthermore, the same contingency as in Section 4.9.2 is used, for comparison purposes.
Figure 5.37 shows the voltage evolution at TN bus 702 with the two algorithms (of Chap-
ters 4 and 5) using Config. II. The dynamic response of the system is exactly the same
in both algorithms as they solve the same DAE system without any approximations. Next,
Fig. 5.38 shows the voltage evolution at the same bus with the two-level DDM under different configurations. Concerning the dynamic response of the system, the observations made in the previous sections hold here as well.
Figure 5.38: HQ: Voltage evolution at bus 702 with the two-level DDM
[Figure 5.39: HQ: absolute voltage error (pu) versus time (s) with Configs. IIIa, IIIb, and IIIc]
[Figure 5.40: HQ: number of active injectors versus time, Config. IIIa]
Figures 5.40 and 5.41 show respectively the number of active injectors and Satellite sub-
domains during the simulation. In the short-term, all components remain active, thus the
main speedup in this period comes from the parallelization of the DDM-based algorithm. In
the long-term, the injectors start switching to latent followed by the Satellite sub-domains.
However, it can be seen that even towards the end of the simulation, many Satellite sub-
domains remain active. This is expected as the latter are part of the TN with large power
exchanges with the Central sub-domain. These power flows, as well as the response of a relatively large number of medium-size hydro generation plants connected to the Satellite sub-domains, cause the latter to frequently exceed $\epsilon_L$ = 0.1 MVA and become or remain active. In Config. IIIc, many more Satellite sub-domains become latent, thus achieving a speedup of
[Figure 5.41: HQ: number of active Satellite sub-domains versus time, Config. IIIa]
[Figure 5.42: HQ: speedup of the two-level DDM versus number of cores for Configs. I, II, IIIa, and IIIb]
Figure 5.43: HQ: Effective scalability with the two decomposition algorithms
[Figure 5.44: HQ: wall time versus simulation time for the integrated method and Config. II on 1, 6, 12, and 24 cores, compared to real time]
Figure 5.43 shows that the two-level DDM scales better than the single-level DDM of Chapter 4. This gain comes from the higher parallelization percentages
achieved by the two-level DDM minus the extra OHC associated with the two-level decom-
position.
Finally, Fig. 5.44 shows the real-time performance of the algorithm with Config. II. Faster
than real-time performance is achieved when executed on 24 or more cores, similarly to the
algorithm of Chapter 4.
5.8.4 Discussion
The capabilities of the proposed two-level algorithm are discussed in this subsection. Its
sequential and parallel performances are outlined, as well as some more results on the UMA
Machines 2 and 3 of Section 2.7.
Similarly to the algorithm of Chapter 4, the main sources of acceleration in sequential exe-
cution (M = 1) are the localization techniques. However, the proposed two-level DDM has
some significant extra OHC compared to the integrated method, as well as to the single-level DDM.
Configuration I does not use any localization techniques and follows the same Jacobian
matrix updates as the integrated. Except for the short-term test case of the first T&D sys-
tem, the sequential execution of Config. I is slower than the integrated. As discussed in
Section 4.9.4.1, this slowdown can be attributed to the extra OHC of the proposed DDM.
On the contrary, Configs. II and III take advantage of the localization techniques de-
scribed in Section 5.5 and outperform the integrated, leading to fast sequential simulations.
The proposed two-level algorithm offers significant speedup in parallel execution. Figures 5.17, 5.24, 5.31, and 5.42 show this speedup as a function of the number of cores used, reaching 22.2 times for fully accurate simulations and much higher values when the latency technique is employed.
Even though the two-level algorithm has higher OHC than the previous algorithm (due
to the more complex decomposition scheme), it has a larger percentage of work done in
parallel (TP ). In addition to the treatment of injectors that was performed in parallel as in
the single-level DDM, this algorithm also treats in parallel the Satellite sub-domain networks.
Thus, the sequential bottleneck coming from the Schur-complement treatment of the interface variables is limited to solving the global reduced system (5.15). The size of the latter sparse linear system is twice the number of buses of the Central sub-domain network. This large percentage of parallel work leads to increased parallelization efficiency. That is, the proposed algorithm scales over a larger number of computational cores than the previous one and makes better use of the available computational resources.
For the previous simulations, Machine 1 was used to “scan” a varying number of cores and
show the performance and scalability of the algorithm. However, the proposed two-level
DDM can provide significant speedup even on smaller UMA machines. For this reason, the
scenario of Section 5.8.1 was executed on Machines 2 and 3 to show the performance of
the algorithm. The dynamic response is not presented as it is identical to the one shown
previously.
Table 5.10: Nordic variant 1, Scenario 2a: Execution times of UMA machines
Figure 5.45: Nordic variant 1, Scenario 2a: Overrun of the algorithm on Machine 3 with
Config. II
Table 5.10 shows the execution times and speedup of the simulation on the aforemen-
tioned laptops. As with the results of Section 5.8.1, in the sequential execution with Config. I,
the proposed two-level DDM shows some slow-down due to its OHC. However, on Machine 2
(resp. 3) the sequential execution offers a speedup of 1.2 (resp. 1.3) for Config. II and 1.6
(resp. 1.9) for Config. IIIb.
In parallel execution, a speedup of 2.0 (resp. 3.6) is achieved with Config. II using the two
(resp. four) cores of the laptop. Configuration IIIb shows a speedup of 2.6 (resp. 5.3), simu-
lating this 14500-bus (modeled with 143000 DAEs) system in 209 s (resp. 88 s). Figure 5.45
shows the overrun of the simulation on Machine 3 in parallel execution. It can be seen that
there is an overrun of at most 3.5 s; thus, real-time simulations with limited overrun are possible for this system.
Overall, the proposed two-level DDM can provide significant speedup on normal laptop computers, allowing fast and accurate power system dynamic studies to be performed without the need for expensive equipment.
5.9 Summary
The DDM-based algorithm presented in this chapter relies on a two-level decomposition. The
first decomposition partitions the network to reveal a star-shaped layout. Then, the injectors
are isolated from each sub-network, similarly to the single-level algorithm of Chapter 4. The
sub-domains formulated on both levels are treated independently and in parallel while their
interface variables are updated hierarchically using a Schur-complement approach. Fur-
thermore, three localization techniques were presented to accelerate the simulation both in
sequential and parallel execution.
This algorithm accelerates the simulation of T&D systems by exploiting their particular structure and employing a hierarchical, two-level, parallel DDM. Moreover, the use of a Schur-complement approach to treat the interface variables at each decomposition level allows the algorithm to retain a high convergence rate. Finally, the sequential bottleneck of the Schur-complement-based approach can be significantly decreased with a proper selection of the Central sub-domain, allowing high scalability to be achieved.
Chapter 6

General conclusion
infrequent matrix update and factorization for each sub-system. The interface variables are
updated using a Schur-complement approach, at each decomposition level, at each iteration.
In Sections 4.7 and 5.6, it was shown that the proposed DDMs are mathematically equiv-
alent to solving the nonlinear DAEs with a unique (integrated) quasi-Newton method. This
feature is due to the Schur-complement approach used to update the interface variables
at each sub-system iteration. This observation allows the extensive theory behind quasi-Newton schemes to be used to assess the convergence of the proposed DDMs.
Both proposed DDMs have been accelerated computationally and numerically.
First, the independent calculations of the sub-systems have been parallelized providing
computational acceleration. This parallelization potential is inherent to DDMs and the main
reason for using such methods. DDMs using a Schur-complement approach to update the
interface variables, such as the ones proposed in this work, suffer from an unavoidable se-
quentiality that can hinder their parallel performance. However, it was shown that selecting a
star-shaped decomposition, exploiting the sparsity of the central sub-domain, and using an
infrequent update of the reduced matrix, minimizes the computational cost of this sequen-
tial task and allows for higher scalability. In this work, the OpenMP shared-memory parallel
computing API has been used to implement both algorithms, thus allowing to exploit modern
multi-core computers. A large multi-core computer, set up and maintained for the purpose of this work, was used to assess the performance of the algorithms.
Second, three numerical acceleration techniques have been used to exploit the locality
of the decomposed sub-systems and avoid unnecessary computations. The first technique considers skipping the solution of sub-systems that have already converged to the required accuracy within one time-step solution. The second technique is the asynchronous update of sub-system matrices and of the reduced systems. The last technique replaces the dynamic models of sub-systems with low dynamic activity by linear, sensitivity-based equivalents computed at the moment of switching. On the contrary, when high dynamic
activity is detected, the original dynamic models are reinstated. In this work, these three
techniques have been extended and enhanced to consider the two-level decomposition al-
gorithm. Moreover, a new metric, stemming from real-time digital signal processing, has
been proposed to quantify the level of sub-system dynamic activity.
Furthermore, it was shown that the proposed DDMs using these acceleration techniques
are mathematically equivalent to solving the nonlinear DAEs with a unique (integrated) Inex-
act Newton method. The first two acceleration techniques do not disturb the accuracy of the
simulation response. However, the third technique, latency, achieves high simulation performance at the cost of some accuracy. In this case, the compromise between speed and accuracy can be easily tuned using only a couple of parameters (the latency tolerance and the observation time window).
The performance and accuracy of the proposed DDMs have been tested on the power
system models presented in Section 1.3. It was shown that the proposed algorithms can
achieve high speedup and scalability, both on scientific computing machines and on
normal laptop computers. Next, the real-time capabilities of the DDMs were examined, and it
was shown that in all systems, applications with “soft” real-time requirements (allowing some
limited overrun) can be envisioned. In addition, for some of the systems studied (Nordic and
HQ), applications demanding “hard” real-time (no overrun) are also possible.
Finally, it was shown that in long-term scenarios, combining the localization and parallel
computing techniques, provides the best simulation performance. On the one hand, when the
system exhibits high dynamic activity (usually during the short-term dynamics), the parallel
computing techniques offer higher speedup as more frequent matrix updates and system
solutions are necessary, which are performed in the parallel sections of the algorithms. On
the other hand, the localization techniques perform better when the system evolves smoothly,
without much dynamic activity (usually during the long-term dynamics).
The algorithms proposed in this work have been implemented in the dynamic simulation
software RAMSES. The latter is currently used by researchers at various universities and
research institutes as their main dynamic simulation and development platform, e.g.:
• Design and validation of Active Distribution Network (ADN) control schemes (University
of Liège – Belgium, École Centrale de Lille – France, University of Costa Rica);
• Robustness and defense plans in mixed AC-DC systems (University of Liège – Belgium,
a research supported by the R&D department of RTE1 );
• Voltage stability analysis and preliminary reactive power reserve planning of the future
German grid (recently launched collaboration with Amprion, the largest TSO in Germany).
To conclude, the current trend in power system dynamic simulation is towards more
detailed, and thus larger and more complex, injector models. In addition, the most noticeable
developments foreseen in power systems involve Distribution Networks (DNs), which are
expected to host a large share of the renewable energy sources. Thus, DN equivalencing
will become increasingly difficult, and the need for detailed representation more acute.
The proposed DDMs decompose power system models in such a way that the future increase
in computational demand, due to the previous observations, will increase the percentage
of work performed in the parallel sections of the algorithms. In this way, Gustafson’s law is validated:
using the same algorithms and computational equipment, it is expected that, with
future demands in power system modeling, the DDMs will provide even higher scalability
and speedup.
6.2 Directions for future work

• One of the main requirements of the proposed two-level DDM is that there is only
one point of connection between the Central and Satellite sub-domains. The benefit
of having Satellite sub-domains with only one point of connection is that during the
elimination procedure for the formulation of the global reduced system, the original
sparsity pattern of the Central sub-domain network is preserved. However, in some
power systems, there exist sub-transmission networks that are connected to the bulk
transmission system at several buses. A modified two-level algorithm can be designed
to exploit these sub-transmission networks as Satellite sub-domains to accelerate the
simulation procedure. In this case, their Schur-complement terms will introduce some
fill-ins to the sparsity of the global reduced system, similarly to the twoports in the
reduced system of the single-level algorithm.
• The latency-localization technique used by both proposed algorithms provides significant
acceleration at the cost of introducing some inaccuracy in the simulated response.
One source of inaccuracy is the linear sensitivity-based model used while a component
is latent. The latter represents only the interface of the injector or the Satellite sub-domain
to the transmission network. That is, the linear model involves and updates only
the two interface states, while all the other states of the model are frozen at the moment
of becoming latent. Consequently, when the model switches from latent back to active,
there is an inconsistency between the interface and the remaining states, which
creates a discrete jump in the equations and can disturb the quasi-Newton method.
To avoid this problem, a new linear sensitivity-based model can be used that
updates all the states of the latent component. In this way, when switching from latent
to active, all the states of the model are up to date and the equations remain consistent.
Nevertheless, this entails the extra cost of updating larger linear models.
• The proposed DDMs can also be used in other power system applications involving
linear systems derived from the same power system models. One such computationally
demanding task is the eigenvalue analysis of large-scale power systems. In [RM09], it
was proposed to exploit the sparsity of the power system Jacobian matrix used for the
calculation of the eigenvalues, in order to accelerate the computation of the rightmost
eigenvalues. The main computational effort of methods like the Implicitly Restarted
Arnoldi with Shift-and-Invert [RM09] is the (repetitive) solution of linear systems of the
form:

(J_S - \sigma I)\, x = b \qquad (6.1)

where J_S is the state-space Jacobian matrix obtained after elimination of the algebraic
states, I is the unit matrix, and \sigma \in \mathbb{C} is the shift. It was shown in [RM09] that, for
better performance, the latter equation can be rewritten in an equivalent augmented (bordered)
form, denoted (6.2), which involves the full unreduced Jacobian J instead of J_S. The benefit
is that J (contrary to J_S) is a sparse matrix with a structure similar to that
of J in Eqs. 4.20 or 5.19, so that fast sparse solvers can be used. However, based on
the observations made in Sections 4.7 and 5.6, an equivalent solution of a system of
the form (6.2) can also be performed by the single-level or two-level algorithms, and thus
parallel computing techniques could be envisaged to accelerate its solution.
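As an illustration of this direction, the following sketch solves (6.1) through a sparse augmented (bordered) system built from generic state/algebraic Jacobian blocks. The block names A, B, C, D, the sizes, and the use of SciPy's SuperLU are assumptions made for illustration only and do not reproduce the thesis notation or code.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(0)
nx, ny = 50, 80                                        # state / algebraic variables
A = sp.diags(-1.0 - rng.random(nx))                    # df/dx (stable diagonal, demo only)
B = sp.random(nx, ny, density=0.05, random_state=1)    # df/dy
C = sp.random(ny, nx, density=0.05, random_state=2)    # dg/dx
D = sp.eye(ny) * 2.0 + sp.random(ny, ny, density=0.05, random_state=3)  # dg/dy

sigma = 0.5 + 1.0j                                     # complex shift
b = rng.random(nx)

# Augmented (bordered) system: eliminating the algebraic unknowns recovers
# (J_S - sigma I) x = b with J_S = A - B D^{-1} C, but only the sparse blocks
# of J are ever formed or factorized.
K = sp.bmat([[A - sigma * sp.eye(nx), B],
             [C,                      D]], format="csc")
lu = spla.splu(K)                                      # factorize once per shift
rhs = np.concatenate([b, np.zeros(ny)]).astype(K.dtype)
x = lu.solve(rhs)[:nx]                                 # = (J_S - sigma I)^{-1} b

# Dense cross-check (affordable only at these small sizes)
JS = A.toarray() - B.toarray() @ np.linalg.solve(D.toarray(), C.toarray())
assert np.allclose((JS - sigma * np.eye(nx)) @ x, b)
```

In a DDM setting, the factorization and solve steps above are exactly the operations that could be distributed over sub-domains.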
Appendices
APPENDIX A
Analysis of Newton-type schemes
A.1 Review
In numerical analysis, Newton’s method (also known as the Newton–Raphson method),
named after Isaac Newton and Joseph Raphson, is a root-finding algorithm that uses the
first few terms of the Taylor series of a function f(y) in the vicinity of a suspected root y_0.
The Taylor series of f(y) about the point y_1 = y_0 + \epsilon is given by:

f(y_1) = f(y_0) + \dot{f}(y_0)(y_1 - y_0) + \frac{1}{2}\ddot{f}(y_0)(y_1 - y_0)^2 + \ldots \qquad (A.1)
Ignoring the higher-order terms, we obtain:

f(y_1) \approx f(y_0) + \dot{f}(y_0)(y_1 - y_0) \qquad (A.2)

This expression can be used to estimate the correction needed to land closer to the
root, starting from an initial guess y_0. Setting f(y_1) = 0 and solving for y_1 - y_0 gives:
y_1 = y_0 - \frac{f(y_0)}{\dot{f}(y_0)} \qquad (A.3)
which is the first-order adjustment to the root’s position. The process can be repeated until it
converges to a fixed point (which is precisely a root) using:
y_{n+1} = y_n - \frac{f(y_n)}{\dot{f}(y_n)} \qquad (A.4)
for n = 0, 1, 2, ....
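As a toy example of iteration (A.4), the following minimal sketch finds the root of an arbitrarily chosen scalar function; the starting point and tolerance are also arbitrary.

```python
# Newton's method on a scalar equation, f(y) = y**2 - 2, whose root is sqrt(2).
def newton_scalar(f, fdot, y0, tol=1e-12, max_iter=50):
    y = y0
    for _ in range(max_iter):
        dy = -f(y) / fdot(y)        # first-order correction of Eq. (A.4)
        y += dy
        if abs(dy) < tol:           # stop when the correction becomes negligible
            break
    return y

print(newton_scalar(lambda y: y**2 - 2.0, lambda y: 2.0 * y, y0=1.0))   # ~1.414213562
```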
This method can be generalized to a system of m algebraic equations in m unknowns,
F(y) = 0. Everything remains the same, except that the first derivative of F is replaced by
the m × m Jacobian matrix J = \frac{\partial F}{\partial y}. Thus, the iterations are:

y_{n+1} = y_n - \left[\frac{\partial F}{\partial y}(y_n)\right]^{-1} F(y_n) \qquad (A.5)
for n = 0, 1, 2, ....
In real applications, the inverse Jacobian matrix is never computed explicitly. Rather than
computing y_{n+1} directly, the correction \Delta y_n = y_{n+1} - y_n is calculated by solving the linear system:
\underbrace{\frac{\partial F}{\partial y}(y_n)}_{J_n} \, \Delta y_n = -F(y_n) \qquad (A.6)
and then updating the variables:
yn+1 = yn + ∆yn (A.7)
When the Jacobian matrix is sparse [TW67], as is the case in power system problems, the
use of sparse linear solvers makes the solution of (A.6) very efficient.
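A minimal sketch of iterations (A.6)–(A.7) with a sparse direct solver is given below. The thesis mentions HSL MA41; here SciPy's SuperLU interface is used purely for illustration, and the two-equation test system is arbitrary.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def F(y):                       # small nonlinear test system F(y) = 0
    return np.array([y[0]**2 + y[1] - 3.0,
                     y[0] + y[1]**2 - 5.0])

def J(y):                       # its (sparse) Jacobian dF/dy
    return sp.csc_matrix([[2.0 * y[0], 1.0],
                          [1.0,        2.0 * y[1]]])

y = np.array([1.0, 1.0])
for _ in range(20):
    lu = spla.splu(J(y))        # factorize the sparse Jacobian
    dy = lu.solve(-F(y))        # solve J_n * dy_n = -F(y_n)      (A.6)
    y += dy                     # y_{n+1} = y_n + dy_n            (A.7)
    if np.linalg.norm(dy) < 1e-10:
        break
print(y)                        # converges to (1, 2)
```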
The most computationally expensive operations in the solution of (A.6) are the calculation of
the Jacobian matrix and its factorization. Thus, to increase the method’s performance, an
infrequent update and factorization can be employed. That is, the Jacobian matrix is kept
the same for several consecutive solutions and is only updated after a predefined number
of iterations or when the method fails to converge. These methods are called quasi-Newton
or perturbed Newton in mathematics [Kel95] and are referred to as Very DisHonest Newton
(VDHN) in power systems [Mil10].
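The VDHN idea can be sketched as follows; the update policy (refactorize every few iterations or on failure) and all names are illustrative choices, not the thesis implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def vdhn_solve(F, J, y0, tol=1e-10, max_iter=100, update_every=5):
    y = np.array(y0, dtype=float)
    lu = spla.splu(J(y))                      # initial factorization
    for k in range(max_iter):
        if k > 0 and k % update_every == 0:   # infrequent Jacobian update
            lu = spla.splu(J(y))
        dy = lu.solve(-F(y))                  # solve with the (possibly outdated) Jacobian
        y += dy
        if np.linalg.norm(dy) < tol:
            return y
    raise RuntimeError("no convergence: update the Jacobian and retry")

# mildly nonlinear test system with solution (1, 2), chosen arbitrarily
F = lambda y: np.array([y[0] + 0.1 * y[1]**2 - 1.4,
                        y[1] + 0.1 * y[0]**2 - 2.1])
J = lambda y: sp.csc_matrix([[1.0,        0.2 * y[1]],
                             [0.2 * y[0], 1.0       ]])
print(vdhn_solve(F, J, [0.0, 0.0]))           # -> approximately [1. 2.]
```

The outdated Jacobian degrades the convergence rate but each iteration avoids a factorization, which is the trade-off exploited in the simulations.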
Both the exact and the quasi-Newton methods have been extensively studied and their
convergence criteria and capabilities have been presented in several publications [Bro70,
BDM73, DM74, GT82, DW84, Kel95].
A.2 Inexact Newton schemes

In some applications, the linear system (A.6) is solved only approximately, yielding a correction \delta y_n that satisfies:

J_n \, \delta y_n = -F(y_n) + r_n, \qquad y_{n+1} = y_n + \delta y_n \qquad (A.8)

where r_n is the inaccuracy (residual) of the linear solution. These schemes are referred to as Inexact Newton (IN) [Cat04].
Combining (A.6) and (A.8) leads to:

J_n \left( \delta y_n - \Delta y_n \right) = r_n \qquad (A.9)
Thus the difference between the IN scheme and the exact Newton method can be measured
by either one of the following relative errors [DES82]:
\frac{\| \delta y_n - \Delta y_n \|}{\| \delta y_n \|} \quad \text{or} \quad \frac{\| r_n \|}{\| F(y_n) \|} \qquad (A.10)
The choice between these two error measures depends on which one is available or can be estimated
during the course of the process.
The criteria required for the IN scheme to converge to the same value as the exact Newton
method have been established in [DES82]. Briefly, if the normal convergence requirements of Newton
methods [DES82] are satisfied and, in addition,

\frac{\| r_n \|}{\| F(y_n) \|} < \eta < 1 \qquad \forall n \ge 0 \qquad (A.11)
then, the IN scheme converges to the same value as the exact Newton.
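For illustration, an Inexact Newton step can be obtained by solving (A.6) with an iterative solver whose relative tolerance plays the role of the forcing term η in (A.11). The sketch below is an assumption-laden example (arbitrary test system, GMRES as the inner solver, SciPy ≥ 1.12 keyword rtol), not the scheme used in the thesis.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def inexact_newton(F, J, y0, eta=0.1, tol=1e-8, max_iter=50):
    """Each linear system J(y_n) dy = -F(y_n) is solved only up to a relative
    residual eta, i.e. ||r_n|| <= eta ||F(y_n)||, which is condition (A.11)."""
    y = np.array(y0, dtype=float)
    for _ in range(max_iter):
        rhs = -F(y)
        if np.linalg.norm(rhs) < tol:
            break
        dy, _ = spla.gmres(J(y), rhs, rtol=eta)   # 'rtol' in SciPy >= 1.12 ('tol' before)
        y += dy
    return y

F = lambda y: np.array([y[0]**2 + y[1] - 3.0, y[0] + y[1]**2 - 5.0])
J = lambda y: sp.csc_matrix([[2.0 * y[0], 1.0], [1.0, 2.0 * y[1]]])
print(inexact_newton(F, J, [1.0, 1.0]))           # converges towards (1, 2)
```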
It is noteworthy that the quasi-Newton methods described in the previous section can also be
analyzed as IN methods. Let us consider the following quasi-Newton (or VDHN) iteration:

\tilde{J}_n \, \delta y_n = -F(y_n), \qquad y_{n+1} = y_n + \delta y_n \qquad (A.12)

where \tilde{J}_n = J_n + \Delta J_n is the outdated Jacobian and \Delta J_n is its error with respect to the correct
(updated) Jacobian. Equation (A.12) can then be rewritten as:

J_n \, \delta y_n = -F(y_n) - \Delta J_n \, \delta y_n \qquad (A.13)

that is, a quasi-Newton iteration is an IN iteration with inaccuracy r_n = -\Delta J_n \, \delta y_n.
APPENDIX B
Test-system diagrams

[Diagram (caption not recovered): substation with transformers Tr1–Tr3, a TN bus and switchable shunt reactors; graphic not reproducible in text.]
[One-line diagram of the Nordic test system: EQUIV., NORTH, CENTRAL and SOUTH areas, generators g1–g20, transmission (TN) and distribution (DN) buses; graphic not reproducible in text.]

Figure B.3: Nordic variant 1: Nordic32 system expanded with 146 Lelystad distribution systems
[Diagram (caption not recovered): distribution system with buses 1101–1175; legend: load bus, DG unit, voltage measurement. Graphic not reproducible in text.]
[One-line diagram of the Nordic test system with the attached distribution systems; graphic not reproducible in text.]

Figure B.5: Nordic variant 2: Nordic32 system expanded with 40 75-bus distribution systems
[Diagrams of the Hydro-Québec system (geographic overview and detailed substation one-line diagrams with their transformers, 121 kV and 25 kV buses, shunt compensation, and the Satellite sub-domains); graphics and captions not recoverable in text form.]
[Map of the interconnected European systems; graphic not reproducible in text.]

Figure B.8: European map with indication of the interconnected ENTSO-E system members (2011); the continental European synchronous area is represented in the PEGASE system
APPENDIX C
RAMSES

C.1 Introduction
Power systems are equipped with more and more controls reacting to disturbances, with either
beneficial or detrimental effects. Thus, static security analysis is no longer sufficient and
dynamic responses need to be simulated. Moreover, larger and larger models are considered,
deriving from the simulation of large interconnections or the incorporation of the sub-transmission
and distribution levels. In addition, longer simulation horizons need to be considered in order to
check the response of the system up to its return to steady state (long-term dynamics can extend
to several minutes after the initiating event). Finally, there is a demand for faster-than-real-time
simulators with look-ahead capabilities (hardware- or controller-in-the-loop tests, closed-loop
remedial actions in control centers, etc.).
To cover this need for faster power system dynamic simulations, a simulator called RAMSES
(which stands for “RApid Multithreaded Simulator of Electric power Systems”) has been developed
at the University of Liège since 2010. The main developers are Petros Aristidou, Davide
Fabozzi, and Thierry Van Cutsem. There are three contributing components to the development
of this software (see Fig. C.1), all of which are tightly coupled to the DDM-based simulation
algorithms used. The first is how the power system is modeled. The second relates
to the acceleration techniques used to provide fast and accurate simulation responses. The
final component is the software implementation, that is, how the simulator is implemented
and how the user interfaces with it.
In the remainder of this appendix, some of these aspects are summarized.
C.2 Power system modeling

In RAMSES, the power system components are described by sets of Differential-Algebraic
Equations (DAEs). Allowing for algebraic equations in the models yields higher flexibility. Furthermore,
the equations can change between differential and algebraic during the simulation,
due to discrete changes in the model. For an injector, the interfacing is shown in Fig. C.3.
When the decomposition presented in Chapter 5 is used, the AC network is also split into
the Central and the Satellite sub-domains.
Both decomposition schemes promote modular modeling. Well-defined rules have
been set for the interfacing of injectors (or two-ports) with the network and of the Satellite
sub-domains with the Central one. Thus, as long as these rules are followed, the power system models
used in RAMSES are modular. That is, components can be easily added, removed, or
combined to form complex models without sacrificing simulation performance.
A tool is provided for the user to program new models that can be included in RAMSES.
Figure C.4 shows the procedure for introducing user-defined models to the simulator. At the
moment of writing this thesis, the following four types of models are supported: torque control
of synchronous machine, excitation control of synchronous machine, injector, and twoports.
Apart from the components described above, there is a separate class of components
called Discrete-time ConTroLlers (DCTLs). These controllers act at discrete times, when a
condition is fulfilled or at multiples of their internal activation period. Their actions are applied
after the simulation time step is completed. Examples of such DCTLs are the LTCs, ASRTs,
and MPC-based DNV controllers seen in Chapters 4 and 5.
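A purely illustrative sketch of the DCTL mechanism described above is given below; the class, fields, and the example condition are hypothetical and do not correspond to the RAMSES implementation.

```python
# Hypothetical discrete-time controller: acts every 'period' seconds, with its
# actions applied only after the current time step has been completed.
class DiscreteController:
    def __init__(self, period):
        self.period = period
        self.next_activation = period
        self.pending_actions = []

    def check(self, t, measurements):
        """Called at the end of each completed time step."""
        if t >= self.next_activation:
            self.next_activation += self.period
            if measurements["voltage"] < 0.95:          # example activation condition
                self.pending_actions.append("raise LTC tap")
        actions, self.pending_actions = self.pending_actions, []
        return actions   # applied before the next step starts

ltc = DiscreteController(period=10.0)
print(ltc.check(10.0, {"voltage": 0.93}))   # ['raise LTC tap']
```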
[Figure: decomposed power system model in RAMSES: the AC network with its injectors, two-ports (connected to two buses), voltage-source converters, and a (future) multi-terminal DC grid; graphic not reproducible in text.]

[Figure C.4: workflow for user-defined models: the user writes a description of the new model in text format; the CODEGEN utility generates Fortran 2003 code; the code is compiled (Intel or GNU compiler) into a library (.dll or .so) and linked with the rest of RAMSES to produce an executable including the new model.]
[Figure C.1: approaches combined in RAMSES for fast dynamic simulations. Exploit parallel processing: employ a DDM-based algorithm and use shared-memory parallel processing techniques to accelerate the solution of the decomposed DAE system. Exploit time-scale decomposition: use a “stiff-decay” (L-stable) integration scheme, use “large” time-steps, use time-averaging to “filter” out fast dynamics and concentrate on the average evolution (for long-term dynamics), and use a proper, ex-post, treatment of discrete events. Exploit the localized response to disturbances: some disturbances affect only a subset of components while the remaining ones are only slightly influenced; converged sub-systems stop being solved, sub-system matrices are updated asynchronously, and sub-systems showing high dynamic activity are classified as active and keep their dynamic models, while sub-systems showing low dynamic activity are classified as latent and are replaced by simple, linear, automatically calculated models. The outcome is the possibility of look-ahead, faster than real-time power system dynamic simulations.]
When considering long-term dynamic simulations (e.g. for long-term voltage stability studies), some
fast components of the response may not be of interest and could be partially or totally omitted
to provide faster simulations. This can be achieved either through the use of simplified
models [VGL06] or with a dedicated solver applying time-averaging [FCPV11].
While model simplification offers a large acceleration with respect to detailed simulation,
some drawbacks exist. First, the separation of slow and fast components might not be possible
for complex or black-box models. Furthermore, there is a need to maintain both detailed
and simplified models. Finally, if both the short- and long-term evolutions are of interest, simplified
and detailed simulations must be properly coupled [VGL06].
At the same time, solvers using “stiff-decay” (L-stable) integration methods, such as BDF,
with large enough time-steps can discard some fast dynamics. Such a solver, applied to a
detailed model, can “filter” out the fast dynamics and concentrate on the average evolution
of the system. The most significant advantage of this approach is that it processes the detailed,
reference model. Furthermore, this technique allows combining detailed simulation
of the short term, by limiting the time-step size, with time-averaged simulation of the long term, by increasing it.
This time-averaging technique, proposed in [FCPV11], is available in RAMSES
for the discretization of the DAEs. However, it was not used in the results shown in this thesis:
a small and constant time-step size was used, to allow for the
assessment of the acceleration provided by the other two techniques only.
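To make the “stiff decay” filtering effect concrete, here is a minimal sketch using backward Euler (the simplest L-stable method) on an arbitrary two-mode linear system; all values are illustrative only.

```python
# Backward Euler with a time step much larger than the fast time constant:
# the fast mode is damped out in one step and the numerical solution follows
# the slow (average) evolution of the system.
import numpy as np

A = np.array([[-100.0, 0.0],      # fast mode (time constant 10 ms)
              [0.0,    -0.5]])    # slow mode (time constant 2 s)
x = np.array([1.0, 1.0])
h = 0.5                           # "large" time step
I = np.eye(2)
for _ in range(10):
    x = np.linalg.solve(I - h * A, x)   # backward Euler step
    # after one step the fast component is ~1/(1 + 100*0.5) ≈ 0.02: effectively filtered
print(x)
```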
Fortran is, for better or worse, one of the main programming languages specifically designed
for scientific numerical computing. It has advanced array-handling features, with succinct
operations on both whole arrays and slices, comparable to MATLAB or Python/NumPy
but much faster. The language is carefully maintained with speed of execution in
mind. For example, pointers are restricted in such a way that it is immediately obvious whether
aliasing may occur. It has advanced support for shared-memory parallel computing
(through OpenMP), distributed-memory parallel computing (through co-arrays and MPI), and
vectorization.
Fortran has a long tradition, which is both an advantage and a disadvantage. On the one hand,
there is a plethora of excellent libraries written in Fortran (BLAS and LAPACK, for example).
On the other hand, it carries much historical baggage aimed at preserving backward compatibility.
Nevertheless, modern Fortran implementations have little to envy from other languages.
When the implementation performs a lot of number crunching, Fortran remains one of
the top choices. That is the reason why many of the most sophisticated simulation codes
running at supercomputing centers around the world are still written in it. By contrast,
Fortran would be a poor choice for writing a web browser, performing communication tasks,
manipulating databases, etc. To each task its tool.
The core of RAMSES, which performs the numerical simulation, is written in Fortran for
increased performance. However, using the C-interoperability features of Fortran, several
interfaces have been developed to facilitate its use.
The most basic interface provided by RAMSES is the command-line mode. The user executes
the program and provides the necessary files for the simulation (test-system
description, contingency description, etc.), either interactively or in batch execution. This
interface is useful for remote execution on systems without graphical interfaces, and for embedding
the simulator in scripts as part of more complex procedures.
RAMSES is also provided as a dynamic library (.dll on Windows or .so on Linux). This option
allows users to include the simulator in their own software or load it in a script (e.g. Python,
Perl, etc.). The connection to the library is performed through C-type calls, and the user can
start, stop, pause, or modify the simulation. Moreover, the calling software has access to all
the outputs of the simulator (through “get” subroutines) and can make decisions based on the
dynamic response of the simulation.
A Java-based Graphical User Interface was developed that allows an easy-to-use, standalone
execution of the simulator. Moreover, the Java language provides compatibility with
multiple platforms (Linux and Windows). The software is provided as a single Java archive
(.jar file), including all necessary executables and libraries to perform simulations and visualize
results. Thus, the software is ready to be used with no installation.
C.4.5 MATLAB
RAMSES can interface with MATLAB in two ways. First, the RAMSES dynamic library can
be embedded in MATLAB scripts as part of more complex processes. These processes
can include the preparation of the dynamic data to be used for the simulations, as well as
post-processing and control actions to be taken based on the simulation output. In this case,
MATLAB is the simulation driver and RAMSES is used as a fast dynamic simulation library.
[Figure C.6: voltage magnitude (pu) on bus 4043 versus time (s), computed by EMTP-RV and RAMSES, with a zoom on the instants following the fault.]
In addition, RAMSES can interface with MATLAB using the MATLAB Engine mechanisms,
for rapid prototyping of power system components and discrete controllers (DCTLs). In this
case, RAMSES acts as the simulation driver and MATLAB is called to compute the response
of the components being developed or tested.
C.5 Validation
The dynamic response of RAMSES was validated against the well-known power system dynamic
simulation software EMTP-RV. The latter is an EMT-type program, while RAMSES uses
the phasor approximation (see Section 1.2). In general, EMT programs are more accurate
than phasor-mode ones, as more detailed models are used and the full waveforms are simulated,
rather than phasors rotating at the nominal frequency [Yan14]. However, EMT simulations
are more time consuming and the power systems studied are usually limited in size.
Two scenarios, based on the Nordic system presented in Section 1.3, were used to compare
the responses of the two programs. For the comparison of the bus voltage evolution, the
fundamental-frequency (50 Hz) component was extracted from EMTP-RV. The simulations were
performed on a multi-core laptop computer (Machine 3), and the single-level, parallel algorithm
of Chapter 4 was used for RAMSES. The results shown in this appendix were part of the
development of a co-simulation algorithm [PAGV14]. A journal paper has been submitted¹.
¹ F. Plumier, P. Aristidou, C. Geuzaine, and T. Van Cutsem, “Co-simulation of Electromagnetic Transients and Phasor Models: a Relaxation Approach”, submitted to IEEE Transactions on Power Delivery, 2015.
[Figure C.7: machine speed (pu) of generator g15b versus time (s), computed by EMTP-RV and RAMSES, with a zoom on the instants following the fault.]
C.5.1 Scenario 1
The disturbance of concern is a three-phase solid fault at t = 1 s on line 4046–4047, near bus
4047, lasting five cycles (at 50 Hz) and cleared by opening the line, which remains open.
Then, the system is simulated over an interval of 10 s. In RAMSES, a time-step size of
one cycle is used, while EMTP-RV uses 100 µs.
Figure C.6 shows the voltage evolution on transmission bus 4043 computed by both
programs. In the zoom of the same figure, it can be seen that the bus voltage in RAMSES
drops immediately after the fault, while in EMTP-RV there is a time constant. This is because
RAMSES, as a phasor-mode simulator, uses only algebraic equations to model the network,
while EMTP-RV uses a more accurate, differential model. Nevertheless, the dynamic response of
RAMSES matches that of EMTP-RV.
Figure C.7 shows the machine speed of generator g15b. Disregarding some higher-frequency
components in EMTP-RV (which are not captured by phasor-mode simulations), the
two curves match.
C.5.2 Scenario 2
The disturbance of concern is a three-phase solid fault at t = 1 s on line 1042–1044, near bus
1042, lasting 10.5 cycles (at 50 Hz) and cleared by opening the line, which remains open.
Then, the system is simulated over an interval of 10 s. In RAMSES, a time-step size of
one cycle is used, while EMTP-RV uses 100 µs.
[Figure C.8: voltage magnitudes V4041, V4044 and V4042 (pu) versus time (s), computed by EMTP-RV and RAMSES.]

[Figure C.9: machine speeds (pu) of generators g6 and g18 versus time (s), computed by EMTP-RV and RAMSES, with zooms on the instants following the fault.]
Figure C.8 shows the voltage evolution on three transmission buses computed by both
programs. It can be seen that the dynamic response of RAMSES matches that of EMTP-RV.
Figure C.9 shows the machine speeds of generators g6 and g18. Disregarding some higher-frequency
components in EMTP-RV (which are not captured by phasor-mode simulations),
the two curves match.
APPENDIX D
Numerical profiling
In this appendix, two examples of numerical profiling of the results presented in Sections 4.9
and 5.8 are shown. These profilings are representative of how the localization techniques
modify the number of operations in the DDM algorithms and are important to understand
their performance.
Table D.1 shows the number of operations performed, over all time steps and iterations, for
the test-case of Section 4.9.2. Since the DDM used is a direct and not a relaxation
method, the number of operations remains the same in parallel and in sequential execution.
As shown in Section 4.7, Config. I is equivalent to the integrated method and, since the same matrix
update and solution criteria are used for both algorithms, the numbers of matrix updates and
solutions are the same. Thus, the integrated method performs approximately 1054 integrated
Jacobian updates and factorizations and 23886 system solutions. Some deviations might
exist due to numerical differences between the solvers used. That is, in the integrated method, the
entire system is solved with the sparse solver HSL MA41, while in the DDM the injectors are
solved using the Intel MKL LAPACK routines.
Comparing Config. I with Configs. II and III, it can be seen that in the latter two the number
of reduced matrix updates and solutions decreases significantly, due to the asynchronous
update (Section 4.6.2) and the skipping of converged sub-systems (Section 4.6.1). The same
observation holds for the injector system updates, factorizations and Schur-complement
term computations, as well as for the injector system solutions.
On the other hand, the number of reduced system Right-Hand Side (RHS) evaluations increases
in Configs. II and III. The reason is that the localization techniques disturb the
convergence rate of the algorithm, as described in Section 4.7. Thus, when the localization
techniques are used, the algorithm requires on average more iterations per time instant to
converge, although each iteration is computationally much cheaper than without the localization
techniques. The number of reduced system RHS evaluations increases
proportionally to the total number of iterations, as these evaluations are needed to check and ensure the
convergence of the algorithm.
Table D.1: Chapter 4, HQ: Numerical profiling of Algorithm 4.3 in test-case of Section 4.9.2 [table body not recoverable from the source; in parentheses, the average value per injector (N = 4601); solutions of the linear equivalent systems are not counted]
Finally, the number of injector system RHS evaluations increases in Config. II and decreases
in Config. III. The increase has the same cause as above: the higher number
of iterations leads to proportionally more RHS evaluations to guarantee the algorithm’s convergence.
In Config. III, however, several of the injector models are replaced by the linear
equivalents described in Section 4.6.3. Since these models are linear, their RHS after
each iteration solution is zero and thus does not need to be computed.
Table D.2 provides the same information, over all time steps and iterations, for the test-case
of Section 5.8.1.2, using the two-level DDM. The integrated method performs approximately
the same number of integrated Jacobian updates and system solutions as the global reduced system
of Config. I. In reality, the integrated scheme needs more iterations per time instant to reach
the desired convergence accuracy. The reason relates to the selection of the base power,
as explained in Section 5.3.4. That is, the integrated scheme uses the smaller
base power Sbase = 2 MVA for the entire system to ensure the accuracy of the Satellite
sub-systems, while the two-level DDM uses SbaseC = 100 MVA and SbaseS = 2 MVA.
Comparing Config. I with Configs. II and IIIa, it can be seen that in the latter two the
numbers of global and satellite reduced matrix updates and solutions are smaller, due to the
asynchronous update and the skipping of converged sub-systems. The same holds for the injector
system updates, factorizations and Schur-complement term computations, as well as for the
injector system solutions.
However, the reduced and injector system updates are more numerous in Config. IIIa than in Config. II.
Table D.2: Chapter 5, Nordic variant 1, Scenario 2b: Numerical profiling of Algorithm 5.4 in test-case of Section 5.8.1.2

Configuration                                          I                    II                   IIIa
Speedup (M = 1)                                        0.9                  1.1                  2.9
Global reduced system (5.15)
  updates and factorizations                           715                  54                   58
Global reduced system (5.15) solutions                 24945                11869                10068
Global reduced system (5.15) RHS evaluations           37276                38724                34553
Satellite reduced system (5.14)
  updates and factorizations                           104390 (715)         7225 (50)            10492 (72)
Satellite reduced system (5.14) solutions              3641970 (24945)      1764638 (12087)      863299 (5913)
Satellite reduced system (5.14) RHS evaluations        5442296 (37276)      5653704 (38724)      5044738 (34553)
Injector system updates, factorizations,
  and Schur-complement computations                    14734005 (714)       1340581 (65)         1877431 (91)
Injector system solutions                              514041615 (24943)    271106152 (13155)    69548423 (3374)
Injector system RHS evaluations                        1279715307 (62097)   1066247854 (51739)   259914368 (12612)
Whenever a Satellite sub-domain or injector is switched from active to latent, an update of its
matrices is forced, to ensure the use of the most accurate linear equivalent model possible.
Thus, in cases with an increased number of switches (many marginally latent components), the
number of matrix updates and factorizations might increase compared to Config. II. Nevertheless,
the significantly smaller number of Satellite reduced system and injector solutions
compensates for this increase and provides speedup.
Finally, the number of Global and Satellite reduced system RHS evaluations increases in
Config. II and decreases in Config. IIIa. The increase is explained as before: when the localization
techniques are used, the algorithm requires on average more iterations per time instant to converge, hence
more RHS evaluations to guarantee the algorithm’s convergence. In Config. IIIa, however,
several of the Satellite sub-domain models are replaced by the linear equivalents, whose
RHS after each iteration solution is zero.
[AFV13a] P. Aristidou, D. Fabozzi, and T. Van Cutsem, “Exploiting Localization for Faster
Power System Dynamic Simulations,” in Proc. of 2013 IEEE PES PowerTech
conference, Grenoble, Jun. 2013. 1.5, 4.4
[AFV14] ——, “A Schur Complement Method for DAE Systems in Power System Dy-
namic Simulations,” in Domain Decomposition Methods in Science and Engi-
neering XXI, ser. Lecture Notes in Computational Science and Engineering,
J. Erhel, M. J. Gander, L. Halpern, G. Pichot, T. Sassi, and O. Widlund, Eds.
Springer International Publishing, 2014, vol. 98, pp. 719–727. 1.5
Bibliography
[AT90] A. Abur and P. Tapadiya, “Parallel state estimation using multiprocessors,” Elec-
tric Power Systems Research, vol. 18, no. 1, pp. 67–73, Jan. 1990. 2.4.2
[AV14b] P. Aristidou and T. Van Cutsem, “Algorithmic and computational advances for fast power system dynamic
simulations,” in Proc. of 2014 IEEE PES General Meeting, Washington DC, Jul.
2014. 1.5, 4.4
[AV14c] ——, “Parallel Computing and Localization Techniques for Faster Power Sys-
tem Dynamic Simulations,” in Proceedings of 2014 CIGRE Belgium conference,
Brussels, 2014. 1.5
[BDM73] C. G. Broyden, J. E. Dennis, and J. J. Moré, “On the Local and Superlinear
Convergence of Quasi-Newton Methods,” IMA Journal of Applied Mathematics,
vol. 12, no. 3, pp. 223–245, 1973. 4.7, A.1
[BDP96] K. Burrage, C. Dyke, and B. Pohl, “On the performance of parallel waveform
relaxations for differential systems,” Applied Numerical Mathematics, vol. 20,
no. 1-2, pp. 39–55, Feb. 1996. 3.3.3
[Bos95] A. Bose, “Parallel solution of large sparse matrix equations and parallel power
flow,” IEEE Transactions on Power Systems, vol. 10, no. 3, pp. 1343–1349,
1995. 2.4.3
[BTS96] S. Bernard, G. Trudel, and G. Scott, “A 735 kV shunt reactors automatic switch-
ing system for Hydro-Quebec network,” IEEE Transactions on Power Systems,
vol. 11, no. 4, pp. 2024–2030, 1996. 1.3.2
[BWJ+ 12] W. Baijian, G. Wenxin, H. Jiayi, W. Fangzong, and Y. Jing, “GPU based paral-
lel simulation of transient stability using symplectic Gauss algorithm and pre-
conditioned GMRES method,” in Proceedings of 2012 Power Engineering and
Automation (PEAM) conference, Wuhan, Sep. 2012. 2.4.1
[Cat04] E. Catinas, “The inexact, inexact perturbed, and quasi-Newton methods are
equivalent models,” Mathematics of Computation, vol. 74, no. 249, pp. 291–302,
Mar. 2004. A.2
[CB93] J. Chai and A. Bose, “Bottlenecks in parallel algorithms for power system stabil-
ity analysis,” IEEE Transactions on Power Systems, vol. 8, no. 1, pp. 9–15, Feb.
1993. 2.4.2, 2.5.1, 3.3.2
[CDC02] K. Chan, R. Dai, and C. Cheung, “A coarse grain parallel solution method for
solving large set of power systems network equations,” in Proceedings of 2002
International Conference on Power System Technology (PowerCon), vol. 4,
no. 1, Kunming, 2002, pp. 2640–2644. 2.4.2, 3.3.2
[Cha95] K. Chan, “Efficient heuristic partitioning algorithm for parallel processing of large
power systems network equations,” IEE Proceedings - Generation, Transmis-
sion and Distribution, vol. 142, no. 6, p. 625, 1995. 3.3.1
[Cha01a] ——, “Parallel algorithms for direct solution of large sparse power system matrix
equations,” IEE Proceedings - Generation, Transmission and Distribution, vol.
148, no. 6, p. 615, 2001. 2.4.2, 3.3.2
[CI90] M. Crow and M. Ilic, “The parallel implementation of the waveform relaxation
method for transient stability simulations,” IEEE Transactions on Power Sys-
tems, vol. 5, no. 3, pp. 922–932, Aug. 1990. 2.4.2, 3.3.3, 3.3.3
[Cie13] S. Cieslik, “GPU Implementation of the Electric Power System Model for Real-
Time Simulation of Electromagnetic Transients,” in Proceedings of the 2nd Inter-
national Conference on Computer Science and Electronics Engineering (ICC-
SEE 2013), Paris, 2013. 2.4.1
[CIW89] M. Crow, M. Ilic, and J. White, “Convergence properties of the waveform re-
laxation method as applied to electric power systems,” in Proceedings of 1989
IEEE International Symposium on Circuits and Systems, no. 4, 1989, pp. 1863–
1866. 3.3.3, 3.3.3
[CJV07] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: Portable Shared
Memory Parallel Programming. MIT Press, 2007. 2.6.2, 2.6.2, 2.6.3
[CRTT11] CRSA, RTE, TE, and TU/e, “D4.1: Algorithmic requirements for simulation
of large network extreme scenarios,” Tech. Rep., 2011. [Online]. Available:
http://www.fp7-pegase.eu/ 4, 3.3.1, 3.3.3
[CZBT91] J. Chai, N. Zhu, A. Bose, and D. Tylavsky, “Parallel Newton type methods for
power system stability analysis using local and shared memory multiproces-
sors,” IEEE Transactions on Power Systems, vol. 6, no. 4, pp. 1539–1545, 1991.
2.4.3, 3.3.2
[DFK96] I. C. Decker, D. M. Falcão, and E. Kaszkurewicz, “Conjugate gradient methods for power system dynamic simulation on
parallel computers,” IEEE Transactions on Power Systems, vol. 11, no. 3, pp.
1218–1227, 1996. 2.4.2
[DS72] H. Dommel and N. Sato, “Fast Transient Stability Solutions,” IEEE Transactions
on Power Apparatus and Systems, vol. PAS-91, no. 4, pp. 1643–1650, Jul. 1972.
1.2.5
[DS11] H. Dag and G. Soykan, “Power flow using thread programming,” in Proceedings
of 2011 IEEE PES PowerTech conference, Trondheim, Jun. 2011. 2.4.3
[FCHV13] D. Fabozzi, A. S. Chieh, B. Haut, and T. Van Cutsem, “Accelerated and Local-
ized Newton Schemes for Faster Dynamic Simulation of Large Power Systems,”
IEEE Transactions on Power Systems, vol. 28, no. 4, pp. 4936–4947, Nov. 2013.
1.2.5.2, 4.1
[FCPV11] D. Fabozzi, A. S. Chieh, P. Panciatici, and T. Van Cutsem, “On simplified han-
dling of state events in time-domain simulation,” in Proc. of 17th Power System
Computational Conference (PSCC), Stockholm, 2011. 1.2.4, C.3.1
[FM04] J. Fung and S. Mann, “Using multiple graphics cards as a general purpose par-
allel computer: applications to computer vision,” in Proceedings of the 17th In-
ternational Conference on Pattern Recognition (ICPR 2004), Cambridge, 2004,
pp. 805–808. 2.4.1
[FMT13] C. Fu, J. D. McCalley, and J. Tong, “A Numerical Solver Design for Extended-
Term Time-Domain Simulation,” IEEE Transactions on Power Systems, vol. 28,
no. 4, pp. 4926–4935, Nov. 2013. 1.2.5
[FP78] J. Fong and C. Pottle, “Parallel Processing of Power System Analysis Problems
Via Simple Parallel Microcomputer Structures,” IEEE Transactions on Power Ap-
paratus and Systems, vol. PAS-97, no. 5, pp. 1834–1841, Sep. 1978. 2.4.2
[FX06] D. Fang and Y. Xiaodong, “A New Method for Fast Dynamic Simulation of Power
Systems,” IEEE Transactions on Power Systems, vol. 21, no. 2, pp. 619–628,
May 2006. 3.3.2
[Gar10] N. Garcia, “Parallel power flow solutions using a biconjugate gradient algorithm
and a Newton method: A GPU-based approach,” in Proceedings of 2010 IEEE
PES General Meeting, Minneapolis, Jul. 2010. 2.4.1
[GHN99] M. Gander, L. Halpern, and F. Nataf, “Optimal convergence for overlapping and
non-overlapping Schwarz waveform relaxation,” in Proceedings of 11th Interna-
tional Conference on Domain Decomposition Methods, Greenwich, 1999, pp.
27–36. 3.3.3
[GJY+ 12] C. Guo, B. Jiang, H. Yuan, Z. Yang, L. Wang, and S. Ren, “Performance Com-
parisons of Parallel Power Flow Solvers on GPU System,” in Proceedings of
2012 IEEE International Conference on Embedded and Real-Time Computing
Systems and Applications (RTCSA), Seoul, Aug. 2012, pp. 232–239. 2.4.1
[GNV07] A. Gopal, D. Niebur, and S. Venkatasubramanian, “DC Power Flow Based Con-
tingency Analysis Using Graphics Processing Units,” in Proceedings of 2007
IEEE PowerTech conference, Lausanne, Jul. 2007, pp. 731–736. 2.4.1
[Gov10] D. Gove, Multicore Application Programming: For Windows, Linux, and Oracle
Solaris. Addison-Wesley Professional, 2010. 2.1, 2.5.2, 3, 4.9.4.2
[GSAR09] J. Giri, D. Sun, and R. Avila-Rosales, “Wanted: A more intelligent grid,” IEEE
Power and Energy Magazine, vol. 7, pp. 34–40, 2009. 4.9.4.3
[GT82] A. Griewank and P. L. Toint, “Local convergence analysis for partitioned quasi-
Newton updates,” Numerische Mathematik, vol. 39, no. 3, pp. 429–448, Oct.
1982. 4.7, A.1
[GWA11] R. C. Green, L. Wang, and M. Alam, “High performance computing for electric
power systems: Applications and trends,” in Proceedings of 2011 IEEE PES
General Meeting, Geneva, 2011. 5.1
[Hed14] W. F. Hederman, “IEEE Joint Task Force on Quadrennial Energy Review,” IEEE,
Tech. Rep., 2014. 2.5.3
[HP00] I. Hiskens and M. Pai, “Trajectory sensitivity analysis of hybrid systems,” IEEE
Transactions on Circuits and Systems I: Fundamental Theory and Applications,
vol. 47, no. 2, pp. 204–220, 2000. 1.2.4
[HR88] M. Haque and A. Rahim, “An efficient method of identifying coherent generators
using Taylor series expansion,” IEEE Transactions on Power Systems, vol. 3, pp.
1112–1118, 1988. 3.3.1
[HSL14] HSL, “A collection of Fortran codes for large scale scientific computation.”
2014. [Online]. Available: http://www.hsl.rl.ac.uk/ 4.9, 5.8
[HW96] E. Hairer and G. Wanner, Solving Ordinary Differential Equations II, 2nd ed., ser.
Springer Series in Computational Mathematics. Berlin, Heidelberg: Springer
Berlin Heidelberg, 1996. 1.2.2.1
[ILM98] F. Iavernaro, M. La Scala, and F. Mazzia, “Boundary values methods for time-
domain simulation of power system dynamic behavior,” IEEE Transactions on
Circuits and Systems I: Fundamental Theory and Applications, vol. 45, no. 1,
pp. 50–63, 1998. 3.3.3
[IS90] M. Irving and M. Sterling, “Optimal network tearing using simulated annealing,”
IEE Proceedings C (Generation, Transmission and Distribution), vol. 137, no. 1,
pp. 69–72, 1990. 3.3.1
[JW01] Y.-L. Jiang and O. Wing, “A note on convergence conditions of waveform relax-
ation algorithms for nonlinear differential-algebraic equations,” Applied Numeri-
cal Mathematics, vol. 36, no. 2-3, pp. 281–297, Feb. 2001. 3.3.3
[KB00] B. Kim and R. Baldick, “A comparison of distributed optimal power flow algo-
rithms,” IEEE Transactions on Power Systems, vol. 15, no. 2, pp. 599–604, May
2000. 2.4.2
[KD13] H. Karimipour and V. Dinavahi, “Accelerated parallel WLS state estimation for
large-scale power systems on GPU,” in Proceedings of 2013 North American
Power Symposium (NAPS), Manhattan, Sep. 2013. 2.4.1
[Kel95] C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations. SIAM, 1995.
A.1
[KM09] S. K. Khaitan and J. McCalley, “Fast parallelized algorithms for on-line extended-
term dynamic cascading analysis,” in Proceedings of 2009 IEEE PES Power
Systems Conference and Exposition (PSCE), Seattle, Mar. 2009. 2.4.2
[KRF94] D. P. Koester, S. Ranka, and G. C. Fox, “A parallel Gauss-Seidel algorithm for sparse power system matrices,” in
Proceedings of the 1994 ACM/IEEE conference on Supercomputing, New York,
1994. 2.4.2
[Kun94] P. Kundur, Power system stability and control. McGraw-Hill, New York, 1994.
1.2
[LB93] M. La Scala and A. Bose, “Relaxation/Newton methods for concurrent time step
solution of differential-algebraic equations in power system dynamic simula-
tions,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and
Applications, vol. 40, no. 5, pp. 317–330, May 1993. 1.2.2, 2.4.2
[LBTC90] M. La Scala, A. Bose, D. Tylavsky, and J. Chai, “A highly parallel method for
transient stability analysis,” IEEE Transactions on Power Systems, vol. 5, no. 4,
pp. 1439–1446, Nov. 1990. 3.3.3
[LDG+ 12] P. Li, C. Ding, F. Gao, H. Yu, X. Guo, Y. Zhou, and C. Wang, “The parallel
algorithm of transient simulation for distributed generation powered micro-grid,”
in Proceedings of 2012 IEEE PES Innovative Smart Grid Technologies (ISGT),
Tianjin, May 2012. 2.4.3
[LDTY11] Z. Li, V. D. Donde, J.-C. Tournier, and F. Yang, “On limitations of traditional
multi-core and potential of many-core processing architectures for sparse linear
solvers used in large-scale power system applications,” in Proceedings of 2011
IEEE PES General Meeting, Detroit, Jul. 2011. 2.4.1, 2.4.3
[LJ15] Y. Liu and Q. Jiang, “Two-Stage Parallel Waveform Relaxation Method for
Large-Scale Power System Transient Stability Simulation,” IEEE Transactions
on Power Systems, pp. 1–10, 2015. 3.3.3
[LL14] X. Li and F. Li, “GPU-based power flow analysis with Chebyshev preconditioner
and conjugate gradient method,” Electric Power Systems Research, vol. 116,
pp. 87–93, Nov. 2014. 2.4.1
[MBB08] J. Machowski, J. Bialek, and J. Bumby, Power system dynamics: stability and
control. John Wiley & Sons, 2008. 1.2, 1.2.2, 1.2.2.1, 1.2.2.1, 1.2.5, 1.2.5.1,
1.2.5.2, 1.2.5.2, 4.4, 5.3.4
[McK04] S. A. McKee, “Reflections on the memory wall,” in Proceedings of the 1st con-
ference on computing frontiers on Computing frontiers, New York, 2004, p. 162.
2.1
[Mil10] F. Milano, Power System Modelling and Scripting, ser. Power Systems. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2010. 1.2.2.1, 1.2.5.1, 1.2.5.2, 1.2.5.2,
A.1
[NAA11] K. M. Nor and M. Abdel-Akher, “Parallel three-phase load flow analysis for large
scale unbalanced distribution networks,” in Proceedings of 2011 International
Conference on Power Engineering, Energy and Electrical Drives, Malaga, May
2011. 2.4.3
[OKS90] T. Oyama, T. Kitahara, and Y. Serizawa, “Parallel processing for power system
analysis using band matrix,” IEEE Transactions on Power Systems, vol. 5, no. 3,
pp. 1010–1016, 1990. 2.4.2
[OO96] Y. Oda and T. Oyama, “Fast calculation using parallel processing and pipeline
processing in power system analysis,” Electrical Engineering in Japan, vol. 116,
no. 5, pp. 85–96, 1996. 2.4.2
[QH13] Z. Qin and Y. Hou, “A GPU-Based Transient Stability Simulation Using Runge-
Kutta Integration Algorithm,” International Journal of Smart Grid and Clean En-
ergy, vol. 2, no. 1, pp. 32–39, 2013. 2.4.1
[RR13] T. Rauber and G. Rünger, Parallel programming: For multicore and cluster sys-
tems. Springer-Verlag Berlin Heidelberg, 2013. 2.2.3
[SA10] J. Singh and I. Aruni, “Accelerating Power Flow studies on Graphics Process-
ing Unit,” in Proceedings of 2010 Annual IEEE India Conference (INDICON),
Kolkata, Dec. 2010. 2.4.1
[Saa03] Y. Saad, Iterative methods for sparse linear systems, 2nd ed. Society for In-
dustrial and Applied Mathematics, 2003. 3.1, 3.2.1, 3.2.3.1, 3.2.3.1, 3.2.3.2,
5.4
[Sch08] J. Schlabbach, “Low voltage fault ride through criteria for grid connection of
wind turbine generators,” in Proceedings of 5th International Conference on the
European Electricity Market, Lisboa, May 2008. 5.7, 5.8.1
[SL85] A. Saleh and M. Laughton, “Cluster analysis of power-system networks for array
processing solutions,” IEE Proceedings C Generation, Transmission and Distri-
bution, vol. 132, no. 4, p. 172, 1985. 2.4.2
[SVCC77] A. Sangiovanni-Vincentelli, L.-K. Chen, and L. Chua, “An efficient heuristic clus-
ter algorithm for tearing large-scale networks,” IEEE Transactions on Circuits
and Systems, vol. 24, 1977. 3.3.1
[SXZ05] J. Shu, W. Xue, and W. Zheng, “A Parallel Transient Stability Simulation for
Power Systems,” IEEE Transactions on Power Systems, vol. 20, no. 4, pp. 1709–
1717, Nov. 2005. 2.4.2, 3.3.2
[TAT81] H. Taoka, S. Abe, and S. Takeda, “Multiprocessor system for power system
analysis,” Annual Review in Automatic Programming, vol. 11, pp. 101–106, Jan.
1981. 2.4.2
[TDL11] J.-C. Tournier, V. Donde, and Z. Li, “Potential of General Purpose Graphic Pro-
cessing Unit for Energy Management System,” in Proceedings of 6th Interna-
tional Symposium on Parallel Computing in Electrical Engineering, Luton, Apr.
2011, pp. 50–55. 2.4.1
[TW67] W. Tinney and J. Walker, “Direct solutions of sparse network equations by opti-
mally ordered triangular factorization,” Proceedings of the IEEE, vol. 55, no. 11,
pp. 1801–1809, 1967. A.1
[UD13] F. M. Uriarte and C. Dufour, “Multicore methods to accelerate ship power system
simulations,” in Proceedings of 2013 IEEE Electric Ship Technologies Sympo-
sium (ESTS), Arlington, Apr. 2013, pp. 139–146. 2.4.3
[VCB92] G. Vuong, R. Chahine, and S. Behling, “Supercomputing for power system anal-
ysis,” IEEE Computer Applications in Power, vol. 5, no. 3, pp. 45–49, Jul. 1992.
2.4.2
[VGL06] T. Van Cutsem, M. E. Grenier, and D. Lefebvre, “Combined detailed and quasi
steady-state time simulations for large-disturbance analysis,” International Jour-
nal of Electrical Power and Energy Systems, vol. 28, pp. 634–642, 2006. C.3.1
[VMMO11] C. Vilacha, J. C. Moreira, E. Miguez, and A. F. Otero, “Massive Jacobi power flow
based on SIMD-processor,” in Proceedings of 10th International Conference on
Environment and Electrical Engineering, Rome, May 2011. 2.4.1
[WC76] Y. Wallach and V. Conrad, “Parallel solutions of load flow problems,” Archiv für
Elektrotechnik, vol. 57, no. 6, pp. 345–354, Nov. 1976. 2.4.2
[WEC14] “Wind Power Plant Dynamic Modeling Guide,” Western Electricity Coordinating
Council (WECC), Tech. Rep., 2014. 1.3.1
[XCCG10] Y. Xu, Y. Chen, L. Chen, and Y. Gong, “Parallel real-time simulation of Integrated
Power System with multi-phase machines based on component partitioning,” in
Proceedings of 2010 International Conference on Power System Technology,
Zhejiang, Oct. 2010. 2.4.3
[YXZJ02] L. Yalou, Z. Xiaoxin, W. Zhongxi, and G. Jian, “Parallel algorithms for transient
stability simulation on PC cluster,” in Proceedings of 2002 International Confer-
ence on Power System Technology, vol. 3, no. 1, Kunming, 2002, pp. 1592–
1596. 2.4.2, 3.3.2
[ZCC96] X.-P. Zhang, W.-J. Chu, and H. Chen, “Decoupled asymmetrical three-phase
load flow study by parallel processing,” IEE Proceedings - Generation, Trans-
mission and Distribution, vol. 143, no. 1, p. 61, 1996. 2.4.2