ISPCS 2008 – International IEEE Symposium on Precision Clock
Synchronization for Measurement, Control and Communication
Ann Arbor, Michigan, September 22–26, 2008
Master Failures in the Precision Time Protocol
Georg Gaderer¹, Stefano Rinaldi², and Nikolaus Kerö³
¹ Research Unit for Integrated Sensor Systems, Austrian Academy of Sciences
² University of Brescia
³ Oregano Systems Design and Consulting GmbH
Abstract – If all clocks within a distributed system share the same
notion of time, the application domain gains several advantages.
Among those are the possibility to implement real-time behavior,
accurate time stamping, and event detection. However, with the
widespread application of clock synchronization another topic has
to be taken into consideration: fault tolerance. The well-known
clock synchronization protocol IEEE1588 (Precision Time Protocol,
PTP) is based on a master/slave principle, which has one severe
disadvantage: the failure of a master automatically requires the
election of a new master. The start of a master election is based
on a timeout and thus takes a certain time span, during which the
clocks are not synchronized and run freely. Moreover, the usage of
a new master also requires new delay measurements, which prolong
the time of uncertainty as well.
This paper analyzes the results of such a master failure and
proposes democratic master groups instead of hot-stand-by masters
to overcome this problem. It is shown by means of simulation that
the proposed solution does not deteriorate the accuracy of the
slave clocks in case of a master failure.
Keywords – Fault tolerance, Clock Synchronization, Computer
Networks, IEEE Keywords
INTRODUCTION
Clocks¹ representing the same notion of time have many
advantages in distributed systems, the most obvious being the
possibility to set coordinated actions such as synchronized
communication. This can be used to establish real-time behavior,
which is compulsory for TDMA schemes. Another application
for synchronized clocks is the identification, ordering, and
quantization in terms of timing of events in a distributed
system. Again, the applications for this approach are widespread;
a well-known one is the LAN eXtensions for
Instrumentation (LXI) [1,2], where test and measurement
devices are synchronized over Ethernet in order to
conveniently set up a-posteriori triggering.
Approaches to reach these synchronicity requirements are
well studied and usually take advantage of communication
networks. Synchronization is done by periodically exchanging
messages to align the clocks with respect to each other.
Synchronizing clocks this way is common, as for example in the
Internet-standard Network Time Protocol (NTP), or in the more
accurate IEEE1588 [2] Precision Time Protocol (PTP)
standard.
State of the art clock synchronization techniques can use
two different paradigms: the master/slave based principle
and the democratic approach. The first method elects one
dedicated master in order to synchronize all other nodes. In
contrast to that, democratic algorithms use the clocks of
several nodes, which are then combined to an agreed clock
value. Obviously both approaches have advantages and
disadvantages: master/slave based clock synchronization is
easy to implement and to debug, and the dependencies within an
environment running such a protocol are simple. On the other
hand, democratic approaches add a certain degree of
complexity to a system but have the advantage that they can
offer fault tolerance. Faulty or malfunctioning clocks can be
sorted out without sacrificing even short-term accuracy.
PTP is based on a master/slave principle: a master, once it
has been elected, synchronizes its slaves via multicast
messages². However, for
a considerable number of applications even a temporary
failure of the clock synchronization is by no means
acceptable. The PTP protocol handles recovery from a failure
by means of the so-called best master clock (BMC)
algorithm; during this phase all slaves within a
synchronization (or multicast) domain remain with free
running, unsynchronized clocks while electing a new master.
This paper proposes an approach where multiple masters are
tied together to a so-called master group, in which one or more
masters may fail without any of the nodes noticing the
failure, so that the synchronization accuracy will not be
deteriorated.
The remainder of this paper is structured as follows: After
an analysis of the state of the art, namely the master election
process in IEEE1588, the approach to synchronize within the
group is elaborated and the proposed system is shown within a
simulation experiment. Finally, a conclusion rounds off the
paper and gives an outlook on future research.
STATE OF THE ART
A. IEEE 1588
The basics of IEEE1588-2002 and of version 2008 are, of course,
specified in the respective standard document [2].
Secondary literature giving an overview is available as well
[3]. As this paper focuses on the fault case in which a
master's clock is considered wrong, special attention has
to be given to the master election process described in the
following.
¹ The work presented in this paper is partly funded by the
European Fund for Regional Development (EFRE).
² Version 2008 of the IEEE1588 standard (approved at the time of
submission of this paper) allows synchronization via unicast messages as
well.
978-1-4244-2275-3/08/$25.00 ©2008 IEEE
hard- and software. For example, one major influencing
factor for this interval is the quality of the clock's oscillator.
Depending on the oscillator class, a degradation of the value
over time has to be considered. Other major influencing factors,
such as aging, can be found in [5].
The oval framed part in the figure shows the application of
the convergence function. As already noted, several proposals
for this convergence function have been discussed in the
literature. The proposals range from simple averaging of all
values, which is obviously the best choice to take advantage
of the statistical distribution of clock readings, to Marzullo
class operations, where not only the actual value of the clock
but also its own estimate of its accuracy is considered [4].
Figure 1: State transition diagram of the PTP protocol
For this master election the standard itself defines a
collapsed state transition diagram rather than a flattened state
diagram. A simple election case, derived from the PTP
specification, is shown in figure 1. It is evident from this
figure that, in case a master fails, a node formerly in slave
mode has to wait at least one synchronization interval to
fall back into the listening (Li,j) state and then two additional
synchronization timeouts until a new master can be elected.
With the IEEE1588 default synchronization interval of 2 s, the
unsynchronized time is therefore 6 seconds. This
estimate does, however, not cover the time the servo needs to
recover from the period without control information. The
situation is even worse because the slave node has to wait
some additional, randomized time until it can initiate a delay
measurement. Until then, only a guess of the
line delay can be taken into account.
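The estimate above is simple arithmetic and can be reproduced with a few lines of Python. This is a minimal sketch: the function name and its parameters are illustrative assumptions, and the randomized wait before the first delay measurement as well as the servo recovery time are deliberately left out, as in the estimate itself.

```python
def unsynchronized_time(sync_interval_s, timeout_intervals=1, election_intervals=2):
    """Lower bound on the time a former slave runs freely after a master failure.

    timeout_intervals:  sync intervals until the slave falls back to listening
    election_intervals: additional sync timeouts until a new master is elected
    Both defaults follow the simple election case described in the text.
    """
    return (timeout_intervals + election_intervals) * sync_interval_s


# IEEE1588 default synchronization interval of 2 s
default_gap_s = unsynchronized_time(2.0)
print(default_gap_s)
```

With a shorter synchronization interval the same function gives a proportionally shorter, but never zero, unsynchronized time.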
Figure 2: Principle of the convergence- (often also referred as
agreement-) function
For the remainder of this paper, a convergence function
that is not based on accuracies is used. It is well
known under the name Fault Tolerant Average (FTA) and was
proposed by Lundelius Welch and Lynch [9]. The function
combines the broadcast clock values in two steps.
First, the values of all received clocks are sorted, and the
clocks with the highest and the lowest value are sorted out and
considered no more. In a second step, the mean of the remaining
clock values is calculated. This
algorithm has the advantage of being simple; however, it
suffers from the disadvantage that only two faulty clocks can
be successfully sorted out. If more clocks fail,
wrong clock values influence the result of the averaging.
Finally, another disadvantage of the
algorithm has to be mentioned: in order to ensure sufficient
fault tolerance, at least four democratic members have to be
set up so that enough clock values remain after sorting out. For
the investigations of this paper, namely the failure of a single
master node, these disadvantages can be neglected.
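The two steps of the FTA can be sketched as follows. This is an illustrative Python sketch, not the implementation used in the simulation; the function name, the `discard` parameter and the sample readings are assumptions made for the example.

```python
def fault_tolerant_average(clock_values, discard=1):
    """Sort the readings, drop the `discard` highest and lowest, average the rest."""
    if len(clock_values) <= 2 * discard:
        raise ValueError("not enough clock values to sort out the extremes")
    ordered = sorted(clock_values)                      # step 1: sort
    kept = ordered[discard:len(ordered) - discard]      # drop both extremes
    return sum(kept) / len(kept)                        # step 2: mean of the rest


# Hypothetical readings of six clocks; the fifth one is faulty.
readings = [100.0, 101.0, 99.0, 100.0, 250.0, 100.0]
agreed = fault_tolerant_average(readings)
plain_mean = sum(readings) / len(readings)
print(agreed, plain_mean)
```

In this example the faulty reading is the highest value and is therefore discarded, so the agreed value stays close to the healthy clocks, whereas the plain mean is pulled away by the outlier.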
B. Democratic algorithms
The principle of democratic clock synchronization
algorithms is in clear contrast to the approach of
IEEE1588: In general, this type of synchronization approach
considers all clocks rather than the value of a single dedicated
master. This technique is implemented in such a way that all
democratic synchronizers broadcast their clock values at
predefined times. A so-called convergence function is
subsequently used in order to find an agreement on the common
time.
In terms of the convergence function, several approaches
have been proposed; an extensive overview can be found in
[4]. Figure 2 shows an application of such a convergence
function. As depicted, every node runs free in its pure phase;
the dot-dashed line represents the clock value, whereas the
dashed lines refer to the accuracy interval. The start-size as
well as the growth of this interval is defined by the system's
PROBLEM ANALYSIS AND PROPOSED SOLUTION
for large networks. Furthermore, fully democratic networks
suffer from a highly increased communication load, as every
node necessarily needs to broadcast its clock value at a given
time to all others. Such a communication requirement is not
always easy to meet in networks used for low cost embedded
systems.
On the other hand, a solution based on more intelligent
slaves could be developed. In this case each slave has to be
able to synchronize its local clock using time information
collected from more than one master. This solution has some
problems that limit its usage. A dedicated protocol
that modifies the PTP stack has to be implemented in each
node of the fault tolerant network in order to
allow the exchange of the time data among the masters.
Generally speaking, this is not feasible for several reasons.
Pre-existing PTP nodes in a plant cannot be connected to such
networks because they are not able to work in a multi-master
environment. Hence, all nodes have to feature dedicated
hard- and software and are thus more expensive devices.
Therefore, a hybrid approach for Ethernet, which was already
proposed in previous publications, is used [8]. This approach
foresees a group of high performance masters, synchronizing
themselves using a democratic scheme. This group appears as
one single (ordinary) master if viewed from the perspective
of an IEEE1588 slave, as shown in figure 4. The network
element which is supposed to keep the agreed time and
transmit it by means of multicast synchronization packets to
all slaves is the switch, which has to be designed with special
care for fault tolerance, by duplicating critical hardware. This
is also reasonable because the switch is the single point of
failure in an Ethernet network. In contrast to the previous
work, this approach directly tackles the problem of master
failure and is moreover simplified by accuracy-less clock
synchronization. Furthermore, the dedicated democratic
synchronization stack is implemented only in the high
performance master group sub-network, whereas the rest of
the network can use the standard PTP stack. Thus the fault
tolerant network can be built from standard PTP
devices, even ones already present in the plant, without the need
to develop ad-hoc devices at slave level. The democratic
stack is derived from the IEEE1588 protocol and keeps as much
compatibility with the standard as possible. For instance, the
time information is exchanged among the democratic nodes
by means of synchronization messages whose structure is the
same as that of PTP sync messages. Additional information about
the democratic stack can be found in [12].
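The role of the switch as the single visible master can be illustrated with a small sketch. The class `MasterGroupSpeaker`, its method names and the sample readings (in nanoseconds) are hypothetical and are not taken from the democratic stack described in [12]; the agreement step uses the fault tolerant average with one value discarded at each end and needs at least three readings.

```python
from statistics import mean


def agreed_time(readings):
    """Fault tolerant average: drop the highest and lowest reading, average the rest."""
    ordered = sorted(readings)
    return mean(ordered[1:-1])


class MasterGroupSpeaker:
    """Hypothetical stand-in for the fault tolerant switch.

    It collects the latest clock value of every group member and hands
    a single agreed time to the IEEE1588 slaves.
    """

    def __init__(self):
        self.readings = {}

    def on_group_sync(self, master_id, clock_value):
        # Remember the most recent value broadcast by this group member
        self.readings[master_id] = clock_value

    def sync_for_slaves(self):
        # Time that would be placed into the sync message for the slaves
        return agreed_time(self.readings.values())


speaker = MasterGroupSpeaker()
for master_id, value_ns in enumerate([1000, 1002, 998, 1001, 999, 1000]):
    speaker.on_group_sync(master_id, value_ns)
before = speaker.sync_for_slaves()

speaker.on_group_sync(5, 25000)  # assumed fault: master 5 drifts far away
after = speaker.sync_for_slaves()
print(before, after)
```

In this toy example the time handed to the slaves moves by half a nanosecond although one group member is off by 24 µs, which is the kind of transparency the master group is meant to provide.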
The PTPv1 standard enables clock synchronization with
high accuracy while limiting both the computational effort of
the devices and the network bandwidth utilization. In terms of
synchronization, no additional management of the nodes is
required. For this reason it is widely used in low-cost sensor
network applications. However, the master/slave hierarchy,
implemented by the PTP stack to simplify the data exchange,
is the main weakness, as described in the previous section.
Figure 3 shows a simulation of a master/slave based
PTPv1 system, where after a long settling-in period (longer than
1.5 hours) an artificial error has been injected into the active
master, which triggers the master election. In detail, this error
was forced by cutting off the communication link of the
master, so that it stops transmitting timing information to the
slave nodes. The simulated network is composed of 4 PTP nodes, a
master and 3 slaves. The synchronization interval is 2 s. For
the simulation parameters, typical values for a low cost real
world system have been taken, which have proven to produce
reasonable results [10]. It can be seen that the maximum error
is around 20 µs during the approximately 90 second failure
period. The relatively high offset (between simulation seconds
5720 and 5760 (5780)) is due to the lack of a delay
measurement and the use of a guess for the communication delay.
The 2008 version of the IEEE 1588 standard provides for a shorter
synchronization interval. Thus the maximum time offset
among the clocks due to a master fault can be reduced but not
completely removed, because during the master election the
nodes are still running freely.
[Figure 3 plot: "Simulated Fault of a IEEE 1588 Master"; clock offset (s, scale x 10^-6) over simulation time (s) for PTP Clock 0 (Master 1), PTP Clock 1 (Master 2), PTP Clock 2, and PTP Clock 3]
Figure 3: Behavior of IEEE1588 v1 slaves in case of master failure
The problem with master failures in IEEE 1588 networks
is quite obvious. It can be tackled using several approaches.
On the one hand, a whole network could be synchronized by
a democratic approach. However, this has
several drawbacks. First of all, it is convenient to have
dedicated masters with fault tolerant hardware, or at
least hardware better suited for synchronization, such as
OCXOs, TCXOs or even rubidium or GPS sourced clocks.
To equip all nodes with expensive hardware is not justifiable
The master group approach described in the
previous section has been tested by means of a
simulation framework for synchronization that was already
presented and evaluated in [8]. In this paper the framework has
been adapted and configured in order to test the behavior of
the synchronization (PTPv1, democratic and master group) in
case of a single master fault.
B. Simulation Setup for Master Fault Evaluation
Figure 5 shows the network used to test the algorithm.
The network itself is composed of two different sub-networks:
the first implements a democratic algorithm for
synchronization, the second a master/slave approach (IEEE
1588). In this setup the sub-networks are interconnected by
means of a switch. This solution allows several tests to be
performed by simply changing the configuration parameters of the
same network topology.
For instance, it is possible to test the synchronization
accuracy of PTP or democratic algorithms concurrently on
the same physical network using different multicast groups, or
to test the master group approach by joining the two
sub-networks. The latter is done by the so-called master group
speaker, which is the aforementioned fault tolerant switch.
During the implementation of a framework for testing
synchronization techniques, great attention has to be paid to
the development of the oscillator model. The parameters assigned
to this model are typical values of a low cost oscillator for
sensor networks [10]:
• Oscillator Frequency: 10 MHz
• Oscillator class: 100 ppm
• Aging Factor: 1 × 10^-10 1/day
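To give a feeling for these numbers, the following Python sketch builds a purely deterministic oscillator from the three parameters. It is not the oscillator model of the simulation framework: the class name, the uniform draw of the initial deviation within the class limits, and the linear aging term are assumptions made for illustration.

```python
import random

NOMINAL_HZ = 10e6          # oscillator frequency
CLASS_PPM = 100.0          # oscillator class: limit of the initial deviation
AGING_PER_DAY = 1e-10      # relative frequency change per day
SECONDS_PER_DAY = 86400.0


class Oscillator:
    def __init__(self, rng):
        # Assumption: initial relative deviation uniform within the class limits
        self.initial_deviation = rng.uniform(-CLASS_PPM, CLASS_PPM) * 1e-6

    def relative_deviation(self, t_s):
        # Assumption: aging adds a linear frequency drift over time
        return self.initial_deviation + AGING_PER_DAY * t_s / SECONDS_PER_DAY

    def frequency_hz(self, t_s):
        return NOMINAL_HZ * (1.0 + self.relative_deviation(t_s))

    def free_running_offset(self, t_s):
        """Time error accumulated over t_s seconds without any correction."""
        # Integral of the relative deviation from 0 to t_s
        aging_rate = AGING_PER_DAY / SECONDS_PER_DAY
        return self.initial_deviation * t_s + 0.5 * aging_rate * t_s ** 2


osc = Oscillator(random.Random(1))
osc.initial_deviation = 100e-6          # force the worst case of the class
worst_case_6s = osc.free_running_offset(6.0)
print(worst_case_6s)
```

Under these assumptions a single clock at the edge of its class drifts by about 600 µs against ideal time in 6 s of free running; the aging term is negligible on this time scale.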
The reliability of the simulation results can be improved by a
more accurate oscillator model which also considers
stochastic phenomena, like the one described in [6]. The
experiments presented in this paper are used to test the
behavior of the master group approach during a group
member fault condition. In particular, this paper handles the
problem of a permanent master failure. For this
reason the network is composed of 6 nodes for the
democratic and 4 for the PTP sub-network. The number of
nodes for the democratic backbone is implicitly defined by
the FTA algorithm. As this algorithm removes two nodes, at
least three have to remain in order to obtain useful results.
This way the synchronization accuracy obtained by the two
techniques can be directly compared.
Figure 4: Master Group Approach
A. Simulation Environment
The analysis of the behavior of the synchronization algorithm
during fault conditions should be made in a controlled and
repeatable environment, a requirement which is perfectly met by a
simulator. For several reasons it is not easy to make all
feasibility tests in a real network. First of all, it is not
possible to stop a full-fledged user network for such experiments,
because stopping production just for an
experiment has to be avoided. This makes large-scale experiments
almost impossible. On the other hand, implementing a dedicated
experimental network with hundreds of nodes linked to each other
in a special topology is too expensive and would take too much
time. Moreover, it is difficult in such a network
to create the fault condition in a repeatable way in order to
record the timing data. A great part of these problems can be
solved using a simulation environment. The user can
easily modify the number of nodes of the network under test,
its topology, and the complexity of the model, and adapt the
simulation time to his or her needs. It is also easy and often
helpful to reduce the simulation speed when it is important to
catch the details, or to increase it in order to reduce the
computational effort.
A discrete event simulator, OMNeT++ [7] (available
free for academic use under the Academic
Public License, APL), has been chosen because it is an open
source environment and it is easy to adapt the model to
describe several devices. This simulation tool allows the
development of network device models using C++ code. The
interconnections among the sub-models in OMNeT++ are
defined by means of a topology description language, the
NED language. The developer can decide on the complexity
level of each model in order to fit the requirements of
the application without overloading the computational
capability of the simulation host. For instance, it is possible to
create a model for both the hard- and the software part of the
system in case a great accuracy is needed, or to simplify
otherwise. This feature is very useful during the
implementation of a synchronization simulation framework,
where it is very important to select the right level of detail of
each system part.
[Figure 5 diagram: a democratic subnetwork (DemClk_0 to DemClk_5) and a PTP subnetwork (PTPClk_0 to PTPClk_3), interconnected by a fault tolerant switch]
Figure 5: Simulation setup for master fault evaluation
[Figure 7 plot, panel title: "Distribution of the Clock offset around the Failure Event (PTP Clock 1)"]
SIMULATION RESULTS
The simulation environment addressed in the previous
section is subsequently used to set up the following case: A
master group is used to synchronize a group of PTP slaves.
By means of error injection one of the masters is set to a
faulty state. The assumed fault model changes the
jitter and the clock rate of the node as well as its ability to
control the increment. Figure 6 shows the clock values of this
master group member before and after the failure. For
illustration purposes the jitter is shown via a sliding window
function, which is applied piecewise over 100 two-second
samples of the offset to the absolute ideal simulation timescale.
The dotted (red) line in this picture denotes the time window
in which the failure condition is triggered. It can be clearly
seen how the single clock slides away from the rest of the
synchronization domain. The clock offset is aligned to the
ideal master group agreement time, i.e., the virtual master
time for all slaves.
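One possible reading of this windowing is sketched below in Python. The function name, the bin layout and the choice of non-overlapping windows are assumptions; the text only states that windows of 100 two-second samples are used.

```python
def windowed_distribution(offsets, window=100, bins=10, lo=-1e-5, hi=1e-5):
    """Normalized histogram of the offset samples for each window of `window` samples."""
    width = (hi - lo) / bins
    result = []
    # Assumption: consecutive, non-overlapping windows ("piecewise")
    for start in range(0, len(offsets) - window + 1, window):
        counts = [0] * bins
        for x in offsets[start:start + window]:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)   # clamp outliers into the edge bins
            counts[idx] += 1
        result.append([c / window for c in counts])
    return result


# Hypothetical offsets: 100 samples near 1 µs, then 100 samples near 6 µs
offsets = [1e-6] * 100 + [6e-6] * 100
dist = windowed_distribution(offsets, bins=4)
print(dist)
```

Each row of the result corresponds to one window position (200 s of simulation time at a 2 s sample spacing) and sums to one, which matches the probability axis of the distribution plots.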
[Figure 7 plot axes: probability, clock offset (s), window position (s)]
Figure 7: Behavior of the IEEE1588 slave during a master group
member failure
The time offset of each node from node 0 of its sub-network
(democratic master group and IEEE1588 slaves) is
shown in table 1. The results are reported before and after the
failure of the democratic clock #5. It can be clearly seen that
the fault of a member of the master group does not significantly
change the synchronization accuracy of the IEEE 1588
slaves.
On the other hand, the failure of a democratic member has
some effect on the master group synchronization accuracy, as
shown by the slight increase of the standard deviation. The
failure decreases the number of nodes in the group, which reduces
the noise filtering achieved by the democratic algorithm.
[Figure 6 plot, panel title: "Distribution of the Clock offset around the Failure Event (Faulty Democratic Clock 2)"]
Network     Offset (Before) [ns]     Offset (After) [ns]
Clocks      Mean     Std. Dev.       Mean     Std. Dev.
DemClk 0      0         0              0         0
DemClk 1      0        34              3        35
DemClk 2      2        35              6        37
DemClk 3      0        34              1        36
DemClk 4     -1        34              5        35
DemClk 5      6        35              -         -
PTPClk_0      0         0              0         0
PTPClk_1     -6        45             -7        42
PTPClk_2     -4        43              0        44
PTPClk_3     -3        43             -9        44
Table 1: Synchronization offset during a master group member failure
The democratic algorithm provides a higher synchronization
accuracy than the IEEE1588 protocol, as shown in the table (the
standard deviation is 10 ns lower). As discussed in the
previous section, a more accurate oscillator model that also
takes non-deterministic noise into account has to be developed
in order to obtain more reliable results about the
synchronization accuracy from the simulator.
[Figure 6 plot axes: probability, clock offset (s, scale x 10^-5), window position (s)]
Figure 6: Failure and drift of a Master Group Member
Far more interesting than this obvious result is the influence
of clock failures on a simple PTP slave. This condition is
shown in figure 7 where the identical setup has been used.
This qualitative analysis shows that the slave is obviously not
at all influenced by a single master failure.
Figure 8: Behavior of master group member during power up
The simulator is also useful to analyze the behavior of
synchronization algorithms in particular situations, which are
difficult to cover in real-life networks. An example for such
a scenario is the power-up phase of a widely distributed
network, which is difficult to cover by measurement. In
this case the simulation can assist to investigate and to
improve synchronization performance. For instance, figure 8
shows that the democratic algorithm of the master group nodes,
described in the previous section, has a very long
convergence time at start-up. This can be a problem for some
applications and thus, more investigations are needed.
CONCLUSION AND OUTLOOK
Especially in fault sensitive environments the loss of clock
synchronization can have fatal consequences. In the case of
master/slave based clock synchronization as defined in
IEEE1588 it leads to a master re-election, during which the
slaves are not synchronized. It is shown based on simulation
results that the offset during this time can reach excessively
high values. This problem can be tackled by introducing the
concept of master groups, a cluster of IEEE1588 enabled
masters, which solve this problem by synchronizing as a
group via a democratic algorithm. In this case the fault of a
master can be handled transparently, without any observable
effect at the slave side.
Further investigations will cover a comparison between
several synchronization algorithms such as those based on
Marzullo functions as well as an analysis of the startup
behavior. Moreover, it has to be investigated how the
combination of several masters to a group can improve the
quality of the reference which is transferred to the slaves.
REFERENCES
[1] J. C. Eidson, The Application of IEEE 1588 to Test &
Measurement Systems, LXI Whitepaper, 2005
[2] LXI Standard. LXI Consortium. www.lxi-consortium.org
[3] IEEE1588 D2.2 Active Approved Draft Standard for a Precision
Clock Synchronization Protocol for Networked Measurement and
Control Systems, IEEE, Mar 2008
[4] John C. Eidson, Measurement, Control, and Communication
Using IEEE 1588 (Advances in Industrial Control), April
2006, Kindle Book
[5] Schmid, U. (ed.), Special Issue on the Challenge of Global Time in
Large-Scale Distributed Real-Time Systems, 1997
[6] Georg Gaderer, Patrick Loschmidt, Aneta Nagy, et al., "An
Oscillator Model for High Precision Synchronization Protocol
Discrete Event Simulation", Paper in Print, 2007 International
Conference on Precision Time and Time Intervals, Long Beach,
2007
[7] OMNeT. OMNeT++ Discrete Event Simulation System.
www.omnetpp.org
[8] Praus F., Granzer W., Gaderer G., Sauter T. "A Simulation
Framework for Fault-Tolerant Clock Synchronization in
Industrial Automation Networks". In Proceedings of the IEEE
Conference on Emerging Technologies & Factory Automation
(ETFA07), pp. 1465-1472, Patras, Greece, Sept. 2007
[9] Jennifer Lundelius Welch and Nancy Lynch, A new fault-tolerant
algorithm for clock synchronization, Inf. Comput., vol. 77, no. 1,
pp. 1-36, 1988
[10] Gaderer, G., Loschmidt, P., and Sauter, T. "Quality Monitoring
in Clock Synchronized Distributed Systems". In Proceedings of
the IEEE International Workshop on Factory Communication
Systems (WFCS06), pp. 13-21, Torino, Italy, June 2006
[11] Gaderer, G., Sauter, T., and Höller, R. "Strategies for Clock
Synchronization in Powerline Networks". In Proceedings of the
3rd International Workshop on Real Time Networks (RTN04), pp.
53-56, Catania, Italy, June 2004
[12] Gaderer, G., Höller, R., Sauter, T., and Muhr, H. "Extending IEEE
1588 to Fault Tolerant Clock Synchronization". In Proceedings of
the IEEE International Workshop on Factory Communication
Systems (WFCS04), pp. 353-357, Austria, September 2004
[13] Gaderer, G., Loschmidt, P., and Sauter, T. "IEEE 1588 Real-Time
Networks with Hybrid Master Group Enhancements". In
Proceedings of the 4th International Workshop on Real Time
Networks (RTN05), pp. 25-29, Spain, July 2005