Conformance Testing For Stochastic Cyber-Physical Systems
Conformance is defined as a measure of distance between the behaviors of two dynamical sys-
tems. The notion of conformance can accelerate system design when models of varying fidelities
are available on which analysis and control design can be done more efficiently. Ultimately, con-
formance can capture distance between design models and their real implementations and thus
aid in robust system design. In this paper, we are interested in the conformance of stochastic
dynamical systems. We argue that probabilistic reasoning over the distribution of distances be-
tween model trajectories is a good measure for stochastic conformance. Additionally, we propose
the non-conformance risk to reason about the risk of stochastic systems not being conformant.
We show that both notions have the desirable transference property, meaning that conformant
systems satisfy similar system specifications, i.e., if the first model satisfies a desirable specifi-
cation, the second model will satisfy (nearly) the same specification. Lastly, we propose how
stochastic conformance and the non-conformance risk can be estimated from data using statis-
tical tools such as conformal prediction. We present empirical evaluations of our method on an
F-16 aircraft, an autonomous vehicle, a spacecraft, and Dubin’s vehicle.
1 Introduction
Cyber-physical systems (CPS) are usually designed using a model-based design (MBD) paradigm.
Here, the designer models the physical parts and the operating environment of the system and then
designs the software used for perception, planning, and low-level control. Such closed-loop systems
are then rigorously tested against various operating conditions, where the quality of the designed
software is evaluated against model properties such as formal design specifications (or other kinds of
quantitative objectives). Examples of such property-based analysis techniques include requirement
falsification [1–5], nondeterministic and statistical verification [6–13], and risk analysis [14, 15].
MBD is a fundamentally iterative process in which the designer continuously modifies the soft-
ware to tune performance or increase safety margins, or change plant models to perform design
space exploration [16], e.g., using model abstraction or simplification [17–19], or to incorporate
new data [20]. Any change to the system model, however, requires repeating the property-based
analyses as many times as the number of system properties. The fundamental problem that we
consider in this paper is that of conformance [21–25]. The notion of conformance is defined w.r.t.
the input-output behavior of a model. Typically, model inputs include exogenous disturbances or
user-inputs to the model, user-controllable design parameters, and initial operating conditions. For
a given input u, let y = S(u) denote the observable behavior of the model S. Furthermore, let
d(y1 , y2 ) be a metric defined over the space of the model behaviors. For deterministic models, two
models S1 , S2 are said to be δ-conformant if for all inputs u it holds that d(y1 , y2 ) < δ where
y1 = S1 (u) and y2 = S2 (u) [22, 23, 25]. This notion of deterministic conformance is useful to reason
about worst-case differences between models. However, most CPS applications use components that
exhibit stochastic behavior; for example, sensors have measurement noise, actuators can have man-
ufacturing variations, and most physical phenomena are inherently stochastic. The central question
that this paper considers is: What is the notion of conformance between two stochastic CPS models?
There are some challenges in comparing stochastic CPS models; even if two models are repeat-
edly excited by the same input, the pair of model behaviors that are observed may be different for
every such simulation. Thus, the observable behavior of a stochastic model is more accurately char-
acterized by a distribution over the space of trajectories. A possible way to compare two stochastic
models is to use measure-theoretic techniques to compare the distance between the trajectory dis-
tributions. A number of divergence measures such as the f-divergences, e.g., the Kullback-Leibler
divergence and the total variation distance, or the Wasserstein distance may look like candidate
tools to compare the trajectory distributions. However, we argue in this paper that a divergence is
not the right notion to use to compare stochastic CPS models. There can be two stochastic models
whose output trajectories are very close using any trajectory space metric, but the divergence be-
tween their trajectory distributions can be infinite. On the other hand, there can be two trajectory
distributions with zero divergence for which the distance between trajectories can be arbitrarily large.
This raises an interesting question: how do we then compare two stochastic models? In this
paper, we argue that probabilistic bounds derived from the distribution of the distances between
model trajectories (excited by the same input) give us a general definition of conformance that
has several advantages, as outlined below. We complement this probabilistic viewpoint further and
capture the risk that the distance between model trajectories is large by leveraging
risk measures [26].
First, we show that two stochastic systems that are conformant under our definition inherit the
property of transference [22]. In simple terms, transference is the property that if the first model has
certain logical or quantitative properties, then the second model also satisfies the same (or nearly
the same) properties. This property brings several benefits. Consider the scenario where probabilistic
guarantees that a model has certain quantitative properties have been established after a large
number of simulations. Ordinarily, if there were any changes made to the model,
establishing such probabilistic guarantees would require repeating the extensive simulation-based
procedure. However, transference allows us to potentially sample from existing simulations for the
first model and sample a small number of simulations from the modified model to establish stochas-
tic conformance between the models, thereby allowing us to establish probabilistic guarantees on the
second model. We demonstrate examples of such transference w.r.t. quantitative properties arising
from quantitative semantics of temporal logic specifications and control-theoretic cost functions.
Next, we show how we can efficiently compute these probabilistic bounds using the notion of
conformal prediction [27, 28] from statistical learning theory. At a high-level, conformal prediction
involves computing quantiles of the empirical distribution of non-conformity scores over a validation
dataset to obtain prediction intervals at a given confidence threshold.
The contributions of this paper are summarized as follows:
• We define stochastic conformance as a probabilistic bound over the distribution of distances
between model trajectories. We also define the non-conformance risk to detect systems that
are at risk of not being conformant.
• We show that both notions have the desirable transference property, meaning that conformant
systems satisfy similar system specifications.
• We show how stochastic conformance and the non-conformance risk can be estimated using
statistical tools from risk theory and conformal prediction.
We first equip the set of output signals Y with a function d : Y × Y → R that quantifies distance
between signals. A natural choice of d is a signal metric that results in a metric space (Y, d). We
use general signal metrics such as the metric induced by the Lp signal norm for p ≥ 1. Particularly,
define $d_p(y_1, y_2) := \left(\int_T \|y_1(t) - y_2(t)\|^p \, dt\right)^{1/p}$ so that the L∞ norm can also be expressed as
$d_\infty(y_1, y_2) := \sup_{t \in T} \|y_1(t) - y_2(t)\|$.
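As a rough illustration of these signal metrics, the following sketch approximates d_p and d_∞ for uniformly sampled signals. The function names, the (T, n) array layout, the Euclidean pointwise norm, and the Riemann-sum approximation with step dt are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def _pointwise_gap(y1, y2):
    """Euclidean norm of y1(t) - y2(t) at every sample time."""
    diff = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    if diff.ndim == 1:                 # scalar-valued signal of shape (T,)
        diff = diff[:, None]
    return np.linalg.norm(diff, axis=1)

def d_p(y1, y2, dt, p=2.0):
    """Riemann-sum approximation of the L_p signal metric (assumes uniform step dt)."""
    gaps = _pointwise_gap(y1, y2)
    return float((np.sum(gaps ** p) * dt) ** (1.0 / p))

def d_inf(y1, y2):
    """Sup-norm signal metric over the sampled time points."""
    return float(np.max(_pointwise_gap(y1, y2)))

# Two constant signals that are 0.5 apart on a time domain of length 1.
y_a = np.zeros(4)
y_b = np.full(4, 0.5)
dist_sup = d_inf(y_a, y_b)
dist_l1 = d_p(y_a, y_b, dt=0.25, p=1.0)
dist_l2 = d_p(y_a, y_b, dt=0.25, p=2.0)
```

For constant signals on a unit-length domain, all three distances coincide with the pointwise gap, which is a quick sanity check of the discretization.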
It is now easy to see that a signal metric d(Y1 , Y2 ) evaluated over the stochastic signals Y1
and Y2 results in a distribution over distances between realizations of Y1 and Y2 . To reason over
properties of d(Y1 , Y2 ), we will use probabilistic reasoning but we will also consider risk measures [26]
as introduced next.
A risk measure is a function R : F(Ω, R) → R that maps the set of real-valued random variables
F(Ω, R) to the real numbers. Typically, the input of R indicates a cost. There exist various risk
measures that capture different characteristics of the distribution of the cost random variable, such
as the mean or the variance. However, we are particularly interested in tail risk measures that
capture the right tail of the cost distribution, i.e., the potentially rare but costly outcomes.
In this paper, we particularly consider the value-at-risk VaRβ and the conditional value-at-risk
CVaRβ at risk level β ∈ (0, 1). The VaRβ of a random variable Z : Ω → R is defined as
$VaR_\beta(Z) := \inf\{\alpha \in \mathbb{R} \mid Prob(Z \le \alpha) \ge \beta\},$
i.e., VaRβ(Z) captures the 1 − β quantile of the distribution of Z from the right. Note that there is an
obvious connection between value-at-risk and chance constraints, i.e., it holds that Prob(Z ≤ α) ≥ β
is equivalent to VaRβ(Z) ≤ α. The CVaRβ of Z, on the other hand, is defined as
$CVaR_\beta(Z) := \inf_{\alpha \in \mathbb{R}} \left(\alpha + (1 - \beta)^{-1} E([Z - \alpha]^+)\right),$
where [Z − α]+ := max(Z − α, 0) and E(·) indicates the expected value. When the function
Prob(Z ≤ α) is continuous (in α), it holds that CVaRβ(Z) = E(Z | Z ≥ VaRβ(Z)), i.e., CVaRβ(Z)
is the expected value of Z conditioned on the outcomes where Z is greater than or equal to VaRβ(Z).
Finally, note that it holds that VaRβ(Z) ≤ CVaRβ(Z), i.e., CVaRβ is more risk sensitive.
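To make the two tail risk measures concrete, the sketch below evaluates them on the empirical distribution of a finite sample. The function names are hypothetical, and the CVaR is computed by minimizing the infimum formulation over the sample points, which suffices because the objective is piecewise linear and convex in α.

```python
import math
import numpy as np

def empirical_var(samples, beta):
    """Smallest alpha whose empirical CDF value is at least beta."""
    z = np.sort(np.asarray(samples, dtype=float))
    # The small offset guards against round-off when beta * n is an integer.
    k = max(math.ceil(beta * len(z) - 1e-12), 1)
    return float(z[k - 1])

def empirical_cvar(samples, beta):
    """min over alpha of alpha + mean([z - alpha]^+) / (1 - beta), searched over sample points."""
    z = np.sort(np.asarray(samples, dtype=float))
    # excess[i] = sum_j max(z[j] - z[i], 0) for the candidate alpha = z[i]
    excess = np.maximum(z[None, :] - z[:, None], 0.0).sum(axis=1)
    return float(np.min(z + excess / (len(z) * (1.0 - beta))))

costs = np.arange(1, 11)             # the costs 1, 2, ..., 10
var_half = empirical_var(costs, 0.5)
cvar_half = empirical_cvar(costs, 0.5)
```

In this toy sample the CVaR at level 0.5 equals the mean of the five largest costs, and it is never smaller than the corresponding VaR.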
For our risk transference results that we present later, we will require that R is monotone,
positive homogeneous, and subadditive:
• For two random variables Z, Z ′ ∈ F(Ω, R), the risk measure R is monotone if Z(ω) ≤ Z ′ (ω) for
all ω ∈ Ω implies that R(Z) ≤ R(Z ′ ).
• For a random variable Z ∈ F(Ω, R), the risk measure R is positive homogeneous if, for any
constant H ≥ 0, it holds that R(HZ) = HR(Z).
• For two random variables Z, Z ′ ∈ F(Ω, R), the risk measure R is subadditive if R(Z + Z ′ ) ≤
R(Z) + R(Z ′ ).
We remark that the VaRβ and the CVaRβ satisfy all three properties [26].
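The syntax of STL is recursively defined as
$\phi ::= \top \mid \mu \mid \neg\phi' \mid \phi' \wedge \phi'' \mid \phi' \, U_I \, \phi''$
where µ is a predicate and ϕ′ and ϕ′′ are STL formulas. The Boolean operators ¬ and ∧ encode negations (“not”) and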
conjunctions (“and”), respectively. The until operator ϕ′ UI ϕ′′ encodes that ϕ′ has to be true until
ϕ′′ becomes true at some future time within the time interval I ⊆ R≥0 . We derive the operators for
disjunction (ϕ′ ∨ ϕ′′ := ¬(¬ϕ′ ∧ ¬ϕ′′ )), eventually (FI ϕ := ⊤UI ϕ), and always (GI ϕ := ¬FI ¬ϕ).
To determine if a signal y satisfies an STL formula ϕ that is imposed at time t, we can define
the semantics as a relation |=, i.e., (y, t) |= ϕ means that ϕ is satisfied. While the STL semantics
are fairly standard [29], we recall them in Appendix A. Additionally, we can define robust semantics
ρϕ (y, t) ∈ R that indicate how robustly the formula ϕ is satisfied or violated [30,31], see Appendix A.
Larger and positive values of ρϕ (y, t) hence indicate that the specification is satisfied more robustly.
Importantly, it holds that (y, t) |= ϕ if ρϕ (y, t) > 0.
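The robust semantics can be illustrated on a sampled scalar signal. The sketch below is a simplification under stated assumptions: time is discrete with intervals given as index offsets, the predicate is of the form y ≥ c so that its robustness is y(t) − c, and the helper names are hypothetical. It covers only the derived operators "always" and "eventually".

```python
import numpy as np

def rho_predicate(y, c):
    """Robustness of the predicate y >= c at every time step."""
    return np.asarray(y, dtype=float) - c

def rho_always(rho, t, lo, hi):
    """Robustness of G_[lo,hi] at time t: worst case over the window."""
    return float(np.min(rho[t + lo : t + hi + 1]))

def rho_eventually(rho, t, lo, hi):
    """Robustness of F_[lo,hi] at time t: best case over the window."""
    return float(np.max(rho[t + lo : t + hi + 1]))

y = [1.0, 2.0, 3.0, 0.5]
always_nonneg = rho_always(rho_predicate(y, 0.0), 0, 0, 3)       # satisfied with margin
reach_high = rho_eventually(rho_predicate(y, 2.5), 0, 0, 3)      # satisfied with margin
always_above_one = rho_always(rho_predicate(y, 1.0), 0, 0, 3)    # violated
```

A positive value indicates satisfaction with the given margin, while the negative value of the last formula reflects the violation at the final sample.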
Prob(d(Y1 , Y2 ) ≤ ϵ) ≥ 1 − δ. (2)
Additionally, let R : F(Ω, R) → R be a risk measure and r ∈ R be a risk threshold. Then, S1 and S2
under the input U are at risk of being r-non-conformant if
$R(d(Y_1, Y_2)) > r.$ (3)
Eq. (2) is referred to as stochastic conformance and Eq. (3) as non-conformance risk. Let us
now motivate and discuss these two definitions. While the definition of conformance in equation
(2) appears natural at first sight, there are at least two competing ways of defining stochastic
conformance. First, as Y1 and Y2 are distributions, it would be possible to define conformance as
D(Y1 , Y2 ) where D is a distance function that measures the difference between two distributions, such
as a divergence (Kullback–Leibler or f -divergence). However, our definition provides an intuitive
interpretation in the signal space where system specifications are typically defined, while it is usually
difficult to provide such an interpretation for divergences between distributions. Additionally, the
divergence between Y1 and Y2 may be unbounded (or zero) even when equation (2) holds (does not
hold), as the following proposition shows.
Proposition 1. There exist stochastic systems S1 and S2 and distance metrics d where equation (2):
i) is satisfied for ϵ > 0 and δ = 0, i.e., w.p. 1, but where the divergence between the systems is
unbounded, and ii) is not satisfied for any given ϵ > 0 and δ ∈ (0, 1), but where the divergence
between the systems is zero.
Proof. Let us first prove i). For simplicity, consider systems S1 and S2 where the stochastic input
and output signals are defined over the time domain T := {t0 , . . . , tT }. Further, for all t ∈ T let
y1 (t) := 0 and y2 (t) := ϵ. Clearly, equation (2) is satisfied, e.g., for d∞ . The distributions D1 and
D2 (joint distributions of Y1 (t) and Y2 (t), respectively) are Dirac distributions centered at 0 and ϵ,
respectively. The Kullback–Leibler divergence between these two distributions is ∞ [32].
Let us now prove ii). Let T consist of a single time point for simplicity so that Y1 and Y2 are
random variables defined over a sample space R. Let D1 and D2 be the same uniform distribution
over [0, a], and let Y1 and Y2 be independent. Clearly, the divergence between D1 and D2 is zero.
We know that the distribution of Y := Y1 − Y2 has support on [−a, a], and that the probability
density function of Y is $p(y) := 1/a - |y|/a^2$. For ϵ ≤ a, we can now compute
$Prob(|Y| \le \epsilon) = 2\epsilon/a - \epsilon^2/a^2$. Given ϵ > 0 and δ ∈ (0, 1), we
pick a δ̄ ∈ (0, 1) such that δ̄ > δ. We then solve the quadratic equation $2\epsilon/a - \epsilon^2/a^2 = 1 - \bar{\delta}$
subject to the constraint that ϵ ≤ a. Consequently, we find that $a \ge \epsilon (1 + \sqrt{\bar{\delta}})/(1 - \bar{\delta})$ results in
$Prob(|Y| \le \epsilon) \le 1 - \bar{\delta} < 1 - \delta$ so that (2) is not satisfied.
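The second part of the argument can be checked numerically. The following sketch, with hypothetical function names and example values, evaluates the closed-form probability for two independent uniform variables on [0, a] and the support width obtained from the root 1 − √δ̄ of the quadratic equation.

```python
import math

def prob_within(eps, a):
    """Prob(|Y1 - Y2| <= eps) for independent Y1, Y2 uniform on [0, a]."""
    x = min(eps / a, 1.0)          # the probability saturates at 1 once eps >= a
    return 2.0 * x - x * x

def support_width(eps, delta_bar):
    """Width a at which Prob(|Y1 - Y2| <= eps) equals 1 - delta_bar."""
    return eps * (1.0 + math.sqrt(delta_bar)) / (1.0 - delta_bar)

eps, delta, delta_bar = 0.1, 0.05, 0.1      # example values with delta_bar > delta
a_min = support_width(eps, delta_bar)
p_at_a_min = prob_within(eps, a_min)
p_wider = prob_within(eps, 2.0 * a_min)
```

At the computed width the probability equals 1 − δ̄, and widening the support only lowers it further, so the conformance bound with failure probability δ cannot hold even though the two distributions are identical.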
Another way of defining stochastic conformance was presented in [33] where the authors consider
a task-specific definition of stochastic conformance where satisfaction probabilities are required
to be approximately equal. In other words, two stochastic systems are called c-approximately
probabilistically conformant if |Prob((Y1 , τ ) |= ϕ) − Prob((Y2 , τ ) |= ϕ)| ≤ c. In this definition, it
may happen that two systems are c-approximately probabilistically conformant for a small value of
c, while the systems produce completely different behaviors and individual realizations y1 and y2
are vastly different. In addition to not being task specific, our definition covers the risk of being
r-non-conformant in equation (3).
Finally, we would like to remark that the definition of conformance in equation (2) is related
to the definition of non-conformance risk in equation (3). In fact, when the risk measure R is the
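value-at-risk VaRβ, then we know that $VaR_\beta(d(Y_1, Y_2)) \le r$ is equivalent to
$Prob(d(Y_1, Y_2) \le r) \ge \beta$, i.e., to equation (2) with ϵ := r and δ := 1 − β.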
Additionally, let R : F(Ω, R) → R be a risk measure and r ∈ R be a risk threshold. Then, we say
that the systems S1 and S2 under the input U are at risk of being r-non-conformant if
$R\left(\sup_{U \in \mathcal{U}} d(Y_1, Y_2)\right) > r.$ (5)
Based on this definition, note that it will be inherently more difficult to verify Definition 2
compared to Definition 1 due to the sup-operator.
A specific example of the performance function C is the robust semantics ρϕ of an STL spec-
ification ϕ. In fact, the robust semantics ρϕ are Hölder continuous w.r.t. the sup-norm d∞ for
constants H = 1 and γ = 1 [34, Lemma 2]. For the convenience of the reader, we state the proof
of [34, Lemma 2] with the notation used in this paper in Appendix B. The robust semantics are also
Hölder continuous w.r.t. the Skorokhod metric, see Appendix C. A commonly used performance
function in control is $C(y) = \int_0^T y(t)^\top y(t) \, dt$, and we note that this choice of C is Hölder continuous
w.r.t. d1 as shown in Appendix D. Finally, note that the Hölder continuity condition in equation
(6) implies that, for any constants c, ϵ ∈ R, it holds that
Theorem 1. Let the premises in Definitions 1 and 3 hold. Further, let the systems S1 and S2 under
the input U be (ϵ, δ)-conformant so that equation (2) holds and let C be Hölder continuous w.r.t. d
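so that equation (6) holds. Then, for any c ∈ R and δ̄ ∈ (0, 1), it holds that
$Prob(C(Y_1) \ge c) \ge 1 - \bar{\delta} \implies Prob(C(Y_2) \ge c - H\epsilon^\gamma) \ge 1 - \delta - \bar{\delta}.$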
From here, we can simply see that
Since C is Hölder continuous w.r.t. d, which implies that equation (7) holds, it is easy to conclude
that Prob(C(Y2 ) ≥ c − Hϵγ ) ≥ 1 − δ − δ̄.
Theorem 1 tells us that i) (ϵ, δ)-conformance of systems S1 and S2 under U , and ii) Hölder
continuity of the performance function C w.r.t. the metric d enables us to derive a probabilistic
lower bound for the performance of system S2 w.r.t. C from the performance of system S1 .
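As a small worked instance of this transference argument, the sketch below computes the lower bound c − Hϵ^γ and the associated confidence 1 − δ − δ̄. The function name is hypothetical; the numbers are taken from the last row of Table 2 (c1 = 0.30, ϵ = 0.71, c2 = 0.11, δ = δ̄ = 0.05) with H = γ = 1 for the robust semantics under d∞.

```python
def transferred_bound(c, eps, delta, delta_bar, H=1.0, gamma=1.0):
    """Lower bound on C(Y2) and its confidence, given a bound c on C(Y1)."""
    return c - H * eps ** gamma, 1.0 - delta - delta_bar

# Values reported in Table 2 for |Dcal| = 3K and delta = delta_bar = 0.05.
bound, confidence = transferred_bound(c=0.30, eps=0.71, delta=0.05, delta_bar=0.05)
c2_reported = 0.11
bound_is_consistent = c2_reported >= bound
```

The transferred bound is negative here, so it does not certify satisfaction of the specification by the second system, but it is consistent with the directly calibrated value c2.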
We can derive a transference result similar to Theorem 1 when we assume that the systems S1
and S2 are (ϵ, δ)-conformant in the sense of Definition 2 instead of Definition 1.
Theorem 2. Let the premises in Definitions 2 and 3 hold. Further, let the systems S1 and S2 be
(ϵ, δ)-conformant so that equation (4) holds and let C be Hölder continuous w.r.t. d so that equation
(6) holds. Then, it holds that
Since C is Hölder continuous w.r.t. d, we know that equation (6) holds for each U ∈ U. Conse-
quently, we can conclude that Prob(inf U ∈U C(Y2 ) ≥ c − Hϵγ ) ≥ 1 − δ − δ̄ .
Theorem 3. Let the premises in Definitions 1 and 3 hold. Further, let the systems S1 and S2
under the input U not be at risk of being r-non-conformant so that equation (3) does not hold (i.e.,
R(d(Y1 , Y2 )) ≤ r) and let C be Hölder continuous w.r.t. d with γ = 1 so that equation (6) holds. If
the risk measure R is monotone, positive homogeneous, and subadditive, it holds that
$R(-C(Y_2)) \le R(-C(Y_1)) + Hr.$
Proof. We can derive the following chain of inequalities
$R(-C(Y_2)) \overset{(a)}{\le} R(-C(Y_1) + H d(Y_1, Y_2))$
$\overset{(b)}{\le} R(-C(Y_1)) + R(H d(Y_1, Y_2))$
$\overset{(c)}{=} R(-C(Y_1)) + H R(d(Y_1, Y_2))$
$\overset{(d)}{\le} R(-C(Y_1)) + Hr$
where (a) follows since C is Hölder continuous w.r.t. d and since R is monotone, (b) follows since
R is subadditive, and (c) follows since R is positive homogeneous, while the inequality (d) follows
since S1 and S2 under U are not at risk of being r-non-conformant, i.e., R(d(Y1, Y2)) ≤ r.
This result implies that the risk of system S2 w.r.t. the performance function C is upper bounded
by the risk of system S1 w.r.t. C if the systems S1 and S2 are not at risk of being r-non-conformant.
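The risk transference bound can be checked on synthetic data. The sketch below is an illustration under assumptions: the two "systems" are hypothetical (Gaussian signals and a perturbed copy), the performance function is the minimum over time (the robustness of "always y ≥ 0", which is 1-Lipschitz w.r.t. d∞, so H = 1), and the risk measure is the empirical CVaR, with r taken as the empirical CVaR of the distances.

```python
import numpy as np

def empirical_cvar(samples, beta):
    """Empirical CVaR via the infimum formulation, searched over sample points."""
    z = np.sort(np.asarray(samples, dtype=float))
    excess = np.maximum(z[None, :] - z[:, None], 0.0).sum(axis=1)
    return float(np.min(z + excess / (len(z) * (1.0 - beta))))

rng = np.random.default_rng(0)
y1 = rng.normal(1.0, 0.3, size=(500, 20))          # 500 realizations, 20 time steps
y2 = y1 + rng.normal(0.0, 0.1, size=(500, 20))     # perturbed copy sharing the same draws

perf1 = y1.min(axis=1)                             # C(Y1): robustness of "always y >= 0"
perf2 = y2.min(axis=1)                             # C(Y2)
dist = np.abs(y1 - y2).max(axis=1)                 # d_inf(Y1, Y2) per realization

beta, H = 0.9, 1.0
risk_system2 = empirical_cvar(-perf2, beta)
risk_bound = empirical_cvar(-perf1, beta) + H * empirical_cvar(dist, beta)
```

Because the empirical CVaR is monotone, positive homogeneous, and subadditive on the paired samples, `risk_system2` does not exceed `risk_bound` for any draw, not only for this seed.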
We remark that a similar result appeared in our prior work [34]. Here, we present these results
in the more general context of conformance and extend the result as we use general performance
functions C, which additionally requires R to be positive homogeneous. Additionally, we derive a
transference result similar to Theorem 3 when we assume that the systems S1 and S2 are not at
risk of being r-non-conformant in the sense of Definition 2 instead of Definition 1.
Theorem 4. Let the premises in Definitions 2 and 3 hold. Further, let the systems S1 and S2 not be
at risk of being r-non-conformant so that equation (5) does not hold (i.e., R(supU ∈U d(Y1 , Y2 )) ≤ r)
and let C be Hölder continuous w.r.t. d with γ = 1 so that equation (6) holds. If the risk measure
R is monotone, positive homogeneous, and subadditive, it holds that
where (a) follows since − inf U ∈U C(Y2 ) = supU ∈U −C(Y2 ), since C is Hölder continuous w.r.t. d,
and since R is monotone, (b) follows since R is subadditive, and (c) follows since R is positive
homogeneous. The inequality (d) follows since S1 and S2 are not at risk of being r-non-conformant,
i.e., R(supU∈U d(Y1, Y2)) ≤ r.
5 Statistical Estimation of Stochastic Conformance
We propose algorithms to compute stochastic conformance and the non-conformance risk. In prac-
tice, note that one will be limited to discrete-time stochastic systems to apply these algorithms.
Theorem 5. Let the premises of Definition 1 hold and Dcal be a calibration dataset with datapoints
$(y_1^{(i)}, y_2^{(i)})$ drawn from D1 × D2. Further, define $Z^{(i)} := d(y_1^{(i)}, y_2^{(i)})$ for all i ∈ {1, . . . , |Dcal|} and
$Z^{(|D_{cal}|+1)} := \infty$, and assume that the $Z^{(i)}$ are sorted in non-decreasing order. Then, it holds that
$Prob(d(Y_1, Y_2) \le \bar{Z}) \ge 1 - \delta$ with Z̄ defined as $\bar{Z} := Z^{(p)}$ where $p := \lceil (|D_{cal}| + 1)(1 - \delta) \rceil$. Thus,
the systems S1 and S2 under the input U are (ϵ, δ)-conformant if Z̄ ≤ ϵ.
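The quantile computation in Theorem 5 amounts to picking an order statistic of the calibration scores. A minimal sketch, with a hypothetical function name and toy scores, is given below; it returns infinity when the calibration set is too small for the requested failure probability, matching the convention that the score with index |Dcal| + 1 is infinite.

```python
import math
import numpy as np

def conformal_bound(scores, delta):
    """Order statistic Z^(p) with p = ceil((n + 1)(1 - delta)); infinity if p > n."""
    z = np.sort(np.asarray(scores, dtype=float))
    n = len(z)
    p = math.ceil((n + 1) * (1.0 - delta))
    return float(z[p - 1]) if p <= n else math.inf

# Seven toy calibration distances: p = ceil(8 * 0.75) = 6, i.e., the sixth smallest score.
z_bar = conformal_bound([7, 1, 5, 3, 2, 6, 4], delta=0.25)

# With only two scores, p = ceil(3 * 0.75) = 3 exceeds n and the bound is trivial.
z_bar_small = conformal_bound([0.2, 0.4], delta=0.25)
```

In the setting of Theorem 5, `scores` would hold the distances d(y1, y2) between paired realizations in the calibration set.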
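Algorithm 1 Conformance Estimation as per Definition 2
Input: Failure probability δ ∈ (0, 1) and grid size κ > 0
Output: Z̄ such that $Prob(\sup_{U \in \mathcal{U}} d(Y_1, Y_2) \le \bar{Z} + L\kappa) \ge 1 - \delta$
1: Construct κ-net $\bar{\mathcal{U}}$ of $\mathcal{U}$
2: for $\bar{U} \in \bar{\mathcal{U}}$ do
3: Obtain calibration set $D_{cal}^{\bar{U}}$ consisting of realizations $(y_1^{(i)}, y_2^{(i)})$ under $\bar{U}$
4: Compute $\bar{Z}_{\bar{U}} := Z^{(p)}$ by applying Theorem 5 but instead using dataset $D_{cal}^{\bar{U}}$
5: Set $\bar{Z} := \max_{\bar{U} \in \bar{\mathcal{U}}} \bar{Z}_{\bar{U}}$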
Ū (line 3). We then compute Z̄Ū so that Prob(d(Y1 (Ū , ·), Y2 (Ū , ·)) ≤ Z̄Ū ) ≥ 1 − δ (line 4). Finally,
we set Z̄ := maxŪ ∈Ū Z̄Ū (line 5).
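The loop of Algorithm 1 can be sketched as follows. This is an illustration under assumptions rather than the implementation used in the paper: the input is a scalar in [−1, 1], the κ-net is a uniform grid, and `toy_distance` is a hypothetical stand-in for simulating both systems under an input and measuring the distance between their realizations.

```python
import math
import numpy as np

def conformal_bound(scores, delta):
    """Order statistic Z^(p) with p = ceil((n + 1)(1 - delta)); infinity if p > n."""
    z = np.sort(np.asarray(scores, dtype=float))
    p = math.ceil((len(z) + 1) * (1.0 - delta))
    return float(z[p - 1]) if p <= len(z) else math.inf

def estimate_conformance(net, sample_distance, n_cal, delta, rng):
    """Largest per-input conformal bound over all points of the kappa-net."""
    z_bar = -math.inf
    for u_bar in net:                                                  # line 2
        scores = [sample_distance(u_bar, rng) for _ in range(n_cal)]   # line 3
        z_bar = max(z_bar, conformal_bound(scores, delta))             # lines 4 and 5
    return z_bar

def toy_distance(u, rng):
    """Hypothetical distance between realizations: grows with |u| plus bounded noise."""
    return abs(u) + rng.uniform(0.0, 0.1)

rng = np.random.default_rng(0)
net = np.linspace(-1.0, 1.0, 5)       # spacing 0.5, hence a kappa-net of [-1, 1] with kappa = 0.25
z_bar = estimate_conformance(net, toy_distance, n_cal=99, delta=0.05, rng=rng)
lipschitz, kappa = 1.0, 0.25          # |u| is 1-Lipschitz in u for the toy distance
bound = z_bar + lipschitz * kappa
```

The final value `bound` plays the role of Z̄ + Lκ; in practice the Lipschitz constant is either known or estimated separately, as in Algorithm 2.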
In Algorithm 2, we compute L̄ such that Prob(L ≤ L̄) ≥ 1 − δL . We uniformly sample control
inputs (U ′ , U ′′ ) (line 2), obtain realizations (y1′ , y2′ ) from D1 × D2 under U ′ and realizations (y1′′ , y2′′ )
from D1 × D2 under U ′′ (line 3), and compute the non-conformity score L(i) (line 4). In line 5, we
obtain an estimate L̄ of the Lipschitz constant L that holds with a probability of 1 − δL over the
randomness introduced in Algorithm 1.
Theorem 6. Let the premises of Definition 2 hold. If the Lipschitz constant L of d(Y1 (·, ω), Y2 (·, ω))
is known uniformly over ω ∈ Ω, then, for a gridding parameter κ > 0, the output Z̄ of Algorithm 1
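ensures that
$Prob\left(\sup_{U \in \mathcal{U}} d(Y_1, Y_2) \le \bar{Z} + L\kappa\right) \ge 1 - \delta.$
Thus, the systems S1 and S2 are (ϵ, δ)-conformant if Z̄ + Lκ ≤ ϵ. Otherwise, let δL ∈ (0, 1) be a
failure probability, then the output L̄ of Algorithm 2 ensures that
$Prob\left(\sup_{U \in \mathcal{U}} d(Y_1, Y_2) \le \bar{Z} + \bar{L}\kappa\right) \ge 1 - \delta - \delta_L.$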
Proof. From line 4 of Algorithm 1 we know that Prob(d(Y1 (Ū , ·), Y2 (Ū , ·)) ≤ Z̄Ū ) ≥ 1 − δ for
each Ū ∈ U. Due to Lipschitz continuity, we can conclude that for each U ∈ U that is such that
¯ Ū ) ≤ κ it holds that
Since Ū is a κ-net of U, it follows that Prob(supU ∈U d(Y1 , Y2 ) ≤ Z̄ + Lκ) ≥ 1 − δ.
For the second part of the proof, note that from line 5 of Algorithm 2 we know that Prob(L ≤
L̄) ≥ 1 − δL . We can now union bound over this event and Prob(d(Y1 (Ū , ·), Y2 (Ū , ·)) ≤ Z̄Ū ) ≥ 1 − δ
so that
Proposition 2. Let the premises of Definition 1 hold and Dcal be a calibration dataset with datapoints
$(y_1^{(i)}, y_2^{(i)})$ drawn from D1 × D2. Let β ∈ (0, 1) be a risk level and γ ∈ (0, 1) be a failure
threshold. Define $Z^{(i)} := d(y_1^{(i)}, y_2^{(i)})$ for each i ∈ {1, . . . , |Dcal|} and assume that Prob(Z ≤ α) is
continuous in α. Then,
$Prob\left(\underline{VaR}_\beta \le VaR_\beta(d(Y_1, Y_2)) \le \overline{VaR}_\beta\right) \ge 1 - \gamma,$
where we have
$\overline{VaR}_\beta := \inf\left\{\alpha \in \mathbb{R} \,\middle|\, \widehat{Prob}(Z \le \alpha) - \sqrt{\frac{\ln(2/\gamma)}{2|D_{cal}|}} \ge \beta\right\}$
and
$\underline{VaR}_\beta := \inf\left\{\alpha \in \mathbb{R} \,\middle|\, \widehat{Prob}(Z \le \alpha) + \sqrt{\frac{\ln(2/\gamma)}{2|D_{cal}|}} \ge \beta\right\}$
with the empirical cumulative distribution function
$\widehat{Prob}(Z \le \alpha) := \frac{1}{|D_{cal}|} \sum_{i=1}^{|D_{cal}|} I(Z^{(i)} \le \alpha)$ and the indicator function I.
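The bounds of Proposition 2 can be computed by shifting the target level of the empirical cumulative distribution function. The sketch below uses a hypothetical function name and toy scores; the convention of returning −∞ or +∞ when the shifted level leaves (0, 1] is an assumption of this example.

```python
import math
import numpy as np

def var_bounds(scores, beta, gamma):
    """Lower and upper VaR bounds obtained by shifting the empirical CDF level."""
    z = np.sort(np.asarray(scores, dtype=float))
    n = len(z)
    slack = math.sqrt(math.log(2.0 / gamma) / (2.0 * n))
    ecdf = np.arange(1, n + 1) / n          # empirical CDF reached at z[0], ..., z[n-1]

    def smallest_alpha(level):
        # Smallest alpha whose empirical CDF value is at least `level`.
        if level <= 0.0:
            return -math.inf
        idx = int(np.searchsorted(ecdf, level, side="left"))
        return float(z[idx]) if idx < n else math.inf

    return smallest_alpha(beta - slack), smallest_alpha(beta + slack)

# Toy scores 1, ..., 1000: the slack is about 0.043 for gamma = 0.05.
lower, upper = var_bounds(np.arange(1, 1001), beta=0.9, gamma=0.05)
```

For these scores the empirical VaR at level 0.9 is 900, and the interval returned by `var_bounds` brackets it.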
For estimating CV aRβ (Z), we assume that the random variable d(Y1 , Y2 ) has bounded support,
i.e., that Prob(d(Y1 , Y2 ) ∈ [a, b]) = 1. Note that d(Y1 , Y2 ) is usually bounded from below by a := 0
if d is a metric. To obtain an upper bound, we assume that the distance function saturates at b,
e.g., by clipping values larger than b to b. In practice, this means that realizations that are far apart
already have a large distance and are capped to b.
Proposition 3. Let the premises of Definition 1 hold and Dcal be a calibration dataset with datapoints
$(y_1^{(i)}, y_2^{(i)})$ drawn from D1 × D2. Let β ∈ (0, 1) be a risk level and γ ∈ (0, 1) be a failure
threshold. Define $Z^{(i)} := d(y_1^{(i)}, y_2^{(i)})$ for each i ∈ {1, . . . , |Dcal|} and assume that Prob(d(Y1, Y2) ∈
[a, b]) = 1. Then, it holds that
$Prob\left(\underline{CVaR}_\beta \le CVaR_\beta(d(Y_1, Y_2)) \le \overline{CVaR}_\beta\right) \ge 1 - \gamma,$
where
$\overline{CVaR}_\beta := \widehat{CVaR}_\beta + \sqrt{\frac{5 \ln(3/\gamma)}{|D_{cal}|(1 - \beta)}} \, (b - a)$
and
$\underline{CVaR}_\beta := \widehat{CVaR}_\beta - \sqrt{\frac{11 \ln(3/\gamma)}{|D_{cal}|(1 - \beta)}} \, (b - a),$
where the empirical estimate of CVaRβ(Z) is
$\widehat{CVaR}_\beta := \inf_{\alpha \in \mathbb{R}} \left(\alpha + (|D_{cal}|(1 - \beta))^{-1} \sum_{i=1}^{|D_{cal}|} [Z^{(i)} - \alpha]^+\right).$
As a consequence of these two propositions, we know that with a probability of 1 − γ the systems S1
and S2 under the input U are at risk of not being conformant if V aRβ ≥ α or CV aRβ ≥ α based
on the risk measure of choice.
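The CVaR bounds of Proposition 3 can be sketched in the same way. The function names and toy scores below are hypothetical, and the scores are clipped to the assumed support [a, b] before the empirical estimate is formed, mirroring the saturation of the distance function described above.

```python
import math
import numpy as np

def empirical_cvar(samples, beta):
    """Empirical CVaR via the infimum formulation, searched over sample points."""
    z = np.sort(np.asarray(samples, dtype=float))
    excess = np.maximum(z[None, :] - z[:, None], 0.0).sum(axis=1)
    return float(np.min(z + excess / (len(z) * (1.0 - beta))))

def cvar_bounds(scores, beta, gamma, a, b):
    """Lower bound, empirical estimate, and upper bound for CVaR on support [a, b]."""
    z = np.clip(np.asarray(scores, dtype=float), a, b)    # saturate the distances at b
    scale = math.log(3.0 / gamma) / (len(z) * (1.0 - beta))
    estimate = empirical_cvar(z, beta)
    upper = estimate + math.sqrt(5.0 * scale) * (b - a)
    lower = estimate - math.sqrt(11.0 * scale) * (b - a)
    return lower, estimate, upper

# Toy scores 0.0, 0.1, ..., 0.9 on the support [0, 1].
lower, estimate, upper = cvar_bounds(np.arange(10) / 10, beta=0.5, gamma=0.05, a=0.0, b=1.0)
```

With only ten samples the interval is very wide, which illustrates why the case studies rely on calibration sets with hundreds or thousands of realizations.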
(a) Targets in Dubin’s car. (b) CARLA: cross-track error signals for S1, S2. (c) F16: altitude signals for S1, S2. (d) Spacecraft trajectories.
Figure 1: The solid lines refer to Y1 and the dashed lines refer to Y2 ; in each of the displayed plots,
the initial condition for each pair of realizations is the same.
(a) Trajectory distance on validation set. (b) Robustness on validation set, controller 1. (c) Robustness on validation set, controller 2.
Figure 2: Distance and robustness histogram for Dubin’s car with δ = δ̄ = 0.05. We use CV aR(d)
to denote CV aR(d(Y1 , Y2 )). The c1 and ϵ are the values of conformal prediction on the calibration
set of ρϕdubin (Y1 ) and d∞ (Y1 , Y2 ).
6 Case Studies
We now demonstrate the practicality of stochastic conformance and risk analysis through various
case studies. For validation, suppose that we obtain the value Z̄ using a conformal prediction
procedure for a nonconformity score defined by the random variable Z, i.e., such that
Prob(Z ≤ Z̄) ≥ 1 − δ. Then, given a test set Dtest, the validation score is defined as
VS(Z) := |{z ∈ Dtest | z ≤ Z̄}|/|Dtest|, i.e., the fraction of test scores that do not exceed Z̄.
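The validation score is simply an empirical coverage rate. A minimal sketch with a hypothetical function name and toy test scores:

```python
import numpy as np

def validation_score(test_scores, z_bar):
    """Fraction of test scores that do not exceed the calibrated bound z_bar."""
    return float(np.mean(np.asarray(test_scores, dtype=float) <= z_bar))

# Three of the four toy test scores fall below the bound.
score = validation_score([0.1, 0.5, 0.9, 1.2], z_bar=1.0)
```

A validation score close to or above 1 − δ indicates that the calibrated bound generalizes to the test set.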
Distance Metric              |Dcal|   ϵ        VS(d(Y1, Y2))   VaR(d(Y1, Y2))   CVaR(d(Y1, Y2))
d∞                           50       0.7825   0.987           0.7183           0.7947
                             1000     0.7163   0.956           0.7148           0.7647
                             2000     0.7122   0.952           0.712            0.7814
                             3000     0.7118   0.952           0.7117           0.7862
dsk (Skorokhod Distance)     50       0.6723   0.953           0.6517           0.7181
                             1000     0.6722   0.972           0.6711           0.7156
                             2000     0.6645   0.96            0.6639           0.7106
                             3000     0.6619   0.952           0.6613           0.7079
d2                           50       2.6086   0.937           2.503            2.612
                             1000     2.7339   0.96            2.732            3.048
                             2000     2.7071   0.944           2.706            3.044
                             3000     2.7238   0.955           2.722            3.0929
Table 1: Effect of calibration set size on the validation score and risk measures. The size of the test
set, i.e., |Dtest |, is 1000. We use the conformal prediction procedure from Section 5 to obtain ϵ as
defined in Definition 1 for δ = 0.05.
The two systems that we compare have two different feedback controllers. The first feedback
controller uses the method from [43, 44] and the second controller uses the method from [45]. We
plot a set of sampled trajectories in Fig. 1a. This figure also shows the set of initial states I :=
[−1, 0] × [−1, 0]. The controller aims to ensure that the system trajectory stays within a series of
sets T1 through T50; the corresponding STL specification is $\phi_{dubin} := \bigwedge_{i=1}^{50} G_{[i-1,i]}([x_i \; y_i] \in T_i)$. For
the experiments that follow, we uniformly sampled initial states from I and noise η x , η y from the
described Gaussian distribution.
Effect of calibration set size. In the first experiment, we wish to benchmark the effect of the size of
the calibration set Dcal for various distance metrics. The results are shown in Table 1. The table
shows that with smaller sizes of the calibration set, we get a more conservative ϵ for d∞ (which
translates into a higher validation score). The V aR is almost identical to the value of ϵ at larger
Dcal sizes. We note that the CV aR values change with the value of V aR. A similar trend can be
observed for the Skorokhod distance and the L2-metric.
Empirical evaluation of transference. We empirically demonstrate that Theorem 1 holds. We use
C(Y ) = ρϕdubin (Y ), i.e., the robust semantics w.r.t. the property ϕdubin , and the L∞ signal metric
d∞. The results are shown in Table 2. We can see that the predicted lower bound for the robustness
of realizations of Y2 w.r.t. ϕdubin is negative (c1 − ϵ), so it is not possible to conclude that the second
system satisfies ϕdubin with probability greater than 1 − δ − δ̄. However, we note that c2 is indeed
greater than the bound (c1 − ϵ). Similarly, we show that Theorem 3 is also empirically validated by
computing the CV aR values for the first system and the risk measure on d∞ (Y1 , Y2 ). We show the
empirical distributions of d∞ (Y1 , Y2 ), and ρϕdubin (Yi ) for i = 1, 2 in Figure 2.
Empirical evaluation of Theorem 2. We next apply Algorithms 1 and 2 to this case study. We grid
the initial set of states evenly into 25 cells with a grid size of κ = 0.02. We sample 650 trajectories
on each cell to obtain their calibration sets. Algorithm 2 gives Z̄ = 0.7562 and Lκ = 0.0687, giving
Z̄ + Lκ = 0.8249. We then evaluate on two test sets of unseen initial conditions with |Dtest| = 1000
and |Dtest| = 2500. The success rates on the test sets are 0.9996 and 1.0, with the goal success rate
being 0.9. The experiments demonstrate the effectiveness of Theorem 6.

(δ, δ̄)                |Dcal|   c1     ϵ      VS(ρ1)   VS(d∞)   c2     Thm. 1 valid?   CVaR(d∞)   CVaR(−ρ1)   CVaR(−ρ2)   Thm. 3 valid?
δ = 0.2, δ̄ = 0.05     100      0.31   0.59   0.95     0.76     0.21   Y               0.90       -0.28       0.00        Y
                       3K       0.30   0.60   0.95     0.79     0.20   Y               0.93       -0.27       0.03        Y
δ = 0.1, δ̄ = 0.05     1K       0.30   0.67   0.96     0.92     0.15   Y               0.79       -0.27       0.02        Y
                       3K       0.30   0.66   0.95     0.91     0.15   Y               0.81       -0.27       0.03        Y
δ = 0.05, δ̄ = 0.05    2K       0.31   0.71   0.94     0.95     0.11   Y               0.78       -0.27       0.02        Y
                       3K       0.30   0.71   0.95     0.95     0.11   Y               0.79       -0.27       0.03        Y
Table 2: Empirical evaluation of transference. Let ρi be short-hand for ρϕdubin(Yi) for i = 1, 2,
and d∞ be short-hand for d∞(Y1, Y2). Using Theorem 5, we show Prob(ρ1 ≥ c1) > 1 − δ̄ and
Prob(d∞ ≤ ϵ) > 1 − δ. The validation scores for each guarantee on a test set Dtest with 1000 samples are
shown. The value c2 is obtained using Theorem 5 on ρ2, and we observe that it exceeds c1 − ϵ, validating
Theorem 1. Similarly, we report the CVaR values for −ρ1 and d∞, and observe that
CVaR(−ρ1) + CVaR(d∞) ≥ CVaR(−ρ2) in all cases, validating Theorem 3.

Table 3: Transference results for various case studies. We use δ = 0.05 and δ̄ = 0.05. As before, ρ1
is used as short-hand for ρϕ(Y1) for each spec, and d∞ is used as short-hand for d∞(Y1, Y2).
Case Study   Spec     |Dcal|   |Dtest|   δ      CVaR(d∞)   CVaR(−ρ1)   CVaR(−ρ2)
F-16         ϕgcas    1K       3K        0.01   200.3      -62.3       -62.3
CARLA        ϕquad    7K       3K        0.01   2.04       -0.31       0.88
Satellite    ϕsat     7K       3K        0.01   0.19       0.0         0.08
Table 4: Empirical validation of risk transference for all case studies. As before, ρi is short-hand
for ρϕ (Yi ), and d∞ is short-hand for d∞ (Y1 , Y2 ). Here, we set the risk level β = δ in each case.
7 Related Work
Conformance has found applications in cyber-physical system design [49, 50] as well as in drug
testing and other applications [51–53]. Our work is inspired by existing works for conformance
of deterministic systems, by which we mean systems that are non-stochastic; see [21, 54] for
surveys. The authors in [23–25] considered conformance testing between hybrid systems. To capture
distance between hybrid system trajectories that may exhibit discontinuities, signal metrics were
considered that simultaneously quantify distance in space and time, resembling notions of system
closeness in the hybrid systems literature [55, 56]. For instance, [23] proposes (T, J, (τ, ϵ))-closeness
where τ and ϵ capture both timing distortions and state value mismatches, respectively, and where
T and J quantify limits on the total time and number of discontinuities, respectively. A stronger
notion compared to (T, J, (τ, ϵ))-closeness was proposed in [22] by using the Skorokhod metric. The
benefit of [22] over the other notion is that it preserves the timing structure. All these works derive
transference results with respect to timed linear temporal logic or metric interval temporal logic.
Conformance of stochastic systems has been less studied. The authors in [57] propose
precision and recall conformance measures based on the notion of entropy of stochastic automata.
The authors in [58] use the Wasserstein distance to quantify distance between two stochastic systems,
which is fundamentally different from our approach. (Bi)simulation relations for stochastic systems
were studied in [59–61]. Such techniques can define behavioral relations for systems [62, 63], and
they can be used to transfer verification results between systems [64]. The authors in [65] utilize such
behavioral relations to verify RL policies between a concrete and an abstract system. We remark
that bisimulations are difficult to compute, see e.g., [66], unlike our approach. Probably closest to
our work is [33]. However, in that work conformance is task specific, which allows two systems to
be conformant w.r.t. a system specification even when the systems produce completely different
trajectories. Additionally, we consider a worst-case notion of conformance where no information
about the input that excites both stochastic systems is available.
8 Conclusion
We studied conformance of stochastic dynamical systems. Particularly, we defined conformance
between two stochastic systems as probabilistic bounds over the distribution of distances between
model trajectories. Additionally, we proposed the non-conformance risk to reason about the risk of
stochastic systems not being conformant. We showed that both notions have the transference property,
meaning that conformant systems satisfy similar system specifications. Lastly, we showed how
stochastic conformance and the non-conformance risk can be estimated from data using statistical
tools such as conformal prediction.
The authors would like to thank the anonymous reviewers for their feedback. The National Science
Foundation supported this work through the following grants: CAREER award (SHF-2048094),
CNS-1932620, CNS-2039087, FMitF-1837131, CCF-SHF-1932620, the Airbus Institute for Engineering
Research, and funding by Toyota R&D and Siemens Corporate Research through the USC
Center for Autonomy and AI.
(y, τ ) |= ϕ′ UI ϕ′′ iff ∃τ ′′ ∈ (τ ⊕ I) ∩ T s.t. (y, τ ′′ ) |= ϕ′′ and ∀τ ′ ∈ (τ, τ ′′ ) ∩ T, (y, τ ′ ) |= ϕ′ .
The robust semantics ρϕ (y, t) provide more information than the semantics (y, t) |= ϕ, and
indicate how robustly a specification is satisfied or violated. We first define the predicate robustness
\[
\mathrm{dist}^{\mu}(y, \tau) := \inf_{y' \in \mathrm{cl}(O^{\mu})} \| y(\tau) - y' \|,
\]
where Oµ := {y ∈ Rn | h(y) ≥ 0} denotes the set of all states that satisfy the predicate µ, cl(·)
denotes the closure of a set, and ∥ · ∥ denotes a vector norm. We can now recursively calculate
ρϕ (y, t) based on the structure of ϕ using the following rules:
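\begin{align*}
\rho^{\mathrm{True}}(y, \tau) &:= \infty, \\
\rho^{\mu}(y, \tau) &:= \begin{cases} \mathrm{dist}^{\neg\mu}(y, \tau) & \text{if } y(\tau) \in O^{\mu} \\ -\mathrm{dist}^{\mu}(y, \tau) & \text{otherwise,} \end{cases} \\
\rho^{\neg\phi}(y, \tau) &:= -\rho^{\phi}(y, \tau), \\
\rho^{\phi' \wedge \phi''}(y, \tau) &:= \min\big(\rho^{\phi'}(y, \tau), \rho^{\phi''}(y, \tau)\big), \\
\rho^{\phi' U_I \phi''}(y, \tau) &:= \sup_{\tau'' \in (\tau \oplus I) \cap T} \min\Big(\rho^{\phi''}(y, \tau''), \inf_{\tau' \in (\tau, \tau'') \cap T} \rho^{\phi'}(y, \tau')\Big).
\end{align*}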
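The recursive rules above can be illustrated with a minimal sketch for sampled signals. It makes several assumptions that are not part of the paper: signals are stored as arrays of shape (N, n) over a finite grid of sample times, predicates are half-spaces h(y) = a⊤y + b with the Euclidean norm (so the predicate robustness reduces to a signed distance to the boundary), and the helper names `predicate`, `neg`, `conj`, and `until` are hypothetical.

```python
import numpy as np

def predicate(a, b):
    """Robustness of mu: a^T y + b >= 0 (assumed half-space predicate).
    With the Euclidean norm, this is the signed distance of y[k] to the boundary."""
    a = np.asarray(a, dtype=float)
    norm_a = np.linalg.norm(a)
    return lambda y, k: float((a @ y[k] + b) / norm_a)

def neg(rho):
    """Negation: flip the sign of the robustness."""
    return lambda y, k: -rho(y, k)

def conj(rho1, rho2):
    """Conjunction: minimum of the two robustness values."""
    return lambda y, k: min(rho1(y, k), rho2(y, k))

def until(rho1, rho2, t, lo, hi):
    """phi' U_[lo,hi] phi'' on the sample times t.
    Outer sup over samples k2 with t[k2] - t[k] in [lo, hi];
    inner inf over samples strictly between k and k2 (empty inf is +inf)."""
    def rho(y, k):
        best = -np.inf
        for k2 in range(len(t)):
            if lo <= t[k2] - t[k] <= hi:
                inner = min((rho1(y, k1) for k1 in range(k + 1, k2)), default=np.inf)
                best = max(best, min(rho2(y, k2), inner))
        return best
    return rho

# Illustrative data (not from the paper): a scalar signal with four samples.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([[1.0], [2.0], [-1.0], [3.0]])
mu1 = predicate([1.0], 0.0)    # y >= 0
mu2 = predicate([1.0], -2.5)   # y >= 2.5
print(until(mu1, mu2, t, 0.0, 3.0)(y, 0))
```

A positive value indicates satisfaction with margin and a negative value indicates violation, which is the extra information the robust semantics carry over the Boolean semantics.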
The proof idea is largely based on ideas from [34, Lemma 2] in which it was, however, only stated that
ρϕ (y1 , τ )−ρϕ (y2 , τ ) ≤ d∞ (y1 , y2 ). While the other direction follows trivially, we present the full proof
using the notation of this paper for the convenience of the reader. Let us now assume that the STL
formula ϕ is in positive normal form, i.e., that ϕ contains no negations. This assumption is made
without loss of generality and the result holds for any STL formula since every STL formula ϕ can be
rewritten into a semantically equivalent STL formula that is in positive normal form [67, Proposition
2]. We will perform the proof recursively on the structure of the formula ϕ.
Predicates. First note that dist¬µ (y, τ ) is a Lipschitz continuous function with Lipschitz
constant one in the sense that |distµ (y1 , τ ) − distµ (y2 , τ )| ≤ ∥y1 (τ ) − y2 (τ )∥, see for instance [68,
Chapter 3]. Accordingly, it can easily be seen that ρµ (y, τ ) is Lipschitz continuous with Lipschitz
constant one so that |ρµ (y1 , τ ) − ρµ (y2 , τ )| ≤ ∥y1 (τ ) − y2 (τ )∥ ≤ d∞ (y1 , y2 ).
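Conjunctions. For conjunctions ϕ′ ∧ ϕ′′ and by the induction assumption, it holds that
$|\rho^{\phi'}(y_1, \tau) - \rho^{\phi'}(y_2, \tau)| \le d_\infty(y_1, y_2)$ and $|\rho^{\phi''}(y_1, \tau) - \rho^{\phi''}(y_2, \tau)| \le d_\infty(y_1, y_2)$. Now, it follows that
\begin{align*}
\rho^{\phi' \wedge \phi''}(y_1, \tau) &= \min\big(\rho^{\phi'}(y_1, \tau), \rho^{\phi''}(y_1, \tau)\big) \\
&\ge \min\big(\rho^{\phi'}(y_2, \tau), \rho^{\phi''}(y_2, \tau)\big) - d_\infty(y_1, y_2) \\
&= \rho^{\phi' \wedge \phi''}(y_2, \tau) - d_\infty(y_1, y_2).
\end{align*}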
\begin{equation}
d_{sk}(y_1, y_2) = \inf_{r} \max\big(d_\infty(r(t), t),\ d_\infty(y_1, y_2 \circ r)\big). \tag{8}
\end{equation}
For every retiming r(t), the operand of the inf operator is the maximum between the magnitude of
the retiming and the d∞ distance between y1 and the retimed y2 . The Skorokhod distance is then the
infimum across all retimings.
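To make the inf-max structure concrete, the following sketch evaluates the operand of the infimum for a small family of candidate retimings and keeps the smallest value. Because only finitely many retimings are tried, the result is merely an upper bound on a grid approximation of the Skorokhod distance, not the distance itself. The scalar signals, the shared sampling grid, the one-knot piecewise-linear retimings, and all function names are assumptions made for illustration.

```python
import numpy as np

def retiming_cost(t, y1, y2, r):
    """Operand of the infimum for one retiming r, sampled on the grid t:
    max( sup |r(t) - t| , sup |y1(t) - y2(r(t))| ) for scalar signals."""
    time_dev = np.max(np.abs(r - t))
    value_dev = np.max(np.abs(y1 - np.interp(r, t, y2)))  # y2 linearly interpolated at r(t)
    return max(time_dev, value_dev)

def one_knot_retiming(t, c, s):
    """Increasing piecewise-linear map of [t[0], t[-1]] onto itself that sends c to c + s."""
    return np.interp(t, [t[0], c, t[-1]], [t[0], c + s, t[-1]])

def skorokhod_upper_bound(t, y1, y2, knots, shifts):
    """Smallest cost over the identity and all admissible one-knot retimings."""
    best = retiming_cost(t, y1, y2, t)  # identity retiming: cost equals d_inf(y1, y2)
    for c in knots:
        for s in shifts:
            # Keep the knot and its image strictly inside the time domain
            # so that the retiming stays strictly increasing.
            if t[0] < c < t[-1] and t[0] < c + s < t[-1]:
                r = one_knot_retiming(t, c, s)
                best = min(best, retiming_cost(t, y1, y2, r))
    return best

# Illustrative data (not from the paper): a sine wave and a delayed copy.
t = np.linspace(0.0, 10.0, 1001)
y1 = np.sin(t)
y2 = np.sin(t - 0.3)
ub = skorokhod_upper_bound(t, y1, y2, knots=np.linspace(1.0, 9.0, 9),
                           shifts=np.linspace(-0.5, 0.5, 21))
print(ub, np.max(np.abs(y1 - y2)))
```

Since the identity retiming is always among the candidates, the returned bound never exceeds the sampled d∞ distance; a richer family of retimings would tighten it further.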
Proposition 5. Let ϕ be an STL specification and let dsk : Y × Y → R be the Skorokhod metric.
If the output signals are Lipschitz-continuous with Lipschitz constant Ky , it holds that the robust
semantics ρϕ are Hölder continuous w.r.t. dsk with H = (1 + Ky ) and γ = 1.
Proof. Let r(t) be the optimal retiming in the definition of the Skorokhod metric. Let dsk (y1 , y2 ) = α
for brevity. Under the optimal retiming r, this means that d∞ (r(t), t) ≤ α and d∞ (y1 , y2 ◦ r) ≤ α.
From the second term, we derive
\begin{align*}
\alpha &\ge d_\infty(y_1, y_2 \circ r) \\
&= \sup_t \| y_1(t) - y_2(r(t)) \| \\
&= \sup_t \| (y_1(t) - y_2(t)) - (y_2(r(t)) - y_2(t)) \| \\
&\overset{(a)}{\ge} \sup_t \| y_1(t) - y_2(t) \| - \sup_t \| y_2(r(t)) - y_2(t) \| \\
&\overset{(b)}{\ge} d_\infty(y_1, y_2) - K_y \sup_t | r(t) - t | \\
&\overset{(c)}{\ge} d_\infty(y_1, y_2) - K_y \alpha,
\end{align*}
where we used the reverse triangle inequality in (a), the fact that y2 (t) is Lipschitz continuous with
constant Ky in (b), and that −d∞ (r(t), t) ≥ −α in (c). In other words, α ≥ d∞ (y1 , y2 ) − Ky α, or
\[
d_\infty(y_1, y_2) \le (1 + K_y)\,\alpha = (1 + K_y)\, d_{sk}(y_1, y_2).
\]
From Proposition 4, we know that |ρϕ (y1 ) − ρϕ (y2 )| ≤ d∞ (y1 , y2 ) so that it follows that
\[
| \rho^{\phi}(y_1) - \rho^{\phi}(y_2) | \le (1 + K_y)\, d_{sk}(y_1, y_2).
\]
D Hölder Continuity of $C(y) := \int_0^T y(t)^\top y(t)\,\mathrm{d}t$ w.r.t. d1
Proposition 6. Let $C(y) := \int_0^T y(t)^\top y(t)\,\mathrm{d}t$ be the performance function and let d1 : Y × Y → R
be the L1 signal metric. If the output signal is of bounded magnitude, i.e., there exists ymax s.t.
∥y(t)∥ < ymax , then it holds that the performance function C is Hölder continuous w.r.t. d1 with
H = 2ymax and γ = 1.
where (a) follows by the use of the reverse triangle inequality and (b) follows by the boundedness
assumption of signals. The other direction follows similarly so that the result follows trivially.
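One chain of inequalities consistent with the labelled steps (a) and (b) is sketched below. It is a reconstruction under stated assumptions rather than the authors' exact derivation: the integral is taken over a finite horizon [0, T], ∥ · ∥ is the Euclidean norm so that y(t)⊤y(t) = ∥y(t)∥², and the L1 signal metric is taken to be d1(y1, y2) := ∫ ∥y1(t) − y2(t)∥ dt over the same horizon.

```latex
\begin{align*}
C(y_1) - C(y_2)
  &= \int_0^T \big( \|y_1(t)\|^2 - \|y_2(t)\|^2 \big)\,\mathrm{d}t \\
  &= \int_0^T \big( \|y_1(t)\| + \|y_2(t)\| \big)\big( \|y_1(t)\| - \|y_2(t)\| \big)\,\mathrm{d}t \\
  &\overset{(a)}{\le} \int_0^T \big( \|y_1(t)\| + \|y_2(t)\| \big)\, \|y_1(t) - y_2(t)\|\,\mathrm{d}t \\
  &\overset{(b)}{\le} 2 y_{\max} \int_0^T \|y_1(t) - y_2(t)\|\,\mathrm{d}t
   = 2 y_{\max}\, d_1(y_1, y_2).
\end{align*}
```

Exchanging the roles of y1 and y2 gives the same bound for C(y2) − C(y1), which yields |C(y1) − C(y2)| ≤ 2ymax d1(y1, y2), matching H = 2ymax and γ = 1.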