1 Introduction
Layered queueing network (LQN) models are well matched to analyzing the performance of large distributed server systems such as web services systems based on microservices in the cloud. Models of hundreds of services can be solved in seconds or minutes. However, automated model construction produces models with excessive detail, and automated analysis techniques must explore many variations (often thousands) of a system. A simplified model, focused on the important system resources, may be essential.
Expert modellers decide what to include in a performance model and what to ignore or approximate. However, automated construction from execution data (e.g. [1, 2, 18, 41]) or design data (e.g. [1, 28, 39]) includes every component. Very large models are not useful for automated analysis such as optimization of the deployment [5, 24] or of the design (e.g., Reference [28], which required over 1,000 model evaluations). Since one of the goals of the automated techniques is to support analysts with less expertise, focused modeling is an enabling technology for automation.
Performance models are always simplifications of reality, to some degree, and many techniques have been devised to further simplify a given model, depending on its form. For queueing models, the Flow-Equivalent Service Centre (FESC) [23] is used to represent a subsystem by a single server. However, simultaneous resource possession in LQNs [19] limits its usefulness for layered queueing, so this work looks elsewhere for a simplification approach. Other formalisms such as Petri Nets and Markov Chains are not considered here, because they do not scale up well enough for modeling large server systems. However, they do have a large literature on simplification, including state lumping and aggregation [3, 26, 32], sometimes guided by symmetries in the corresponding structural model (as in well-formed nets [14]).
A motivating example is shown in Figure 1(a): an original model (OM) with 86 components created automatically from a system design in the Palladio software tool [28]. The analysis reported in Reference [28] required over 1,300 model evaluations and took many hours. The focused model (FM) in Figure 1(b) has only eight components and gives almost identical predictions for performance under varying loads. To compare the run times of the models, 1,000 evaluations of each for varying parameters on a desktop computer required 6,311 seconds (1 hour, 45 minutes, and 11 seconds) for the OM and 6.42 seconds for the FM, a factor of nearly 1,000.
The goal of this work is to analyze the sensitivity of LQN models to changes in design, configuration, and deployment using aggregated models. Aggregation focusses on the key model elements, while preserving enough of the model to give sufficient accuracy for a desired range of changes to the base model. Earlier incomplete work by the same authors [15, 16] introduced special dependency groups for aggregation. This work completes that approach and determines how to ensure the accuracy of the sensitivities. It describes an algorithm, FSPT, in which:
• Layered queueing models are simplified to support automated sensitivity analysis.
• The analyst can choose which components to preserve in the model focus and a range of system throughputs (system scales) for which sensitivities will be computed. Preservation of components provides traceability of their parameter impacts.
• Additional components are preserved to cover the desired range of system scales.
• The non-preserved components are aggregated in groups determined by system dependencies.
The contributions of this article are, first, to fill a gap in the earlier method, including an updated evaluation of accuracy. Second, an effectiveness index is defined to characterize models that resist simplification. Third, an “Accurate Sensitivity Hypothesis” (ASH) is defined that guides the correct use of simplified models. Fourth, the nature of sensitivity results is explored on randomly generated models. For simplifications satisfying the ASH, over 90% of many thousands of sensitivity calculations were found to be accurate within 20%. In Section 4, it is found empirically that the accuracy of a predicted change in performance depends on the resource saturation of the aggregated and non-aggregated components, and on the scale-up (the change in the system throughput). The saturation \(S_i\) for each component and a saturation ratio (SR) for the focused model are given by the following:

\(S_i = U_i/m_i\), the utilization of component \(i\) divided by its multiplicity, and

\(SR = \max_i S_i \,/\, \max_{i\ \mathrm{aggregated}} S_i\), the ratio of the saturation of the most saturated resource to that of the most saturated aggregated (non-preserved) resource.
Resource multiplicity is defined in Section 2.1. A larger SR gives a broader focus (by preserving more components) and a more accurate approximation. If SR is not large enough, then errors related to queueing delays at out-of-focus servers may become significant when a model change increases the load. Changes that decrease load tend to have decreased errors, which is intuitively reasonable.
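As a concrete illustration of these quantities, the following sketch computes \(S_i\) and SR from solver outputs. The component names, utilizations, and multiplicities are hypothetical, and the functions are not part of the FSPT tooling:

```python
# Hedged sketch: per-component saturation S_i = U_i / m_i and the
# saturation ratio SR of a focused model. Component data are illustrative.

def saturation(utilization, multiplicity):
    """S_i: utilization of a component divided by its multiplicity."""
    return utilization / multiplicity

def saturation_ratio(components, preserved):
    """SR = (largest saturation overall) / (largest saturation among
    components that are aggregated, i.e., not preserved)."""
    sats = {name: saturation(u, m) for name, (u, m) in components.items()}
    max_all = max(sats.values())
    max_aggregated = max(s for name, s in sats.items() if name not in preserved)
    return max_all / max_aggregated

# Hypothetical solver output: name -> (utilization, multiplicity)
components = {
    "t01": (0.90, 1), "t07": (0.60, 1), "p00": (2.40, 4), "t12": (0.15, 1),
}
preserved = {"t01", "t07", "p00"}

sr = saturation_ratio(components, preserved)
print(round(sr, 2))  # 0.90 / 0.15 = 6.0
```

With this SR, the focus is broad: the most saturated aggregated component (t12) is far below the bottleneck.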
The following empirical and heuristic condition for accurate sensitivity predictions is introduced in this article, as described in Section 4.7.
Accurate Sensitivity Hypothesis
Performance predictions by a FM with Saturation Ratio SR will “almost always” have “acceptable accuracy” if the throughput for the prediction does not increase or decrease by a multiplier larger than SR − 2. From the results analyzed in Section 4.7, “acceptable accuracy” was taken to be errors of less than 20%, for which “almost always” means “in more than 90% of cases.”
The ASH defines a “trusted range” for results, in which the throughput multiplier lies in the interval \((1/(SR-2),\,SR-2)\); it implies that SR must exceed 2 to reliably give acceptably accurate sensitivities.
Section 4 describes a wide-ranging evaluation that supports this hypothesis and shows that it is conservative, in the sense that many predictions that do not satisfy it still have less than 20% error.
4 Sensitivity Analysis Using A Focused Model
Models are used to predict the impact of changes through sensitivity analysis. Suppose the parameters to be changed are collected in a parameter vector a and the performance measure of interest is the throughput λ(a). We define the following:
• \(\bar{a} =\) base value of parameter vector a,
• \(\lambda_{OM}(a)\) and \(\lambda_{FM}(a) =\) the throughput of models OM and FM for parameter values given by a,
• \(M(\bar{a},a) = \lambda_{OM}(a)/\lambda_{OM}(\bar{a}) =\) the ratio of throughputs in OM due to a change in a,
• \(\Delta\lambda_{OM}(\bar{a},a) = \lambda_{OM}(a) - \lambda_{OM}(\bar{a}) =\) the change in throughput in OM due to a change in a,
• \(Err_{Sens}(\bar{a},a) = [\Delta\lambda_{FM}(\bar{a},a) - \Delta\lambda_{OM}(\bar{a},a)]/\lambda_{OM}(\bar{a}) =\) the relative error of the change in throughput estimated by FM.
The relative error \(Err_{Sens}(\bar{a},a)\) will be examined as a function of the parameter change \((a - \bar{a})\). The target value for “adequate” accuracy of \(Err_{Sens}\) was arbitrarily chosen at 20%. Based on the ASH stated in Section 1 and discussed below, the “trusted range” of results for a FM with saturation ratio SR satisfies:
• \(1/(SR - 2) < M(\bar{a},a) < SR - 2\) (for the “trusted range”)
This implies that SR must be more than 2 to provide trusted sensitivity results, and trust depends on the throughput change.
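These definitions translate directly into code. The sketch below uses hypothetical throughput values; the solver that would produce them is not shown:

```python
# Sketch of the sensitivity measures defined above (illustrative values).

def throughput_multiple(lam_om_base, lam_om):
    """M(a_bar, a) = lambda_OM(a) / lambda_OM(a_bar)."""
    return lam_om / lam_om_base

def err_sens(lam_om_base, lam_om, lam_fm_base, lam_fm):
    """Relative error of the throughput change estimated by the FM,
    normalized to the OM base throughput."""
    d_fm = lam_fm - lam_fm_base
    d_om = lam_om - lam_om_base
    return (d_fm - d_om) / lam_om_base

def in_trusted_range(m, sr):
    """ASH trusted range: requires SR > 2 and 1/(SR-2) < M < SR-2."""
    return sr > 2 and 1.0 / (sr - 2) < m < sr - 2

# Hypothetical throughputs for a base case and a changed case:
m = throughput_multiple(10.0, 12.0)       # M = 1.2
e = err_sens(10.0, 12.0, 10.1, 12.4)      # (2.3 - 2.0) / 10 = 0.03
print(m, round(e, 3), in_trusted_range(m, 3.33))  # 1.2 0.03 True
```

For SR = 3.33 the trusted range is (0.75, 1.33), so the multiple 1.2 is trusted; a multiple of 1.5 would fall outside it.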
4.1 Sensitivity Calculation: An Example
A first example considers the OM and FM shown in Figure 13, and some parameters of the preserved resources \(\mathbf{PR} = \{Users,\,t01,\,t07,\,p00\}\). Figure 14 shows the throughput changes \(\Delta\lambda_{OM}\) and \(\Delta\lambda_{FM}\) when three different parameters are varied one at a time: the demand of t01 in part (a), the demand of t07 in part (b), and the multiplicity of p00 in part (c). The throughputs for OM and FM are almost identical in all three cases. These results are encouraging, but not all models give such good results.
4.2 Case Study: Sensitivity of the Telephone Switch Model
The OM of a Class V telephone switch was described in Section 3.5 and Figure 8 above. An FM with saturation ratio SR = 3.33 was created, shown as TS-FM2 in Figure 8(c), with six tasks and five hosts. The trusted range of throughput multiples is the interval (0.75, 1.33).
A set of 1,000 sensitivity cases was calculated in which all the demand parameters of every preserved task were changed by an independent random factor, uniformly distributed over (0.5, 1.5), giving a range of ±50% from the base-case values. Figure 15 shows histograms of the errors for the cases with results within the trusted range for 300 users. For 300 users, the system is not saturated, so the throughput changes are relatively small and the errors are all less than 2.5%. For 400 users, loads are heavier, some cases have throughput changes outside the trusted range, sensitivities are larger, and 117 sensitivities have errors above 20%.
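The random perturbation scheme used in these experiments can be sketched as follows. The task names and base demands are hypothetical, and the LQN solver call that would evaluate each case is omitted:

```python
# Sketch of the randomized sensitivity experiment: each demand parameter
# of every preserved task is multiplied by an independent factor drawn
# uniformly from (0.5, 1.5). Purely illustrative; no solver is invoked.
import random

def perturb_demands(demands, rng):
    """Return a new demand vector with each entry scaled by U(0.5, 1.5)."""
    return {task: d * rng.uniform(0.5, 1.5) for task, d in demands.items()}

rng = random.Random(42)  # fixed seed for repeatability
base_demands = {"t01": 1.0, "t07": 0.4, "t09": 2.5}

cases = [perturb_demands(base_demands, rng) for _ in range(1000)]

# Every perturbed demand stays within +/-50% of its base value:
assert all(0.5 * base_demands[t] <= d <= 1.5 * base_demands[t]
           for case in cases for t, d in case.items())
print(len(cases))  # 1000
```

Each of the 1,000 perturbed demand vectors would then be solved in both OM and FM to obtain one \(Err_{sens}\) sample.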
Greater insight into the importance of the trusted range is given by the plot in Figure 15 of \(Err_{sens}(\bar{a},a)\) against the throughput multiple \(M(\bar{a},a)\). For 400 users, many throughput changes fall outside the trusted range, and the relevance of the Accurate Sensitivity Hypothesis is displayed in the larger errors for many of these points. Within the trusted range, 93.4% of the results have “adequate” accuracy \((Err_{sens} < 0.2)\); outside it, only about half do.
4.3 Large-scale Sensitivity Experiments
To move away from single cases and seek insight into the general validity of the sensitivity analysis, the set of 150 OMs used in Section 3.3 for assessing the base-case FMs was re-analyzed for sensitivity. Three sets of FMs were created using three different strategies:
• The Accuracy strategy ACC2, which was also used in Section 3. For each OM, the size of PT was increased one task at a time until the FM gave Err < 2%. The saturation ratios SR are variable, ranging from 1 to 518.
• The Saturation strategy SAT3.3 (\(SR = 3.3\), giving a moderately broad focus).
• The Saturation strategy SAT10 (\(SR = 10\), giving a very broad focus).
4.4 Sensitivity to CPU Demand Parameters
As in Section 4.2, the demand parameters of all the preserved tasks were multiplied by a random factor uniformly distributed over (0.5, 1.5). Nine random FMs were created for each OM, giving a total of 1,350 cases for each strategy. Figure 17 shows the error histogram for each strategy; there are fewer than 1,350 points because of a few non-convergent model solutions.
The ACC2 strategy gave the worst results (Figure 17(a)), but still over 1,300 of the 1,350 points have errors of less than 20%. The large errors occurred in cases with low saturation ratios SR, in the range (1.07, 1.36), combined with high-impact parameter changes. For SAT3.3 and SAT10, all the results have errors of less than 20%. The advantage of using the SAT strategies for sensitivity analysis seems clear.
Figure 18 shows scatter plots against the throughput multiple M. For the ACC2 FMs it is clear that the largest errors are associated with M > 1. Since SR is different for every FM, a trusted range cannot be mapped. For the SAT cases, the trusted range in M can be plotted and is (0.75, 1.33) in part (b) and (0.125, 8) in part (c). SAT10 is only slightly better than SAT3.3, indicating that the extra-large focus was not necessary for these parameter changes.
4.5 Sensitivity to System Scaling Changes
System scaling increases the capacity of resources to handle larger loads on the system. These experiments applied a common scale factor “scale” (ranging from 2 to 10) to the capacity of every preserved resource (tasks and hosts) and to the number of users, and then evaluated the throughput. Successful scaling should increase the throughput by the same factor. Figure 19(a) shows the throughputs against the scale factor, and in Figure 19(b) the points along the line show \(Err_{sens}\) for the successive scale factors (scale = 1, 2, \(\ldots\)) against M, the OM throughput multiple. The FM throughput increases more than that of the OM, because a resource that saturates in OM has been aggregated with more lightly used resources, which share the load. This model provided limited scaling, but in some cases both model throughputs scaled all the way up to 10 times.
4.6 Large-scale Experiments on Resource Scaling
As described above, these experiments scaled the resources of the 150 random models by factors scale from 2 to 10. Because the changes in throughput are large, the normalization of the error was changed. For large deviations it is better to normalize to \(\Delta\lambda_{OM}\) instead of \(\lambda_{OM}\), giving the measure \(Err_{largeSens}\):

\(Err_{largeSens}(\bar{a},a) = [\Delta\lambda_{FM}(\bar{a},a) - \Delta\lambda_{OM}(\bar{a},a)]/\Delta\lambda_{OM}(\bar{a},a)\).
Using this normalization of the error, Figure 20 shows histograms of \(Err_{largeSens}\) that are similar to the histograms for demand sensitivities. Only the SAT10 cases are mostly accurate. The ACC2 and SAT3.3 cases have poor accuracy in many cases, because the saturation ratios of the FMs are not large enough for aggressive scaling. This issue was investigated more deeply for the ACC2 cases, which covered a wide range of ratios, in an attempt to identify the inaccurate results for the analyst; that investigation led to the Accurate Sensitivity Hypothesis.
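A minimal sketch of this large-change error measure, with hypothetical throughput values (it mirrors \(Err_{sens}\) but places the OM change in the denominator):

```python
# Sketch of the large-change error measure: for big throughput changes the
# FM error is normalized to the OM change itself rather than to the base
# throughput. Values are illustrative.

def err_large_sens(lam_om_base, lam_om, lam_fm_base, lam_fm):
    """[dLambda_FM - dLambda_OM] / dLambda_OM."""
    d_om = lam_om - lam_om_base
    d_fm = lam_fm - lam_fm_base
    return (d_fm - d_om) / d_om

# A scale-up that triples OM throughput but (erroneously) 3.3x's FM throughput:
e = err_large_sens(10.0, 30.0, 10.0, 33.0)
print(round(e, 2))  # (23 - 20) / 20 = 0.15
```

Normalizing to the change keeps the measure meaningful when the throughput itself grows by a large factor.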
4.7 The ASH (Accurate Sensitivity Hypothesis) and the Trusted Range of Scale-Up
Deeper analysis of the ACC2 case errors in Figure 20 led to the Accurate Sensitivity Hypothesis stated in Section 1. Many of the errors were associated with small values of the saturation ratio SR, which in turn indicates that some resources are close to saturation in OM but are not preserved. As the throughput increases (and M becomes large) these resources become saturated in OM but not in FM, and the errors become substantial. An example called “case-30” is shown in Figure 19 above.
To examine this relationship, the value of SR was found for each FM produced by ACC2. Figure 21 plots M against SR separately for the “accurate” and “inaccurate” results and shows that inaccurate results emerge when M exceeds a threshold that grows roughly linearly with SR. The sloping boundary of the grey area was chosen by eye so it contains almost all the inaccurate results; it represents M = SR − 2. The points inside the grey area violate the ASH condition, while the others satisfy it and define a “trusted range” of M for a FM with a given SR. There are many false negatives, showing that the hypothesis is quite conservative. There are also a few false positives, showing that the trusted range does not provide a guarantee, but they are rare.
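The interplay between SR and scale-up can be illustrated with a small sketch: under ideal scaling the throughput multiple M equals the scale factor, so a scale factor f lies in the trusted range only when f < SR − 2. The two saturation ratios below match the SAT3.3 and SAT10 strategies; the enumeration itself is illustrative:

```python
# Sketch: which ideal scale-up factors lie in the ASH trusted range
# (1/(SR-2), SR-2) for a given saturation ratio SR. Illustrative only.

def in_trusted_range(m, sr):
    """ASH trusted range requires SR > 2 and 1/(SR-2) < M < SR-2."""
    return sr > 2 and 1.0 / (sr - 2) < m < sr - 2

for sr in (3.3, 10.0):
    trusted_scales = [f for f in range(2, 11) if in_trusted_range(float(f), sr)]
    print(sr, trusted_scales)
# With SR = 3.3 no scale-up of 2x or more is trusted; with SR = 10,
# scale factors 2 through 7 fall inside the trusted range.
```

This is consistent with the observation in Section 4.6 that only the SAT10 cases gave mostly accurate results under aggressive scaling.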
4.8 Accuracies for Cases Satisfying the ASH
We can revisit the cases described so far to see how successful the ASH is in predicting accurate results. Table 1 summarizes the number of results that satisfy the ASH and the number of these with less than 20% error, using the \(Err_{sens}\) measure for the demand studies and the \(Err_{largeSens}\) measure for the scaling studies.
For the demand sensitivity results, the trusted range was found to be reliable, but for the scaling results there are some “false positives” (results that test as accurate but are not): about 2% for the SAT10 cases, about 7% for the ACC2 cases, and 7.5% for some of the telephone switch results. Most of those errors are just over 20% and few exceed 30%.
4.9 Sensitivity Analysis on Simplified Models: Summary
The random cases in this section apply many large parameter changes and thus place a heavy stress on the accuracy of sensitivities. Accurate sensitivities require a model with a large-enough saturation ratio SR, and the Accurate Sensitivity Hypothesis provides insight to the analyst when SR is not large enough. When the ASH condition was met, over 90% of cases had errors of less than 20%. The relative error was normalized to the base value of the measure for small changes (less than 100%) and to the OM change itself for larger changes.
5 Related Work
We can speak of data-driven, subsystem-driven, and problem-driven simplification approaches for queueing models and other performance models.
A data-driven approach controls the complexity of a model fitted to data using statistical estimates of accuracy (see, e.g., Reference [22]). In References [30, 41] the authors used stepwise fitting and statistical tests to simplify a workload model with hundreds of user classes. In Reference [38] it was used to add components to a performance model. Machine learning models are also data driven (e.g., References [21, 36]) and can exploit simplification, as in Reference [8], to avoid over-fitting.
Subsystem-driven simplification replaces selected subsystems by a surrogate delay [19] or a server, and often uses the methods of model decomposition as described in References [7] and [29]. Surrogate delays have been applied to decomposing timed Petri Net models that are too large to solve into sets of smaller submodels (e.g., References [6, 25]). The special case of symmetrically replicated subsystems leads to symmetries that simplify the solution. In LQNs, there is already a model feature for declaring and solving replicated subsystems (see a summary in Reference [9]).
Queueing model class aggregation has been achieved by standard clustering methods [30].
In queueing networks, a FESC can be used to approximate a subsystem, as described in Reference [23]. A FESC is a server whose rate depends on its occupancy level, and for some models a FESC is an exact replacement [4]. However, FESCs have limited application to layered queueing models because of the simultaneous resource possession intrinsic to layered queueing. When a server inside the chosen subsystem makes a request to a server outside it and waits, there is a service-rate dependency on the state of the entire system rather than just on the subsystem. For this reason, we look beyond FESCs to represent subsystems in the present work.
Most research on decomposition has a different goal from the present article: to create a “divide and conquer” approximate solution technique for intractable models. Examples are Reference [7] for queueing networks, the LQN solver [10], and References [6, 25] for timed Petri Nets. The submodels in a decomposition may combine different modeling paradigms, as in Reference [13], which has a Markov Chain submodel.
These approaches have limitations. Surrogate delays require iteration to solve, which may be lengthy. A FESC requires repeated solutions of the submodel in isolation, for every population that may occur. For non-separable and layered queueing networks this is expensive.
Problem-driven simplification focusses the analysis on a particular problem. An example is bound analysis for asymptotic cases of light and heavy loads. There are well-known asymptotic and balanced-job bounds on response time and throughput for queueing networks [23], and additional bounds for LQNs [27].
In the context of these categories, FSPT is problem driven in that the focus can be tailored to the components of concern, wherever they are.
The simplification approaches described above use either structural aggregation (where a group of model elements corresponding to system structural elements are aggregated into a subsystem) or state aggregation (where a group of system states are aggregated into a meta-state).
Structural aggregation in queueing models has a counterpart in Petri net models, in which elements are “folded” into replacement elements. For instance, Stochastic Petri Nets may be simplified by using structural simplification rules or foldings [33] or well-formed nets [14].
A brief look at simplification of other kinds of models is useful for perspective. Besides structural models, there are state-based models (e.g., Markov Chains) that represent the system behaviour and are fundamental to many modeling approaches. However, state-based models often suffer from state explosion, motivating research to reduce the state space. State aggregation is a popular model-reduction method that reduces the system complexity by mapping its states into a small number of meta-states. For example, state aggregation is used for analyzing properties of exact and ordinary state lumping based on symmetries in the system [3], dynamic load-balancing policies [26], and reliability analysis of hybrid systems [32]. State aggregation is also used in machine learning, where model simplification can be used to solve overfitting issues that are due to making a model more complex than necessary. While state-based models are valuable, they do not scale well enough, even when simplified, to describe the large heterogeneous layered service systems with very large numbers of users that are becoming common. Therefore, state-based models are out of the scope of this work.
Aggregation of a LQN subsystem to a multiserver provides a middle ground between a simple delay, which ignores contention in the subsystem, and an FESC, which represents its effect in detail but is expensive to build. Aggregation of groups of components chosen by the analyst, to give a multiserver for each group, was considered in Reference [15]. A later paper [16] showed how substantial errors can be avoided by a correct grouping of the components to be aggregated. However, this method, called SPT, has a serious limitation: tasks in different groups that are deployed to the same processor could not be aggregated at all. The present work completes SPT and evaluates its use.
Overall, we are unaware of any prior work on deriving a simplified layered queueing model directly from a detailed one, apart from our own previous work [15, 16]. In particular, there is a lack of simplification techniques that avoid the scalability problems of calibrating a FESC.
6 Conclusions
Layered performance models can be successfully aggregated by preserving components with resource saturation levels above a threshold, which depends on the analyst's goals. Each model exhibits its own tradeoff between the degree of simplification and the accuracy of the approximate throughput or response time. Success was evaluated by the accuracy of the focused models and their sensitivities over many cases. Aggregation in queueing models is not exact, but the error can be controlled.
Larger models were found to simplify better, on average, than smaller ones, so simplification is most effective exactly where it is needed most. There is greater detail on this point in the thesis [17].
Because aggregation errors are tightly linked to resource saturation levels, real systems with many lightly utilized components will tend to give more effective simplification than the random-system cases reported here, which are generated to have roughly balanced saturations.
The principal use of performance models is to study the effect of changes. For accurate sensitivity calculations, the saturation ratio (between the most saturated resource and the most saturated resource that is aggregated) must be at least 2. This is part of an empirical criterion called the ASH, derived in Section 4.7. The aggregated model can also include components of special interest to the analyst; the process is under the analyst's control.
Two kinds of changes were studied in Section 4: changes in resource-demand parameters and in scale-related parameters. Over 90% of sensitivity results that satisfied the ASH had less than 20% error. Focused-model sensitivities for demand parameter changes of up to 50% were satisfactory provided the saturation ratio exceeds 3.3. Scale-related changes for scale factors up to 10 required a saturation ratio of 10, which is intuitively reasonable. Section 4 examined the sensitivity only to parameters of preserved components, but a parameter that has been aggregated can also be studied by re-applying the aggregation Equations (3) and (4).
FSPT can contribute to automated model-based performance analysis, since models made automatically are often unnecessarily complex, and advanced analysis techniques may solve the model many times (and thus can benefit from a smaller model). An example is the use of model predictions to continuously optimize the overall deployment of an application, which must complete each optimization cycle in seconds or minutes. Given that layered model solution techniques are approximately quadratic in model size [10], model complexity must be controlled.
FSPT can simplify all layered queueing models except those with calling cycles. It removes a limitation of the previous SPT algorithm, which could not aggregate entangled tasks. Some components, however, are not aggregated: those representing replicated subsystems, those with parallel sub-paths, with “second phases” of service, with asynchronous and forwarding calls, and with priority execution (see Reference [9] for descriptions of these features of LQNs). Such components are preserved in FSPT. Multiple classes of system users can be included by defining alternative choices made by a single pool of users.