To support decision-making while applying the ADD approach, we present a methodology for scalability assessment of alternative architectures that takes into account both logical and deployment relationships. In the following, we first introduce a high-level overview of the workflow to guide the decisions of architects (Section
4.1). Then, we dive deeper into the definition of the operational setting (Section
4.2), the load testing stage (Section 4.3), the measurement framework (Section 4.4), and the analysis workflow (Section 4.5).
4.3 Load Testing
As shown in Figure
8, this stage includes two load testing sessions that collect the response times of invocations to each SUT operation. Both sessions use the same usage profile. The former is referred to as the
baseline test session and the SUT is executed under a baseline deployment architecture
\(\alpha _0\) and load
\(\lambda _0\) for which the SUT is expected to operate with acceptable performance as described in Section
2.3.4. During this session, the response time of each operation is collected and the
scalability threshold in Equation (
1) is computed. The latter session is referred to as
target test session and it tests the SUT under a target operational profile defined by the selected loads
\(\Lambda\) and the alternative deployment architectures
DA. For each pair
\(({\lambda }, \alpha)\in \Lambda \times DA\) , a test is executed.
During each test, all invocations to SUT operations and corresponding response times are collected. The outcome of each test is the mean response time and the invocation frequency for each SUT operation.
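The following sketch illustrates how such a target test session could be automated. It is only an illustration: the function run_load_test, the result layout, and the use of architecture identifiers as keys are assumptions, not part of the approach.

```python
from statistics import mean

def target_test_session(loads, architectures, run_load_test):
    """Run one load test per (load, architecture) pair and summarize it."""
    results = {}
    for alpha in architectures:          # identifiers of the alternative deployment architectures DA
        for lam in loads:                # selected loads in Lambda
            # run_load_test is assumed to return, per operation, the list of
            # response times observed for all invocations during the test.
            samples = run_load_test(alpha, lam)   # {operation: [rt1, rt2, ...]}
            total = sum(len(rts) for rts in samples.values())
            results[(alpha, lam)] = {
                op: {
                    "mean_response_time": mean(rts),
                    "invocation_frequency": len(rts) / total,
                }
                for op, rts in samples.items()
            }
    return results
```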
According to our definition of architecture in Section
2,
\(\alpha\) also includes deployment aspects: deployment of servers to pods (
\(\mathit {deployment}_{sp}\) ) and deployment of pods to physical or virtual machines (
\(\mathit {deployment}_{pm}\)). This means that the test results also depend on the adopted infrastructure: achieving scalability requires identifying proper separation boundaries in the software architecture so that the underlying (scalable) infrastructure can be exploited effectively.
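As a purely illustrative encoding (names and values are hypothetical), the two deployment relations that characterize an architecture \(\alpha\) could be captured as follows:

```python
# deployment_sp maps servers to pods; deployment_pm maps pods to (virtual) machines.
deployment_sp = {"auth-server": "pod-a", "catalog-server": "pod-b"}
deployment_pm = {"pod-a": "vm-1", "pod-b": "vm-2"}
alpha = {"deployment_sp": deployment_sp, "deployment_pm": deployment_pm}
```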
4.4 Measurement Framework
This stage is fully automated and follows the
measurement framework illustrated in Figure
9. It starts with the high-level goal of
assessing the SUT scalability in terms of its capability of meeting the performance requirement under increasing load. It processes the response time and invocation frequency of all operations to provide four metrics for the analysis stage: the
(relative) Domain Metric,
scalability footprint,
scalability gap, and
performance offset described in the following.
Relative Domain Metric.
The relative
Domain Metric measures the overall scalability of a deployment architecture at a given load, as described in [
12]. It represents the probability that the SUT with deployment architecture
\(\alpha\) does not fail under a given load
\(\lambda \in \Lambda\) and is computed as follows:
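One formulation consistent with the description below (stated here as an assumption rather than the exact original, with \(\mathbb{1}[\cdot]\) denoting the indicator function) is
\[
\mathcal{DM}^{\alpha}(\lambda) \;=\; f(\lambda) \;-\; \sum_{j=1}^{n} {s}_j^{\alpha}(\lambda)\,\mathbb{1}\!\left[\,o_j \text{ fails under } \lambda\,\right],
\]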
where
\(f(\cdot)\) is the discretized distribution of loads (Section
2.3.5),
n is the number of operations, and
\({s}_j^{\alpha }(\lambda)\) is the scalability share of
\(o_j\) (Equation (
3)). When no operation fails with load
\(\lambda\) ,
\(\mathcal {DM}^{\alpha }(\lambda)\) is equal to
\(f(\lambda)\) and is less than
\(f(\lambda)\) otherwise. This difference measures the
scalability degradation due to the failed operations with load
\(\lambda\) :
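A plausible form, writing the degradation as \(\Delta^{\alpha}(\lambda)\) (a symbol introduced here for readability), is
\[
\Delta^{\alpha}(\lambda) \;=\; f(\lambda) - \mathcal{DM}^{\alpha}(\lambda) \;=\; \sum_{j \in \mathcal{S}} {s}_j^{\alpha}(\lambda),
\]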
where
\(\mathcal {S}\) is the set of indices of the operations that fail. Thus, a failing operation contributes with its scalability share (Equation (
3)) to the overall scalability degradation of the SUT. At each load, this scalability degradation can be visualized as the gap between the plot of the relative Domain Metric (inner polygon) and that of the discretized distribution (outermost polygon), as shown in Figure
10. Inner polygons approaching the outermost line indicate scalability closer to optimal from a system-level perspective.
By applying the Bayesian rule, we can also compute the total
Domain Metric, as follows:
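Interpreting \(\mathcal{DM}^{\alpha}(\lambda)\) as the joint probability of load \(\lambda\) occurring and the SUT not failing under it, a plausible form obtained by the law of total probability is
\[
\mathcal{DM}^{\alpha} \;=\; \sum_{\lambda \in \Lambda} \mathcal{DM}^{\alpha}(\lambda).
\]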
\(\mathcal {DM}^\alpha\) provides engineers with a single value that measures the overall SUT scalability with the deployment architecture
\(\alpha\) [
33].
Even though the total Domain Metric is an effective instrument for deciding among different deployment architectures, in some cases it may not explain subtle differences; thus, further investigation might be necessary. Figure
10 illustrates an example of such a situation. The two deployment architectures
\(\alpha\) and
\(\beta\) have the same total Domain Metric (0.72) but different scalability behavior over loads. To understand which alternative is better in these cases, we need to analyze the system at a lower abstraction level. To this end, we introduce the Scalability Footprint, which represents the scalability capability of each individual operation.
Scalability footprint. The scalability footprint measures the scalability level of each operation exposed by the SUT. To obtain it, we first define the
Greatest Successful Load (GSL)
\(\hat{\lambda }_j\) for an operation
\(o_j\) as the greatest load in
\(\Lambda\) for which
\(o_j\) succeeds. This implies that
\(o_j\) fails for all
\(\lambda \gt \hat{\lambda }_j\) . When
\(\hat{\lambda }_j\) equals the maximum load in
\(\Lambda ,\) \(o_j\) exhibits optimal scalability. The set of GSLs for all operations is referred to as
Scalability Footprint \(\hat{\Lambda }^{\alpha }\) of the SUT with deployment architecture
\(\alpha\) . In some cases, the footprint represents a strong boundary: when the average response time
\(\mu _j\) increases monotonically with the load, the operation
\(o_j\) also succeeds for all
\(\lambda \le \hat{\lambda }_j\) . In this case, the Scalability Footprint represents the boundary for which each operation always succeeds before and always fails after its GSL value.
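A minimal sketch of how the footprint can be derived from the test outcomes is shown below; the data layout (a per-operation mapping of loads to pass/fail outcomes) is an assumption made for illustration.

```python
def scalability_footprint(loads, succeeds):
    """
    loads: sorted list of tested loads (Lambda).
    succeeds: {operation: {load: bool}}, True if the operation met its
              scalability threshold at that load.
    Returns {operation: GSL}, i.e., the footprint for one deployment architecture.
    """
    footprint = {}
    for op, outcome in succeeds.items():
        passed = [lam for lam in loads if outcome.get(lam, False)]
        footprint[op] = max(passed) if passed else None  # None: fails at every load
    return footprint
```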
We use the scalability footprint to compare the alternative deployment architectures both qualitatively and quantitatively. The qualitative comparison relies on a specific visualization, namely a radar plot, as illustrated in Figure
11. Each circle in the grid of the radar represents a load
\(\lambda \in \Lambda\) . The distance between two circles is the frequency of the discretized distribution
f at the greater load. Each closed polygon in the radar represents the scalability footprint of a deployment architecture. Each vertex of a polygon is the GSL of the corresponding operation. For example, the blue and pink polygons in Figure
11 represent the footprints of
\(\alpha\) and
\(\beta\) over the five operations
\(\lbrace o_1,\ldots ,o_5\rbrace\) . Operations exposed by the same architectural component have the same font color (e.g., red and black operations in Figure
11 belong to components A and B, respectively). The vertices either reaching or exceeding the outermost grid circle indicate optimal scalability for the corresponding operation (e.g.,
\(o_5\) ).
The quantitative comparison between alternative deployment architectures relies on the Mann-Whitney U-test statistic [
34] and Cliff's delta effect size to classify its magnitude [
35]. The Mann-Whitney statistic
\(U_{\alpha \,\beta }\) measures how many times the GSLs in
\(\hat{\Lambda }^\alpha\) are greater than the GSLs in
\(\hat{\Lambda }^\beta\) for the same operations. According to [
34], the Mann-Whitney effect size
\(u_{\alpha \,\beta }\) is then computed as follows:
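Assuming the standard common-language form of the effect size, this is
\[
u_{\alpha\,\beta} \;=\; \frac{U_{\alpha\,\beta}}{\,|\hat{\Lambda}^{\alpha}|\cdot|\hat{\Lambda}^{\beta}|\,},
\]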
where
\(|\cdot |\) indicates the set cardinality. To classify its magnitude, the Mann-Whitney effect size is first converted to the non-parametric
Cliff's delta effect size
d:
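The conversion from the common-language effect size to Cliff's delta is presumably the standard one:
\[
d \;=\; 2\,u_{\alpha\,\beta} - 1.
\]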
It is then classified according to the standard categorization introduced in [
36]:
•
negligible (N) effect size, if \(|d|\lt 0.147\) ;
•
small (S) effect size, if \(|d|\lt 0.33\) ;
•
large (L) effect size, otherwise.
The categories measure the strength of the difference between two footprints. For instance, the value \(u_{\alpha \,\beta }=13\%\) with a large (L) effect size indicates that architecture \(\alpha\) is largely (L) less scalable ( \(13\%\lt 50\%\) ) than \(\beta\) .
This quantitative evaluation can be applied considering all of the operations of the SUT or just those exposed by a specific component depending on the desired granularity level.
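A compact sketch of this pairwise comparison is given below. The all-pairs form of the U statistic (with ties counted as one half) and the thresholds follow the standard definitions cited above; whether the approach uses exactly this variant is an assumption.

```python
def compare_footprints(gsl_a, gsl_b):
    """gsl_a, gsl_b: GSL values of the same operations under two architectures."""
    # Mann-Whitney U statistic over all pairs of GSLs (ties count as 0.5)
    U = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for a in gsl_a for b in gsl_b)
    u = U / (len(gsl_a) * len(gsl_b))   # effect size in [0, 1]; 0.5 means no difference
    d = 2 * u - 1                       # Cliff's delta in [-1, 1]
    if abs(d) < 0.147:
        magnitude = "negligible"
    elif abs(d) < 0.33:
        magnitude = "small"
    else:
        magnitude = "large"
    return u, d, magnitude
```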
Scalability gap and performance offset. With the aim of providing engineers with additional information at the operation level, we define two further metrics: the scalability gap and the performance offset of failing operations. The former measures the scalability loss and the latter the performance loss due to a failing operation. The
scalability gap \(\mathit {SG}_j\) of a failing operation
\(o_j\) is the scalability share (Equation (
2)) at the minimum
\(\lambda \in \Lambda\) greater than the GSL of
\(o_j\) :
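Writing \(\lambda_j^{+} = \min \lbrace \lambda \in \Lambda \,:\, \lambda \gt \hat{\lambda }_j \rbrace\) (notation introduced here for readability), this plausibly amounts to
\[
\mathit{SG}_j \;=\; {s}_j^{\alpha}\big(\lambda_j^{+}\big).
\]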
The
performance offset \(\mathit {PO}_j\) of a failing operation
\(o_j\) is the distance from the mean response time to the scalability threshold
\(\Gamma ^{0}_j\) at the minimum
\(\lambda \in \Lambda\) greater than the GSL of
\(o_j\) :
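With the same notation, and denoting by \(\mu_j(\lambda)\) the mean response time of \(o_j\) at load \(\lambda\), this plausibly amounts to
\[
\mathit{PO}_j \;=\; \mu_j\big(\lambda_j^{+}\big) - \Gamma^{0}_j.
\]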
We visualize the scalability gaps and performance offsets of failed operations by means of histograms. Figure
12 shows the scalability gap under two alternative deployment architectures
\(\alpha\) and
\(\beta\) . Horizontal lines represent the loads in
\(\Lambda\). A bar lying on a horizontal line at
l —say,
\(l=250\) — represents the scalability gap of an operation (
\(o_5\) ) that has the GSL value equal to the maximum
\(\lambda\) smaller than
l (
\(\hat{\lambda }_5=200\)). Thus, higher bars correspond to failing operations with a higher negative impact on SUT scalability. We use the same representation to compare the performance offsets: higher bars correspond to failing operations with a higher impact on the performance degradation of the SUT.
4.5 Analysis Workflow
The analysis workflow is tailored to understand (Figure
13(a)) and then improve (Figure
13(b)) the scalability of the system. The analysis starts from the following three goals:
•
G1: Understand the scalability of the SUT with a given architecture \(\alpha\) .
•
G2: Compare the scalability of the SUT with alternative architectures \(\alpha\) and \(\beta\) .
•
G3: Improve the scalability of the SUT.
The decision process is guided by questions derived from the goals. Our visualization and measurement framework is then used to answer the questions listed in Figure
13.
Understanding the target deployment architectures. Considering
G1, the workflow starts by visualizing the
\(\mathcal {DM}\) polygon of a deployment architecture
\(\alpha\) to determine whether the architecture satisfies a scalability requirement. Specifically, the architect would like to answer
Q1 and
Q2 for
\(\alpha\) . If the corresponding
\(\mathcal {DM}\) polygon is very close to the outer one, the architect may consider
\(\alpha\) as optimal. If so, the analysis ends with the decision:
no change in the architecture is needed and \(\alpha\) is recommended. In case the architect observes scalability issues for some loads (e.g.,
\(\lambda = 150\) and architecture
\(\alpha\) in Figure
10), the architect can compute the total
\(\mathcal {DM}\) value and answer
Q3. If the total
\(\mathcal {DM}\) is far from optimal, the architect can then consider
G2 and compare the polygon
\(\alpha\) and the total
\(\mathcal {DM}\) with alternative deployment architectures to answer
Q4. For instance,
\(\beta\) and
\(\delta\) are two alternative deployment architectures in Figure
10. Compared to
\(\alpha\) and
\(\beta\) ,
\(\delta\) has no scalability issues up to
\(\lambda = 150\) , but the overall
\(\mathcal {DM}\) is lower. With this information, the architect can understand that
\(\delta\) represents a worse choice.
If the
\(\mathcal {DM}\) polygons and the total
\(\mathcal {DM}\) value are not sufficient to identify the deployment architecture closer to optimal, the architect uses the scalability footprints and applies a pairwise comparison of the alternative deployment architectures considering the effect size and magnitude. As an example,
\(\alpha\) and
\(\beta\) in Figure
10 have the same total
\(\mathcal {DM}\) value but different
\(\mathcal {DM}\) polygons that may be equally good. In this case, differences may emerge analyzing the scalability footprints. In case multiple architectures still exhibit similar scalability, the architect can refine and improve the deployment configuration (e.g., allocation of resources to components or operations) of one or more architectures and then apply our approach to compare them, as follows.
It is worth noting that both
\(\alpha\) and
\(\beta\) are initial assignments that are then improved over one or more iterative steps. These assignments are usually based on domain knowledge (coming from the analysis of historical data [
27]) or can be produced by using sampling techniques [
37] driving the selection of the initial and subsequent candidate sets of architectures/configurations.
Improving the target deployment architectures. Starting from
G3, the workflow considers the scalability footprints of one or more deployment architectures to answer
Q5 and
Q6. The architect compares the effect size and magnitude of the scalability footprints of alternative deployment architectures and selects the components and their operations that require further investigation. The scalability footprints in the radar plot show the failing operations at each load level and the overall differences for components, as for components
\(\mathtt {A}\) and
\(\mathtt {B}\) in Figure
11. For instance, the footprint of
\(\beta\) shows worse scalability for component
\(\mathtt {A}\) and better scalability for
\(\mathtt {B}\) . The effect size and magnitude calculated at component level may further indicate that
\(\mathtt {B}\) is largely more scalable with
\(\beta\) and
\(\mathtt {A}\) is less scalable with
\(\beta\) but by a negligible difference with
\(\alpha\) . In this case, the architect can choose
\(\beta\) and change the deployment configuration for component
\(\mathtt {A}\) to improve the overall scalability of the SUT.
By following the workflow, the architect analyzes each operation by means of the scalability gap and performance offset. The main objective here is to identify failing operations whose impact is higher on performance and scalability. Operations associated with the largest impact yield a severe negative effect that can be mitigated by using suitable architectural choices and resource allocation (within the limits of the physical constraints of the underlying hardware). As an example, by inspecting Figure
12, we can observe that with
\(\beta\) , the operations
\(o_1\) and
\(o_2\) exhibit a scalability gap at load 150, while
\(o_3\) yields a scalability gap at load 200. With this information, the architect can allocate more resources to
\(\mathtt {B}\) (operations
\(o_1,\) \(o_2\) ,
\(o_3\) ). Considering instead
\(\alpha\) , the operation
\(o_5\) fails at load 250 with a large scalability gap, whereas all other operations fail at load 200 but with a smaller gap. This means that the operation
\(o_5\) of
\(\mathtt {B}\) is the most critical one since its invocation frequency is higher compared with the other operations.