4.3.1 Steady State Analysis Experiment Results.
We perform a steady state analysis experiment on each selected Ethereum client, per Section 4.2.1. Every client is observed for two monitoring epochs of 5 hours each, amounting to 10 hours in total. Within each epoch, the metrics of interest are recorded and aggregated over 15-second monitoring intervals.
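As an illustration of this monitoring protocol, the following sketch shows one way the per-interval aggregation could be implemented; the function and its mean-based aggregation are simplifying assumptions for illustration, not ChaosETH's actual code.

```python
from collections import defaultdict

INTERVAL_SECONDS = 15  # length of one monitoring interval

def aggregate(samples):
    """samples: iterable of (unix_timestamp, value) pairs for one metric.

    Groups raw samples into 15-second buckets and returns one aggregated
    data point (here: the mean, as an illustrative choice) per interval.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts) // INTERVAL_SECONDS].append(value)
    return [sum(values) / len(values) for _, values in sorted(buckets.items())]
```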
Figure 3 depicts the distributions of metrics in GoEthereum and Nethermind. The GoEthereum client exposes 400 metrics in total. ChaosETH analyzes all of them to identify metrics that are suitable for chaos engineering experiments. Because not all of the client's features are activated under the recommended configuration, 164 of these 400 metrics are inactive, meaning that their values never change or cannot be queried. Comparing the distributions of the active metric values across the two monitoring epochs shows that 44 metrics are statistically stable. As mentioned in Section 3.4.1, the steady state analyzer uses the Mann-Whitney U test for distribution comparison. In all of the experiments, we use a significance level of 0.01: when the obtained p-value exceeds this threshold (e.g., a p-value of 0.03), the null hypothesis that the two samples come from the same distribution is not rejected, and the metric is considered stable. In the case of Nethermind, 231 metrics are analyzed by the steady state analyzer, of which 115 are inactive during the experiment. Among the 116 active metrics, 55 are statistically stable and can be used for further experiments.
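For illustration, this stability criterion can be expressed with an off-the-shelf implementation of the test. The sketch below uses SciPy's mannwhitneyu and is an assumed formulation of the analyzer's check, not its actual code.

```python
from scipy.stats import mannwhitneyu

ALPHA = 0.01  # significance level used in all experiments

def is_stable(epoch_a, epoch_b):
    """epoch_a, epoch_b: per-interval values of one metric in the two epochs."""
    _, p_value = mannwhitneyu(epoch_a, epoch_b, alternative="two-sided")
    # p > ALPHA: the null hypothesis that both samples come from the same
    # distribution is not rejected, so the metric is considered stable.
    return p_value > ALPHA
```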
Table 1 displays the evolution of metric samples during the steady state experiment. The first half of each evolution chart (in blue) is based on the data gathered during the first monitoring epoch; the second half (in red) is based on the data of the second monitoring epoch. The last two columns indicate the p-value obtained by applying the Mann-Whitney U test to the two distributions, and the result of the test. For example, the first row in Table 1 shows that the number of account flush operations made by the GoEthereum client regularly spikes during the two monitoring epochs.
An example where the null hypothesis is rejected at significance level 0.01 is the metric json.rpc.requests(count/s) in the Nethermind client. In Table 1, the line chart in the second-to-last row visually confirms that this metric does not evolve in the same way during the two monitoring epochs. At the 0.01 significance level, this metric is not stable enough to describe a client's steady state and is thus excluded from further experiments.
This experiment shows that not all the monitoring metrics provided by an Ethereum client are suitable to describe the client's steady state in a statistically valid manner. Since the experiments are done in production, several factors can affect a metric's stability. First, the node itself is not always stable: other applications running on the node may compete for its resources. Second, the network may not be stable: the node may occasionally encounter network scans or attacks [16]. Last, the behavior of peers varies: when the node randomly connects to new peers with different characteristics, a metric may be influenced.
4.3.2 Chaos Engineering Experiment Results.
From the experiment for RQ1, we know that GoEthereum invokes 10 different types of system calls, accumulating more than 288 million invocations in a 10-hour production run (two monitoring epochs). Interestingly, none of the types of system call invocations has a 100% success rate. We perform chaos engineering by increasing the error rates of these system calls in production. The error rate amplification approach described in Section 3.4.2 produces 15 and 12 realistic error models for the GoEthereum client and the Nethermind client, respectively. For each error model, ChaosETH conducts exactly one chaos engineering experiment.
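To illustrate the idea, the sketch below derives error models from the error rate observed in production; the amplification factors and the cap at 1 are illustrative assumptions, and the exact procedure is the one defined in Section 3.4.2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ErrorModel:
    syscall: str       # e.g., "accept4"
    error_code: str    # e.g., "EAGAIN"
    error_rate: float  # probability of returning the error code

def amplify(syscall, error_code, observed_errors, invocations,
            factors=(10, 100)):
    """Scale the natural error rate up by fixed factors, capping at 1.

    The natural rate comes from production monitoring (errors / invocations),
    which keeps the resulting error models realistic.
    """
    natural_rate = observed_errors / invocations
    return [ErrorModel(syscall, error_code, min(1.0, natural_rate * factor))
            for factor in factors]
```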
Table 2 describes the error models together with the chaos engineering experiments on the selected clients. Every row presents one error injection model, including the target system call (column Syscall), the error code to be injected, and the error rate. The last five columns give the corresponding experiment results, including the total number of injected errors, the number of evaluated metrics, and whether each of the three hypotheses (\(H_N\), \(H_O\), \(H_R\)) is verified or falsified with respect to a metric. The metrics that fail the pre-check phase are excluded from the other phases, since ChaosETH considers them not stable enough for behavior comparison. When the client does not invoke a given type of system call during the experiment, ChaosETH does not inject any error related to that system call, and the corresponding row is omitted from the table.
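The decision procedure behind these columns can be summarized as follows. This is a simplified sketch with hypothetical names, where the deviates predicate stands for a steady-state comparison such as the Mann-Whitney U criterion above.

```python
def verify_hypotheses(client_alive, injection_metrics, validation_metrics,
                      deviates):
    """injection_metrics / validation_metrics: metric name -> samples
    recorded during the error injection phase / validation phase."""
    # H_N: the client survives the injected errors (no crash).
    if not client_alive:
        return {"H_N": False, "H_O": None, "H_R": None}  # "-" in Table 2
    # H_O: at least one pre-checked metric visibly deviates during injection.
    deviated = {m for m, s in injection_metrics.items() if deviates(s)}
    if not deviated:
        return {"H_N": True, "H_O": False, "H_R": None}  # H_R is skipped
    # H_R: every deviated metric returns to its steady state afterwards.
    recovered = {m for m in deviated if not deviates(validation_metrics[m])}
    return {"H_N": True, "H_O": True, "H_R": recovered == deviated}
```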
For the GoEthereum client, ChaosETH conducts 12 chaos engineering experiments. The results show that 5 out of 12 error models crash the GoEthereum client (the rows whose \(H_N\) column is marked with “X”). Of the other 7 error models, 6 have a visible effect on the monitoring metrics (the rows whose \(H_O\) column contains a non-zero value). For example, when ChaosETH uses the error model (accept4, EAGAIN, 0.6), 24 metrics are stable during the pre-check phase. During the error injection phase, 18 of these metrics are observed to deviate from their normal behavior. After the error injection stops, 16 of these 18 metrics return to their normal state after the recovery phase. This confirms that the GoEthereum client is resilient to EAGAIN errors in accept4 with respect to these 16 metrics.
Regarding the Nethermind client, there are 10 chaos engineering experiments in total (second half of Table 2). The results show that two error models, (futex, EAGAIN, 0.05) and (futex, ETIMEDOUT, 0.05), lead the Nethermind client to a crash. Seven error models cause a visible effect on at least one metric during the error injection phase. The error model (recvfrom, EAGAIN, 0.549) does not have a visible impact on any of the 48 metrics that pass the pre-check. In this case, ChaosETH does not check the \(H_R\) hypothesis, because no metric deviates from its steady state even during the error injection phase.
This experiment has four main outcome categories, each with a different meaning for Ethereum developers.
Crash ( \(H_N\) =X). The client directly crashes because of the injected errors. This is a severe case: it means that an Ethereum node disappears from the distributed consensus and validation process. As the client crashes, the hypotheses \(H_O\) and \(H_R\) cannot be tested and are marked as “-” in Table 2. For example, ChaosETH detects that the GoEthereum client directly crashes when an EAGAIN error code is injected into the system call write. Since the error code EAGAIN in Linux means that the target resource is temporarily unavailable, crashing is an over-reaction; the client should consider implementing a classical retry mechanism instead of crashing directly.
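For illustration, such a retry mechanism could look as follows. This is a minimal sketch in Python (GoEthereum itself is written in Go), and the retry count and backoff parameters are illustrative.

```python
import errno
import os
import time

def write_with_retry(fd, data, retries=5, backoff=0.05):
    """Retry a write when it fails with EAGAIN, instead of treating the
    temporarily unavailable resource as a fatal error."""
    for attempt in range(retries):
        try:
            return os.write(fd, data)
        except OSError as e:
            if e.errno != errno.EAGAIN:
                raise  # only EAGAIN signals a transient condition here
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise TimeoutError("write still failing with EAGAIN after retries")
```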
Invisible effect ( \(H_N\) = \(\checkmark\) and \(H_O\) =0). In some cases, no visible effect is detected during the error injection phase. For example, the Nethermind chaos experiment using error model (recvfrom, EAGAIN, 0.549) reveals such a situation. In this experiment, ChaosETH injects 51,715 errors into invocations of the system call recvfrom. During the error injection phase, none of the 48 metrics shows abnormal behavior. This indicates that the Nethermind client seems to function normally when a system call invocation to recvfrom returns an EAGAIN error code, which can potentially signify resilience. However, we cannot exclude that the client state is corrupted in an invisible manner, because we do not have a provably perfect steady state oracle. Since ChaosETH does not capture anything abnormal during the error injection phase, the verification of hypothesis \(H_R\) is skipped. Overall, the presence of such invisible effect cases is good with respect to consistency: without steady state pre-checking and observability hypothesis checking, developers might falsely believe that the client state is valid according to the monitored metrics.
Long-term effect ( \(H_N\) = \(\checkmark\) , \(H_O\) = \(\checkmark\) , and \(H_R\) =X). For some of the error models, the client under experiment does not crash; however, some metrics deviate from their steady state during the error injection phase and do not recover within the given recovery phase. For instance, the experiment with error model (accept4, EAGAIN, 1) on the Nethermind client belongs to this category. During the error injection phase, the metrics eth66get_block_headers_received/s, local_receive_message_timeout_disconnects/s, process_private_memory/s, and process_virtual_memory/s deviate from their normal behavior. However, after the recovery phase, only the metrics process_private_memory/s and process_virtual_memory/s recover to the steady state. The other two metrics stay abnormal during the validation phase. This means either that the client needs more time to recover from the injected errors or that the injected errors have led the client into a stalled or corrupted state. Overall, such cases show that ChaosETH gives Ethereum developers insights about the timespan of recovery.
Resilient case ( \(H_N\) = \(\checkmark\) , \(H_O\) = \(\checkmark\) , and \(H_R\) = \(\checkmark\) ). Certain error models do not crash the client and, moreover, yield visible evidence of resilience: after the error injection stops, the monitoring metrics recover to their steady state. This indicates that the target client is equipped with an effective, graceful error-handling mechanism that brings the client back to normal after errors. For example, during the chaos engineering experiment using error model (connect, EINPROGRESS, 0.8) on the GoEthereum client, the injected errors do not crash the client, so the \(H_N\) hypothesis holds. During the error injection phase, the metric geth.txpool.slots.gauge/s no longer matches the steady state. When the error injection stops, the client's behavior related to the transaction pool slots is restored during the recovery phase. During the validation phase, ChaosETH checks the metric again and confirms that geth.txpool.slots.gauge/s has recovered to its steady state. By inspecting the client logs, we confirm that the client has indeed resumed downloading, sharing, and verifying Ethereum blocks.
4.3.3 Benchmarking Ethereum Clients.
We cannot strictly compare the considered clients based on the results of RQ2, because the error models differ across clients. To overcome this, we have introduced in Section 4.2.3 the idea of testing the clients under a meaningful common error model.
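One plausible way to derive a common error model is to intersect the clients' realistic error models on the (system call, error code) pair, as sketched below. This reuses the hypothetical ErrorModel type from the amplification sketch above and is an assumption made for illustration; the actual selection procedure is specified in Section 4.2.3.

```python
def common_error_models(models_a, models_b):
    """Keep (syscall, error code) pairs present in both clients' error
    models, with the lower of the two error rates so that the rate stays
    realistic for both clients."""
    index_b = {(m.syscall, m.error_code): m for m in models_b}
    common = []
    for m in models_a:
        match = index_b.get((m.syscall, m.error_code))
        if match is not None:
            common.append(ErrorModel(m.syscall, m.error_code,
                                     min(m.error_rate, match.error_rate)))
    return common
```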
ChaosETH identifies four common error models for the selected clients. The results of this resilience benchmarking experiment are summarized in Table 3. Each row in the table presents the verification of the three hypotheses for both clients, according to a set of client metrics. Only the metrics that pass the pre-check phase are selected for hypothesis verification. This table is interesting in the following three aspects:
First, regarding the \(H_N\) hypothesis (absence of crash), the results show that both the GoEthereum client and the Nethermind client crash under the same specific error models: both clients are crashed by futex system call invocation errors with codes EAGAIN and ETIMEDOUT. Overall, neither client is strictly more robust than the other with respect to crashing.
Second, focusing on the \(H_O\) hypothesis (observability), when the error model (accept4, EAGAIN, 1) is used for experiments, both the GoEthereum client and the Nethermind client exhibit abnormal behavior in their metrics. For the GoEthereum client, 9 metrics become abnormal during the error injection phase; for the Nethermind client, 6 metrics deviate from the steady state. This is evidence that the metrics capture the client's internal state and that not all clients have the same observability.
Third, considering the \(H_R\) hypothesis, ChaosETH successfully identifies resilient cases for the two Ethereum clients. ChaosETH shows that the GoEthereum client is resilient to error model (accept4, EAGAIN, 1) with respect to metrics geth.p2p.peers.gauge/s and geth.txpool.reheap.timer/s. For the same error model, the Nethermind client is resilient with respect to metrics nethermind_mod_exp_precompile/s, nethermind_state_db_reads/s, and nethermind_useless_peer_disconnects/s. As opposed to toy examples with perfect oracles, assessing the behavior of real-world software through monitoring yields multiple shades of resilience.