1 Introduction
Network services are specific implementations of all kinds of network protocols, which define how different entities communicate in the network. However, they introduce more threats to computer systems than local applications since it is much easier for attackers to exploit vulnerabilities in network services to launch remote attacks than in local applications.
For example, the Heartbleed [2] vulnerability in OpenSSL [9], one of the most famous implementations of the Transport Layer Security (TLS) protocol [1], could be used by malicious attackers to leak confidential data from the memory of remote devices. In addition, a vulnerability in Microsoft’s implementation of the Server Message Block (SMB) protocol [6] led to the worldwide WannaCry ransomware cyberattack [8]. Since OpenSSL is a widely used library for TLS-encrypted communication and the vulnerable SMB protocol runs on countless Microsoft Windows devices, such vulnerabilities have a vast range of influence. Therefore, vulnerabilities in network services are significant threats to the entire cyberspace, and it is vital to discover vulnerabilities in such targets.
Fuzzing is one of the most popular vulnerability discovery techniques. It has been widely used and studied in both academia and industry due to its ease of use, high efficiency, and low false-positive rate. In the early days, fuzzers for network services mainly worked in a black-box style [37, 41], blindly and continuously generating messages and sending them to the service under test (SUT) located at a given IP address and port. Although black-box fuzzing is easy to launch, it lacks internal feedback from the SUT during fuzzing, which limits its code coverage and vulnerability discovery effectiveness. In recent years, grey-box fuzzing solutions combining genetic algorithms and code coverage feedback have become increasingly popular [3, 26, 47]. For instance, the representative fuzzer AFL [47] has dramatically improved the code coverage and overall fuzzing effectiveness on most command-line applications, such as readelf [5].
However, traditional grey-box fuzzing approaches cannot be applied directly to network services due to two main challenges: (1) Service state representation. Most existing grey-box fuzzers are designed for stateless local applications. Protocol-based network services, on the one hand, respond differently to the same input message depending on the current session state; on the other hand, most of their bugs are stateful and can be triggered only by a specific sequence of messages. Hence, grey-box fuzzing solutions that are unaware of service states cannot acquire complete feedback, which misleads the evolutionary direction of genetic algorithms. (2) Testing efficiency. Network service programs are typically designed with a client/server (C/S) architecture. A protocol action usually involves multiple network I/O interactions, which means that an effective fuzzer needs to exchange multiple messages with the target service. Hence, fuzzers should send each message as soon as the target service is ready for it, to save testing time and improve testing throughput.
Notably, some recent research works have introduced grey-box fuzzing for network services. AFLnet [39] first proposed a grey-box fuzzing solution targeted at stateful protocol implementations. It extracted the response code from the response messages to represent the service states, then used the response code sequence to infer a state model of the protocol implementation and further utilized the inferred model to guide the fuzzing process. StateAFL [33] attempted to use programs’ in-memory states to represent the service states, then performed state collection and state model inference by instrumenting the SUT. In each round of network interaction, StateAFL dumped program variables to an analysis queue and performed post-execution analysis to update the state model.
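To make the response-code idea concrete, the following minimal Python sketch extracts three-digit reply codes from an FTP-style response stream and treats consecutive codes as state transitions. The function names, the initial state `0`, and the sample responses are assumptions made for illustration; the sketch does not reproduce AFLnet's actual parser.

```python
import re

# The final line of an FTP-style reply starts with a three-digit code
# followed by a space. Continuation lines ("230-...") are ignored here.
REPLY_CODE = re.compile(r"^(\d{3}) ", re.MULTILINE)


def response_codes(raw: bytes) -> list:
    """Return the reply codes found in a raw response buffer, in order."""
    text = raw.decode("ascii", errors="replace")
    return [int(code) for code in REPLY_CODE.findall(text)]


def state_transitions(codes: list) -> set:
    """Treat each code as a state; consecutive codes form one transition.

    State 0 is a hypothetical initial state before any response arrives.
    """
    path = [0] + codes
    return set(zip(path, path[1:]))


if __name__ == "__main__":
    raw = b"220 Service ready\r\n331 Need password\r\n230 Logged in\r\n"
    codes = response_codes(raw)
    print(codes)
    print(sorted(state_transitions(codes)))
```

A protocol whose responses carry no such code gives this kind of extractor nothing to work with, which is the limitation discussed next.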
However, existing works still suffer from the aforementioned two challenges. As for the state representation challenge, the response code scheme proposed by AFLnet assumes that the protocol embeds a special code in response messages, which is not always the case. In addition, as pointed out in StateAFL, the indication of the network service state provided by the response code is not robust. To overcome the limitation of the response code–based method, StateAFL used the program’s in-memory state to represent the service state. However, due to the complexity of the in-memory state, it is unrealistic to map such contents to the service state directly. Hence, StateAFL used locality-sensitive hashing to approximate the state mapping, which makes the state representation less accurate. SGFuzz [18] proposed to use state variables to represent the state of network services. It automatically recognized so-called state variables and used them to build a state transition tree (STT), which was considered to represent the explored state space of the service program. However, SGFuzz may introduce false positives in the state representation since it directly uses variables with enumeration types as state variables without filtering. Regarding the testing efficiency challenge, since there is no clear signal indicating when the SUT has finished processing a message, both AFLnet and StateAFL use a fixed timer to decide when the fuzzer sends the next message to the SUT. However, the time window of the timer is either too short (in which case the SUT will miss messages sent by the fuzzer) or too long (in which case the fuzzer will waste too much time waiting). StateAFL also requires post-execution analysis for state sequence collection and state model inference, introducing additional runtime overhead and further lowering the testing throughput.
In this article, we propose NSFuzz, an efficient state-aware grey-box fuzzing solution for network services. We have studied many representative network service programs to understand their typical implementations. We found that such programs always use program variables to describe the service states directly. We also noticed that the network services always come with a network event loop, which is responsible for continuously processing incoming messages. Hence, to address the first challenge, we propose a lightweight variable-based state representation scheme to represent the network service state. We refer to the variable denoting the service state as the state variable. Since state variables contain the inherent semantic information of the service program, the variable-based state representation scheme could represent the service state with higher accuracy and interpretability. As for the second challenge, the intrinsic event loop of network services could yield appropriate signal feedback, enabling an efficient I/O synchronization between network services and the fuzzer. We refer to the location in the event loop where the signal feedback can be raised as the I/O synchronization point since it is responsible for synchronizing the I/O interaction with the fuzzer. Signal-based synchronization could facilitate the fuzzer sending new messages to reduce waiting time overheads. This mechanism could also enable the fuzzer to collect state transition sequences and infer the state model actively, thereby avoiding heavy post-execution analysis used by StateAFL.
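The synchronization idea can be sketched with a toy event loop in Python. In this sketch, which is an illustration under stated assumptions rather than NSFuzz's implementation, a `threading.Event` stands in for the signal raised at the I/O synchronization point, a thread stands in for the service process, and the `None` sentinel and all function names are hypothetical.

```python
import queue
import threading
import time


def service(inbox, ready, handled):
    """Toy service whose event loop handles one message per iteration."""
    while True:
        ready.set()            # I/O synchronization point at the loop entry
        message = inbox.get()  # block until the next request arrives
        if message is None:    # sentinel (an assumption) that stops the loop
            return
        time.sleep(0.001)      # stand-in for real message processing
        handled.append(message)


def fuzz_with_sync(messages):
    """Send each message as soon as the service reports it is ready."""
    inbox, ready, handled = queue.Queue(), threading.Event(), []
    worker = threading.Thread(target=service, args=(inbox, ready, handled))
    worker.start()
    for message in messages:
        ready.wait()           # no fixed timer: wait for the sync point
        ready.clear()
        inbox.put(message)
    ready.wait()               # the last message has been fully handled
    inbox.put(None)
    worker.join()
    return handled


if __name__ == "__main__":
    print(fuzz_with_sync(["USER test", "PASS test", "QUIT"]))
```

Because the sender waits on the readiness notification instead of sleeping for a fixed interval, no message is sent before the loop is ready to receive it, and no time is spent waiting after the loop becomes ready.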
We use both static analysis and annotation APIs to identify I/O synchronization points and state variables from the source code of the SUT. Then, we conduct lightweight compile-time instrumentation to enable the service with signal-based fast I/O synchronization and variable-based service state tracing capability. Finally, we use the instrumented target service to carry out efficient state-aware network service fuzzing. Currently, we have implemented a prototype of NSFuzz.
The evaluation results showed that NSFuzz could infer a more accurate state model during the fuzzing process and has a significantly higher fuzzing throughput than AFLnet and StateAFL. In addition, NSFuzz could reach higher code coverage and trigger more crashes in less time.
In summary, this article makes the following contributions:
• We propose a variable-based state representation scheme to represent the network service state and infer more accurate state models during fuzzing. We design an efficient I/O synchronization mechanism based on the network event loop of the SUT, which enables a much higher throughput for network service fuzzing.
• We present NSFuzz, an efficient and state-aware network service fuzzing solution. We use static analysis and annotation APIs to identify the synchronization points and state variables within the SUT, then equip it with signal feedback and state feedback capabilities through compile-time instrumentation for efficient state-aware fuzzing.
• We have implemented a prototype of NSFuzz and evaluated it on several real-world network services provided by ProFuzzBench [34]. The evaluation results showed that NSFuzz could infer a more accurate state model and achieve better fuzzing throughput than the state-of-the-art network fuzzers. As a result, NSFuzz outperforms the other solutions on most targets, reaching higher code coverage and triggering more crashes in less time. In addition, NSFuzz found 8 zero-day vulnerabilities in the latest version of three popular network services, which shows its ability to find real-world vulnerabilities.
5 Evaluation
We have built a prototype of NSFuzz. The implementation of NSFuzz is about 4.5 k lines of C/C++ code and about 100 lines of Python script. In detail, we implement the static analyzer, annotation parsing engine, and compile-time instrumentation based on the LLVM [11] framework, and the fuzzer engine is implemented based on AFLnet (revision 0f51f9e from January 2021). To evaluate NSFuzz, we have performed several experiments to answer the following research questions:
• RQ1, Fuzzing efficiency of NSFuzz: Could NSFuzz achieve higher fuzzing efficiency through its efficient I/O synchronization mechanism?
• RQ2, Accuracy of the state model inferred by NSFuzz: Could NSFuzz infer a more accurate state model through its variable-based state representation scheme?
• RQ3, Overall effectiveness of NSFuzz’s efficient state-aware fuzzing: Could NSFuzz achieve better overall fuzzing results than other existing approaches?
• RQ4, State space exploration ability of NSFuzz: Could NSFuzz achieve higher state space coverage than other approaches?
• RQ5, Real-world bug finding ability of NSFuzz: Could NSFuzz continually find bugs in real-world protocol services?
5.1 Experiment Setup
We selected fuzzing targets from the network protocol fuzzing benchmark ProFuzzBench [34] to evaluate NSFuzz. ProFuzzBench is a benchmark for stateful protocol fuzzing. It contains 13 network service implementations of 10 network protocols (including FTP, SMTP, SIP, etc.). It covers various network protocols based on TCP and UDP, all implemented in C/C++. In addition, ProFuzzBench applies necessary patches (such as derandomization) to these network services to ensure the reliability of the fuzzing evaluation. To make the evaluation more thorough, we chose all 13 network services in ProFuzzBench as the evaluation targets. Table 1 shows the information of the target services.
For comparison, we selected two state-of-the-art grey-box network fuzzers, AFLnet (revision 0f51f9ed from January 2021) and StateAFL (revision c1b2aee from October 2021), and another network-enabled version of AFL, AFLnwe [38] (revision 6ba3a25 from March 2021), as baseline fuzzers to evaluate NSFuzz. AFLnet uses message response codes to represent the service state and infer the state model, then conducts state-guided fuzzing based on this model during the fuzzing loop. StateAFL collects the changed variables during network I/O rounds. Then, it extracts state variables and infers the state model through post-execution analysis. AFLnwe is another network service fuzzer proposed by the author of AFLnet. It only changes the file I/O interface of the original AFL to socket-based network I/O to achieve network service fuzzing. To evaluate the effect of each component of NSFuzz during the fuzzing process, we also built a fuzzer, NSFuzz-V, which enables only the variable-based state representation scheme without the I/O synchronization mechanism. We then used it for an ablation study in the overall effectiveness evaluation (RQ3) and the state space coverage evaluation (RQ4).
All experiments ran on the same testing machine during this evaluation. This testing machine contains 128 Intel(R) Xeon(R) Platinum 8358 CPUs and 384 GB of memory with an SSD. We set up each target service with each fuzzer in a separate Docker container and used the same computing resources for every configuration. We fuzzed each target service with each fuzzer for 24 hours and repeated each experiment 4 times, for a total of 6240 CPU hours of fuzzing evaluation.
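The stated total is consistent with counting five fuzzers (AFLnet, StateAFL, AFLnwe, NSFuzz, and NSFuzz-V) on all 13 targets, under the assumption that each fuzzing instance occupies one CPU core:

\[
13 \ \text{targets} \times 5 \ \text{fuzzers} \times 4 \ \text{runs} \times 24 \ \text{hours} = 6240 \ \text{CPU hours}.
\]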
5.2 Fuzzing Efficiency Evaluation (RQ1)
5.2.1 Static Analysis and Annotation.
To achieve efficient I/O interaction during network service fuzzing, NSFuzz first uses static analysis and the annotation API to identify the I/O synchronization points in the target services. Table 2 shows the static analysis and annotation results of NSFuzz for the 13 target services. As can be seen from the table, the static analyzer could identify the network event loop in 9 target services. For some of these targets, the event loop could be identified automatically, and the loop structure entry can be used directly as the I/O synchronization point. However, as mentioned earlier, the event loop in some targets is not implemented as a single loop structure but rather as a multi-level loop or on top of an event-driven framework. In these cases, multiple I/O synchronization points may be required to achieve synchronization for various request messages. In addition, it may be difficult to identify the event loop in some C++ target services due to the indirect calls generated by virtual functions. Therefore, for the targets whose event loop cannot be fully identified by static analysis, we use the annotation API provided by NSFuzz to calibrate or add I/O synchronization points manually. It should be noted that the I/O synchronization point annotation does not need much manual work, and it is a one-time effort for each target service. For people previously unfamiliar with these target services, the time required to annotate the I/O synchronization points of a target ranges from several minutes to 2 hours.
5.2.2 Fuzzing Throughput.
After identifying the I/O synchronization points and instrumenting the target services, NSFuzz performed efficient fuzzing on network services through fast I/O synchronization. Table 3 shows the average fuzzing throughput of NSFuzz and the other fuzzers over 4 runs on the different target services. The fuzzing throughput is the number of test cases executed per second. A higher fuzzing throughput means that more test cases are executed in the same amount of time; thus, the overall testing efficiency is also higher.
As shown in Table 3, the fuzzing throughput of NSFuzz is significantly higher than that of AFLnet and the other fuzzers. NSFuzz improves the throughput by 1.8\(\times\) to more than 200\(\times\) on different target services, with an average improvement of 24\(\times\). As expected, the I/O synchronization scheme introduced by NSFuzz significantly improved the fuzzing efficiency. In addition, we can see from the result that StateAFL has the lowest throughput among the four fuzzers. Since StateAFL needs to collect variable values during the fuzzing process for post-execution analysis to perform state model inference, this additional overhead lowers its fuzzing throughput compared with AFLnet. It is also worth noting that AFLnwe, the only fuzzer without state awareness, also improves fuzzing throughput compared to AFLnet. This is mainly because AFLnwe is just a network I/O–enabled version of AFL, which sends all data to the SUT at once, thus saving the waiting delay and avoiding state-related overhead across multiple network interactions. However, even in this case, its fuzzing throughput is still lower than that of NSFuzz, which shows that the efficient synchronization mechanism of NSFuzz based on lightweight instrumentation could significantly improve the fuzzing efficiency.
5.3 State Model Inference Evaluation (RQ2)
5.3.1 Static Analysis and Annotation.
As before, NSFuzz first needs to perform static analysis and use the annotation API to extract the state variables in the target services. Table 4 shows the static analysis and annotation results of NSFuzz for the 13 target services.
As we can see, NSFuzz could extract state variables via automatic static analysis in 9 target services, which cover multiple network protocol types. In addition, the number of extracted state variables and the analysis time are generally positively correlated with the scale of the target service (lines of code, LoC), which is consistent with intuition. It should be noted that the name of an extracted state variable could sometimes directly indicate its function of representing the network service state. For example, one of the state variables extracted from Pure-FTPD [23] by static analysis is loggedin, which is a global variable used to indicate whether the incoming client session has completed the FTP login authorization. Moreover, in Pure-FTPD the message handler could execute different code logic for the same request message according to whether the client session has completed the authorization.
For targets that do not use a single loop structure to implement the network event loop, the static analyzer may fail to extract the state variables because it needs to start the analysis from the entry of the loop. Therefore, we use the state variable annotation API provided by NSFuzz to annotate the state variables within them directly. Even for targets that the static analyzer can handle, and although multiple heuristic rules are applied, its output may still contain false positives, such as some configuration flags and message type variables. Hence, we also use the annotation API to refine the output of the static analysis and annotate only those variables that explicitly represent the service state. Similarly, this lightweight state variable annotation mechanism needs only a one-time effort for each target service. For people unfamiliar with these target services, the average time required to annotate the state variables of a target manually is about 20 minutes, and the number of final annotated variables is usually around 10.
5.3.2 Inferred State Model.
After extracting the state variables and completing the instrumentation, NSFuzz performed state-aware fuzzing on the target network services. Like AFLnet and StateAFL, NSFuzz also infers a state model for the SUT during the fuzzing process. Table 5 shows the average number of vertexes and edges of the state model inferred by these fuzzers for each target service in the 24-hour fuzzing experiment.
To investigate the accuracy of the inferred state model, we take LightFTP [4] as an example for a case study because only one state variable is annotated in LightFTP; thus, the semantic information is clearer. After 24 hours of fuzzing of LightFTP, NSFuzz inferred the same state model in all four runs, with 5 vertexes and 11 edges, as shown in Figure 4. The annotated state variable Access is used to represent the access authority of client sessions. After analyzing the source code manually, we found that Access has 4 constant values to represent the different permissions of the client user (NOT_LOGGED_IN, READONLY, CREATENEW, FULL), and LightFTP conducts different message handling processes according to the client permission. The state model inferred by NSFuzz contains all 4 states plus an additional initial dummy state, which shows that NSFuzz could accurately infer all states during the fuzzing process on LightFTP, thereby establishing a direct mapping between state variables and the inferred state model. In contrast, the state models inferred by AFLnet and StateAFL with the same initial seeds had, respectively, 23 vertexes/220 edges and 51 vertexes/254 edges on average at the end of fuzzing. According to our manual analysis, the state models inferred by AFLnet and StateAFL could not distinguish the different permissions of client users, which may lead to incomplete state guidance. Moreover, it is difficult to relate these models clearly to the behavior of the target service. Therefore, to a certain extent, the state model inferred by NSFuzz is more accurate and interpretable than those of other works.
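How a state model follows from observed state variable values can be sketched in a few lines of Python. The sketch assumes that one value of Access is recorded per message at each synchronization point; the traces below are hypothetical examples rather than measured data, and `INIT` is an assumed name for the initial dummy state.

```python
INIT = "INIT"  # assumed name for the initial dummy state


def infer_state_model(traces):
    """Build (vertexes, edges) from per-session state variable traces.

    Each trace lists the value of the state variable observed after each
    message. Consecutive values form a directed edge, self-loops included.
    """
    vertexes, edges = {INIT}, set()
    for trace in traces:
        path = [INIT, *trace]
        vertexes.update(trace)
        edges.update(zip(path, path[1:]))
    return vertexes, edges


if __name__ == "__main__":
    # Hypothetical traces of the LightFTP variable Access.
    traces = [
        ["NOT_LOGGED_IN", "NOT_LOGGED_IN", "READONLY"],
        ["NOT_LOGGED_IN", "FULL", "FULL"],
        ["NOT_LOGGED_IN", "CREATENEW"],
    ]
    vertexes, edges = infer_state_model(traces)
    print(len(vertexes), len(edges))
```

Every vertex in such a model is a value of a program variable, so it can be read directly against the source code, which is what makes the model interpretable.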
It is worth noting that even among services implementing the same FTP protocol, the inferred state models (numbers of vertexes and edges) differ across implementations. This is because different network service implementations of the same protocol may define their state model with implementation-specific semantics. For example, variables used to mark the service mode may also be treated as state variables in a broad sense, which extends the basic state model. Moreover, since the state variable extraction part of NSFuzz is a module decoupled from the fuzzing loop, users could also choose which state variables to annotate and monitor, which means that NSFuzz can construct state models of different granularities for a network service.
5.4 Overall Effectiveness Evaluation (RQ3)
To evaluate the overall fuzzing effectiveness of NSFuzz, we have counted the average code coverage and crash trigger of NSFuzz and other fuzzers among 4 runs of experiments. Moreover, in order to explore the impact of each part (i.e., the variable-based state representation scheme and the I/O synchronization points-based speed-up scheme) of NSFuzz on the overall effectiveness of network service fuzzing, we also introduced a fuzzer, NSFuzz-V, which enabled only the variable-based state representation scheme, to conduct an ablation experiment.
5.4.1 Code Coverage.
Code coverage is a standard metric for evaluating fuzzers, which indicates how much code in the SUT has been executed during the whole fuzzing process. Usually, the higher the code coverage, the more program vulnerabilities may be triggered. Table 6 shows the average final code branch coverage of the various fuzzers on each target service over 4 instances of 24-hour fuzzing.
Results in Table 6 show that NSFuzz could achieve a higher code branch coverage than AFLnet on all 13 targets, which demonstrates the effectiveness of our proposed methods in improving the code coverage. The results also show that, depending on the target, NSFuzz-V achieves better or slightly lower code coverage than AFLnet, with an average improvement of \(2.11\%\). The results of NSFuzz-V indicate that changing the state representation scheme alone is able to improve the code exploration ability of the fuzzer. However, NSFuzz-V still shows a decline in code coverage on 5 of the targets: Forked-daapd, Bftpd, ProFTPD, OpenSSH, and OpenSSL. Our analysis shows that the throughput of NSFuzz-V drops considerably on these targets. This indicates that the overhead brought by the state variable instrumentation has a negative impact on code exploration. Fortunately, this negative impact is offset by the I/O synchronization mechanism of NSFuzz.
Although AFLnwe has a relatively high fuzzing throughput among these fuzzers (see Section 5.2.2), it could not achieve good code coverage on specific targets such as Exim, LightFTP, and TinyDTLS because it cannot conduct multiple network I/O interactions. In addition, the average number of code branches covered by StateAFL during the 24-hour fuzzing is not significantly different from that of AFLnet, which indicates that although StateAFL proposed a more reasonable state representation scheme, fuzzing speed also has a significant influence on the final code coverage.
Figure 5 shows the growth of the number of code branches explored over time during the 24-hour fuzzing process for the different fuzzers. As can be seen from the figure, NSFuzz could not only cover more code branches on most target services but also explore the branches much faster than the other fuzzers.
However, there are two exceptions in which NSFuzz does not perform as well as AFLnwe: Dcmtk and Live555. Dcmtk is used for image processing and storing, and it has fewer state transitions than other targets. As for Live555, it supports streaming data processing of protocol messages, which is not in line with the I/O model assumed by the other fuzzers. Therefore, even the most straightforward one-time I/O stateless fuzzer, AFLnwe, could outperform the stateful fuzzers on these targets. NSFuzz-V performs better than AFLnet on most targets but is still not as good as AFLnwe on some targets. Apart from Dcmtk and Live555 discussed earlier, the testing throughput is the main factor that limits the performance of NSFuzz-V. However, we can see from the plots for Dcmtk, ProFTPD, Pure-FTPd, OpenSSH, and so on that although NSFuzz-V may lose the advantage at the beginning, it is able to catch up later. This shows the positive effect of the state representation scheme and explains why NSFuzz performs better after combining it with the speed-up solution.
5.4.2 Crash Trigger.
In addition to the code coverage, the triggering of crashes directly reflects the vulnerability discovery capabilities of fuzzers. Table 7 shows the number of crashes and vulnerabilities of the target services triggered by the various fuzzers during the 24-hour fuzzing experiment. We cluster the triggered crashes by the program crash address reported by AddressSanitizer [42] for the target service. Then, we determine the service vulnerabilities by manually analyzing the program crashes. It should be noted that the same vulnerability may cause the program to crash at different locations; thus, the number of triggered crashes can exceed the actual number of vulnerabilities. As shown in this table, NSFuzz always triggers at least as many crashes and vulnerabilities in these target services as the other fuzzers. Compared with the other fuzzers, NSFuzz-V also shows improvement on some targets. Fortunately, all of these vulnerabilities have been fixed by the vendors in their latest version.
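The clustering step can be sketched as follows in Python. The sketch assumes that each crash is stored as the text of its AddressSanitizer report and that the address of the first `#0` stack frame identifies the crash site; the function names and the sample reports are hypothetical and are not taken from the NSFuzz tooling.

```python
import re

# First stack frame of an AddressSanitizer report, for example:
#   "    #0 0x4f5b2a in handler /src/server.c:123:5"
TOP_FRAME = re.compile(r"^\s*#0 (0x[0-9a-fA-F]+)", re.MULTILINE)


def crash_address(report):
    """Return the address of the first #0 frame, or None if absent."""
    match = TOP_FRAME.search(report)
    return match.group(1) if match else None


def cluster_crashes(reports):
    """Group crash names by crash address: {address: [crash name, ...]}."""
    clusters = {}
    for name, report in reports.items():
        clusters.setdefault(crash_address(report), []).append(name)
    return clusters


if __name__ == "__main__":
    reports = {
        "crash-1": "==1==ERROR: AddressSanitizer: heap-buffer-overflow\n"
                   "    #0 0x4f5b2a in handler /src/server.c:123:5\n",
        "crash-2": "==2==ERROR: AddressSanitizer: heap-buffer-overflow\n"
                   "    #0 0x4f5b2a in handler /src/server.c:123:5\n",
        "crash-3": "==3==ERROR: AddressSanitizer: SEGV on unknown address\n"
                   "    #0 0x51c0de in parse /src/parser.c:77:9\n",
    }
    for address, names in cluster_crashes(reports).items():
        print(address, names)
```

Clusters obtained this way are still only an upper bound on the number of distinct vulnerabilities, which is why the manual analysis step is needed.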
We also calculated the average time for the various fuzzers to trigger the first crash and the number of runs, out of the 4 runs of each experiment, in which a crash was triggered. The results are shown in Table 8. As we can see, except for Dcmtk, the average time for NSFuzz to trigger the first crash is significantly lower than that of its competitors. In particular, when fuzzing TinyDTLS [12], NSFuzz could always trigger the program crash in less than 1 s, which to some extent shows the high efficiency of NSFuzz in discovering vulnerabilities. As for Dcmtk, the average time to the first crash of NSFuzz is larger than that of AFLnet and StateAFL. However, NSFuzz can stably trigger crashes in every run, which is not the case for AFLnet and StateAFL.
5.5 State Space Coverage Evaluation (RQ4)
To evaluate the state space exploration ability of NSFuzz, we check the state space coverage of different fuzzers within 24 hours of fuzzing. Since we propose to use state variables to represent the state of the SUT, the value range of the state variables constitutes the state space. Thus, we use the proportion of the explored values of state variables to represent the state space coverage.
By analyzing the source code of different SUTs manually, we found that the value range of some variables is hard to count. For example, some state variables are used as flags to represent n different kinds of information by setting each of the n bits to 0 or 1. Theoretically, the total number of possible values of such variables is \(2^n\) . However, the actual value that can be reached during execution may be far less than the theoretical, which is determined by complex semantic information and difficult to estimate. Hence, we build an approximate state space using the union set of the state variable values that all fuzzers have triggered in all runs during the 24 hours of fuzzing, and the state space coverage refers to the ratio of the number of state variable values that have been triggered during the experiment to the total number of state variable values. For services that contain more than one state variable, we add up the number of state variables.
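The metric defined above can be sketched in Python as follows. The sketch assumes that, for targets with several state variables, the value counts of the individual variables are summed; that reading, the data layout, and the function names are assumptions made for illustration.

```python
def approximate_state_space(observations):
    """Union of observed values per state variable over all fuzzers and runs.

    observations: {fuzzer: [run, ...]}, where run = {variable: set of values}
    """
    space = {}
    for runs in observations.values():
        for run in runs:
            for variable, values in run.items():
                space.setdefault(variable, set()).update(values)
    return space


def state_space_coverage(run, space):
    """Fraction of the approximate state space explored by a single run."""
    total = sum(len(values) for values in space.values())
    explored = sum(
        len(run.get(variable, set()) & values)
        for variable, values in space.items()
    )
    return explored / total


if __name__ == "__main__":
    # Hypothetical observations for one target with two state variables.
    observations = {
        "fuzzer_a": [{"auth": {0, 1}, "mode": {0}}],
        "fuzzer_b": [{"auth": {0, 1, 2}, "mode": {0, 1}}],
    }
    space = approximate_state_space(observations)
    for fuzzer, runs in observations.items():
        print(fuzzer, state_space_coverage(runs[0], space))
```

Since the denominator is built from what the fuzzers jointly reached, the resulting coverage is relative: 100% means a fuzzer reached every value that any fuzzer reached, not every value that is theoretically reachable.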
Figure 6 shows the average state space coverage of the different fuzzers on the different target services after 24 hours of fuzzing, averaged over 4 runs. The average state space coverage of each fuzzer across the targets is shown in the label of the x-axis. As we can see from the result, NSFuzz explores at least as many state values as the other fuzzers in all cases. NSFuzz-V has a slightly lower state space coverage than NSFuzz on average, but it still performs better than the other three fuzzers. AFLnwe has the lowest average state space coverage, which is in line with intuition since AFLnwe is the only non-stateful fuzzer. For more stateful targets, including Forked-daapd, TinyDTLS, ProFTPD, and OpenSSL, the improvement of NSFuzz and NSFuzz-V in state space coverage is more significant. However, for other targets, such as Dnsmasq and Bftpd, all fuzzers achieve 100% coverage; thus, the improvement is not obvious. One reason is that some state variables of enum type have only a few possible values, which are easy to reach, so that all fuzzers could explore the whole space easily.
5.6 Real-World Bugs Finding Evaluation (RQ5)
To evaluate NSFuzz’s ability to discover vulnerabilities in real-world network protocol services, we deployed a long-term fuzzing campaign to find bugs in the latest versions of the 13 protocol services. We ran the fuzzing campaign for about 2 weeks and found several crashes in 2 of the target services. Details are provided in Table 9.
As shown in Table 9, NSFuzz found 8 vulnerabilities in total, 5 of which were found in 3 different versions of TinyDTLS and 3 in Dcmtk.
We reported all 8 vulnerabilities to the developers. The 2 crashes of TinyDTLS (commit fce3372) have been confirmed and fixed. One other crash of TinyDTLS (commit 7068882) has also been confirmed but will not be patched directly since the developers refactored this part of the code. The other 5 crashes were reported shortly before the submission of this article and are still waiting for confirmation. The results show that NSFuzz has the ability to continuously discover zero-day vulnerabilities in real-world network services.