
Generic Multi-modal Representation Learning for Network Traffic Analysis

Luca Gioacchini, Marco Mellia          Idilio Drago                   Zied Ben Houidi, Dario Rossi
Politecnico di Torino, Italy           Università di Torino, Italy    DataCom Lab, Huawei Technologies Co. Ltd, France
first.last@polito.it                   idilio.drago@unito.it          first.mid.last@huawei.com

arXiv:2405.02649v1 [cs.LG] 4 May 2024

Abstract
Network traffic analysis is fundamental for network management, troubleshooting, and security. Tasks
such as traffic classification, anomaly detection, and novelty discovery are essential for extracting
operational information from network data and measurements. We witness the shift from deep packet
inspection and basic machine learning to Deep Learning (DL) approaches where researchers define
and test a custom DL architecture designed for each specific problem. We here advocate the need for
a general DL architecture flexible enough to solve different traffic analysis tasks. We test this idea by
proposing a DL architecture based on generic data adaptation modules, followed by an integration
module that summarises the extracted information into a compact and rich intermediate representation
(i.e., embeddings). The result is a flexible Multi-modal Autoencoder (MAE) pipeline that can solve
different use cases. We demonstrate the architecture with traffic classification (TC) tasks since they
allow us to quantitatively compare results with state-of-the-art solutions. However, we argue that the
MAE architecture is generic and can be used to learn representations useful in multiple scenarios.
On TC, the MAE performs on par or better than alternatives while avoiding cumbersome feature
engineering, thus streamlining the adoption of DL solutions for traffic analysis.

Keywords Network measurements · Representation learning · Embeddings · Multi-modality

1 Introduction

The abundance of methods to collect data and the availability of simple frameworks sparked the adoption of Machine
Learning (ML) and Deep Learning (DL) algorithms to solve network traffic analysis problems. Examples are seen in
traffic classification [36, 34, 18, 4], in network security problems [39, 10, 30], and in the unsupervised exploration of
traffic [38, 12, 22], to cite a few.
In this context, researchers and practitioners have started applying solutions developed by the AI community to solve
various tasks [36, 10], sometimes borrowing Computer Vision (CV) methods [18, 45], sometimes Natural Language
Processing (NLP) methods [38, 14, 8].
However, the application of such algorithms in traffic analysis problems is often done by artificially forcing the input
data and measurements to fit the selected method. Examples include creating an “artificial image” out of univariate time
series to apply a 2D-Convolutional Neural Network (CNN) [18]; or mixing unrelated features in a single image-like
matrix [45]. In addition, the DL models are often complex and need to be tuned (in an end-to-end fashion) for each
input data type and target problem, to learn salient features and solve the specific problem at hand.
This raises the question of whether we can find a simple yet efficient Deep Neural Network (DNN) that is generic
enough to seamlessly encode different traffic measurements and solve multiple traffic analysis problems. The key
idea is to let the general DL architecture produce a compact representation (or embeddings) of the often diverse and
humongous input data. These embeddings could then be employed to solve other specific final problems (or tasks)
without the burdens of building models from scratch starting from the raw features and measurements.
[Figure: three-stage pipeline — Stage 1: Pre-processing (raw data, feature extraction, entities adaptation with offline pre-training); Stage 2: Multi-modal Autoencoder (quantities adaptation, encoder, encoded features, decoder, self-supervised training); Stage 3: Downstream tasks (deep and shallow learners, supervised training).]
Figure 1: Overview of the modular Machine Learning pipeline relying on MAE.

[Figure: four t-SNE panels — (a) Quantities, 5-NN pC = 0.32; (b) Entities, 5-NN pC = 0.82; (c) RawCat, 5-NN pC = 0.32; (d) MAE, 5-NN pC = 0.84.]
Figure 2: t-SNE projection of different representations of the original data. Each dot is a flow. Colours represent the flow class. (a) Using only quantities results in mixing classes in the two large central groups; (b) entities offer better separation of classes, but still mix part of them (surrounding border); (c) a simple concatenation of the two representations introduces noise similar to the quantity-only case; (d) MAE improves the representation, offering the best separation among classes.

We here propose the adoption of a generic DL architecture to seamlessly create intermediate embeddings that succinctly
capture the salient features of raw traffic data. A key question is how to properly learn such embeddings from the
different types of input data typically found in traffic measurements. Network data consists of two modalities (see our
preliminary work at [19]): (i) entities (or categorical features) like IP addresses and port numbers; and (ii) quantities,
or various real-valued measurements like packet and flow size, time, etc. Here we extend this vision. First, we adopt
state-of-the-art approaches [14] to learn representations from sequences of entities, while we let general DL architectures
handle packet payload and generic traffic statistics. We then use a Multi-modal Autoencoder (MAE) to integrate and
summarise each representation into a common and compact embedding space. Notice that the MAE is trained using a
self-supervised task and thus enables the generation of the embeddings independently of the final downstream task.
In Figure 1 we provide an overview of the full pipeline we adopt. From the left, the monitoring platforms offer raw data composed of quantities and entities. We first pre-process the entities, mapping them into a latent space to extract meaningful features in a self-supervised way. We feed the resulting entity embeddings and the original quantities to a multi-modal autoencoder to reduce the dimensionality of the data and the redundant information the different inputs carry. The MAE includes quantities adaptation modules that project the raw features into a shared latent space together with the entity embeddings. We train the MAE in a self-supervised way to produce a compact representation, i.e. the multi-modal embeddings. This intermediate representation is then used to solve the final downstream tasks. Note that, in contrast with the common end-to-end training used in most state-of-the-art models, we train the models in each stage of our pipeline independently, in a modular way.
In a nutshell, we implement the vision we introduced in [19]: we move forward toward the adoption of a common DL architecture that is useful for traffic analysis. To demonstrate the feasibility of this approach, we focus on Traffic Classification (TC) use cases that we select to objectively compare our solution against state-of-the-art alternatives. Results show that our MAE architecture performs better than or on par with the specialised models.

To give an intuition of how different data representations may be instrumental for, e.g. a classic traffic classification
task (as used in our validation), Figure 2 compares the t-SNE [43] projections using different representations of the
original data. Each dot in the figure is a flow, and each colour is a class (details in Section 4). Figure 2a and Figure 2b
show the representation when considering only either quantities or entities separately. For this use case, entities offer
better separation of classes than quantities. Figure 2c shows what happens when we use a simple concatenation of the
two features as typically done in the literature. Confusion appears and classes get mixed again, as one could expect
given the high dimensionality of the concatenated data. Finally, Figure 2d shows the benefits brought by the MAE. In a
nutshell, the resulting intermediate representation leverages the information of both quantities and entities, producing a
compact representation that already enables a good separation of classes even in such a simplistic t-SNE projection. We
will show later that this representation also paves the road for excellent results in specific downstream tasks with other
ML algorithms.
Summarising our contributions:
1. We present a simple and streamlined DL architecture that one can use to solve generic traffic analysis-related
tasks, avoiding the need to design custom and specific DL architectures for each problem.
2. This architecture supports both entities and quantities and automatically projects the information they carry
into a compact embedding space.
3. We test our approach in three traffic classification use cases to illustrate the goodness of the embeddings,
showing that they carry information that successfully allows one to solve supervised traffic classification tasks.
Finally, we offer to the community all our MAE models, together with all our source code to let other researchers
experiment with different tasks and use cases.1
After presenting related work in Section 2, Section 3 describes the generic building blocks we propose to derive
embeddings of typical entities found in traffic traces, and the architecture of the Multi-modal Autoencoder to merge
various input data into intermediate embeddings. After describing the use cases and datasets used for our validation
experiments in Section 4, we validate the proposal with downstream traffic classification tasks in Section 5 and highlight
the benefit of the multi-modality compared to individual measurements in Section 6. We then discuss the quality of
the multi-modal embeddings in Section 7 and illustrate their applicability in other scenarios (e.g. shallow learners)
in Section 8. Section 9 lists the limitations of the approach and some open questions not addressed in this work. In
Section 10 we draw our final considerations.

2 Related Work
Recent research work focuses on automating traffic analysis through DL models, often borrowing techniques from
diverse fields like NLP and CV. While not comprehensive, we report in Table 1 the most notable papers of the last
seven years. We focus on top networking conferences and journals, listing only the most relevant papers dealing with
supervised traffic classification.
We observe many works that propose custom and specific DL architectures for host and traffic classification, i.e.,
labelling traffic either at the host level (e.g. IP address classification) or flow level (e.g. application identification). They
then apply the classifiers to network intrusion detection, OS identification, etc. – see [36, 10], which present generic
surveys.
We split the normally used input features into quantities and entities. As Table 1 shows, most of the prior work relies only on quantities, ranging from statistics like the flow size, the sizes of the first k packets, and packet inter-arrival times, to generic sequences of such quantities [4, 27, 21, 37, 3].
The architectures used to manage quantities confirm the increasing trend of adopting DL techniques, often adapting
techniques first introduced in research fields different from networking. To extract temporal and spatial correlations
from statistics of packets streams (e.g. minimum, average, maximum of a set of observations), many works adopt
Convolutional Neural Networks (CNN) [18, 45, 27, 36] in a CV style. These works select a CNN architecture as an
end-to-end classifier.
For application payload (L7), Recurrent Neural Networks (RNN), like Gated Recurrent Unit networks (GRU), are nowadays preferred. These architectures are successfully used in NLP for extracting character embeddings [32] from
the words they appear in. The intuition is that the sequence of bytes in packet payload matters, and RNN/GRU can
leverage such information. Similarly, to manage sequences of measurements (e.g. sequence of the length of first k
packets in a flow), Convolutional Neural Networks (and 1D-CNN in particular) are the typical choices.
1 https://github.com/SmartData-Polito/multimodal-ae-for-networking

Table 1: Related studies on DL techniques for automatic TC.
                              Quantities                              Entities           Downstream Classification
                Year   Stats.   L7 Payload   Sequences      IP Addr.   Ports             Hosts   Traffic
Vu et al. [44] 2017 FE+GAN – FE+GAN FE+GAN FE+GAN ✓
Wang et al. [45] 2017 CNN CNN CNN 1HE 1HE ✓
Lotfollahi et al. [27] 2017 CNN/SAE CNN/SAE CNN/SAE – – ✓
Hochst et al. [21] 2017 AE AE – – – ✓
Gonzalez et al. [15] 2017 – – Net2Vec Net2Vec – ✓
Ring et al. [38] 2017 Word2Vec – – Word2Vec Word2Vec
Wang et al. [46] 2018 CNN – CNN+LSTM – – ✓
Rezaei et al. [36] 2018 CNN CNN CNN – – ✓
Aceto et al. [3] 2019 – GRU CNN – – ✓
Aceto et al. [4] 2019 – GRU CNN – – ✓
Cohen et al. [8] 2020 – – – Word2Vec Word2Vec
Holland et al. [17] 2021 1HE – – 1HE 1HE ✓
Shahraki et al. [40] 2021 CNN – – – 1HE+CNN ✓
Aceto et al. [4] 2021 – GRU CNN – – ✓
Gioacchini et al. [12] 2021 – – – Word2Vec – ✓
Horowicz et al. [18] 2022 CNN – – – – ✓
Kallitsis et al. [23] 2022 AE – – – 1HE+AE ✓
Lin et al. [26] 2022 FE – BERT – – ✓
Gioacchini et al. [14] 2023 – – – Word2Vec – ✓
Gioacchini et al. [13] 2023 TGNN – TGNN TGNN TGNN ✓
Li et al. [25] 2023 FE – FE – – ✓
Yang et al. [48] 2023 AE – GRU – – ✓
Zhao et al. [49] 2023 – BERT BERT – – ✓
Huoh et al. [20] 2023 GNN/AE GNN/AE GNN/AE – – ✓
Pang et al. [35] 2023 GNN BERT/GNN – – – ✓
This work FC GRU CNN Word2Vec Word2Vec ✓ ✓

† FC: Fully Connected; GRU: Gated Recurrent Unit; AE: Autoencoder; 1HE: One Hot Encoded;
GAN: Generative Adversarial Network; SAE: Sparse Autoencoder; FE: Feature Engineering;
TGNN: Temporal Graph Neural Network

To learn better representations, some recent works [23, 27, 21, 20] rely on Autoencoders, which allow projecting input features to a more compact latent space while discarding uninformative features, or on transformers [26, 49, 35], like BERT [9]. Since we aim at managing generic input statistics, we will use a simple Fully Connected (FC) layer for the embedding creation, showing it suffices to compress the statistics and remove redundancy in the input.
A few researchers have started using entities in the classification pipeline (e.g. source IP address, or targeted ports). However, they represent such categorical features through One-Hot Encoding (1HE) [23, 45, 44, 17, 40], where each distinct IP address or port is treated as a binary category. Such an encoding technique suffers from scalability problems. Indeed, in the case of IP addresses, the size of the representation might grow up to 2^32 (2^128) when encoding the whole IPv4 (IPv6) address space. This leads to (i) computational and memory issues and (ii) the curse of dimensionality problem [5].
Another emerging trend is the adoption of Word2Vec [28, 29] or Graph Neural Networks (GNN) [35, 47] to produce
embeddings for network entities [38, 12, 8, 14, 13]. They provide rich representations of entities by solving a self-
supervised task (i.e. requiring no labels). We here adopt Word2Vec as a solution to learn embeddings from entities.
Our MAE architecture is however generic and can encompass other representations too (see our preliminary work on
GNN embeddings [13]). Indeed, GNNs and others have the potential to create novel features that our MAE architecture
would seamlessly integrate into the learning pipeline.
Finally, toward multi-modality, in which heterogeneous features not originally comparable are merged in a common latent space, only a few works [25, 48] have started exploring data fusion approaches (see also [11] for a generic survey). A common drawback of previous efforts is that they propose ad-hoc, end-to-end DL architectures. During the training phase, the classifier learns an internal representation that yields the best separation of the classes in the original feature space. Even though such representations often provide excellent results for the specific task, they are tied to the considered scenarios and cannot be applied to other cases.
The main difference in our work from the previous DL proposals is that we design a generic DL architecture to generate
intermediate embeddings from the original multi-modal data. We build on our previous work [19] but extend and
generalise the approach by proposing a multi-modal autoencoder as architecture to derive generic embeddings from
raw features. Such embeddings can then be used to solve the downstream task not only with DL approaches but also
with simpler shallow learning models. They pave the road for a single generic architecture that could be used for
heterogeneous tasks.

[Figure: Multi-modal Autoencoder diagram — raw measurements (individually or concatenated) pass through quantities adaptation (Payload → GRU, Statistics → FC, Sequences → 1D-CNN) and entities adaptation (IPs → Word2Vec embeddings, Ports → Word2Vec embeddings); the l1-sized outputs are merged into the integration module (layers of l2, l3, l4 neurons producing the encoded features), then split and decoded by symmetric adaptation layers that reconstruct the inputs.]
Figure 3: Multi-modal autoencoder architecture.

3 Representation Learning for Traffic Analysis


The prevalent approach in DL-based solutions for network traffic analysis involves crafting ad-hoc architectures
customised for specific tasks. However, these tailored solutions, due to the diverse range of tasks involved, frequently
necessitate extracting specific features through a time-consuming feature engineering process.
In contrast, we advocate the usage of representation learning. We propose a generic and flexible architecture to produce
an intermediate compact representation of network data by merging the informative content of heterogeneous features
(i.e. entities and quantities). This intermediate representation serves the downstream task, reducing the effort required
in the feature engineering stage and paving the road for unexplored solutions involving both quantities and entities.
The choice of Autoencoders as the key model for the proposed architecture is a natural one due to their self-supervised learning abilities and capacity for dimensionality reduction. They efficiently compress input features, retaining crucial information while avoiding the training cost of more complex architectures, such as transformers.
Our representation learning architecture is logically composed of two main blocks as sketched in Figure 3. First, we
consider adaptation modules whose goal is to extract useful representations that are tailored to each input type, where
a specific DL architecture deals with each specific traffic data. Each module will produce as output a representation
of the original data in a space of l1 dimensions. The second step consists of an integration module that merges and
compresses the information captured by each adaptation module into a compact embedding space.
The overall architecture results in a Multi-modal Autoencoder, which is self-trained to minimise the reconstruction error between the input measurement samples (on the left) and the reconstructed output samples (on the right). Once trained, we discard the decoder and use the encoder to compute the embeddings, mapping each input sample into the intermediate representation space R^{l4} (coloured nodes in Figure 3). We call this representation E(s).

3.1 Raw measurements

We assume a set of monitors i ∈ I gathers raw measurements M^i(s) concerning samples s ∈ S. Samples s are the objects of the final downstream task. Samples can be TCP/UDP flows, hosts (identified by their IP address), domain names, etc. The i-th module can be a DPI monitor that extracts per-packet information (e.g. packet length, source/destination IP addresses, protocol fields, etc.), a flow monitor, such as NetFlow, that aggregates statistics about the flow properties (e.g. flow duration, flow size, client/server IP addresses, mean packet size, etc.), or a firewall that observes the sequences of packets reaching the internal networks, e.g. sequences of ports contacted by a client, sequences of IP addresses reaching a web server, etc.
In general, the i-th measurement module will expose a set of measurements for each sample s:

M^i(s) = {m^i_1(s), m^i_2(s), ..., m^i_{n_i}(s)} ∈ R^{n_i},

where n_i is the number of measurements such a module extracts.
Table 2 exemplifies some measurements that our architecture supports. As previously advocated, network data entails
two distinct modalities, i.e. entities and quantities. Quantities are real-valued numbers that convey some salient
characteristics. Quantities are already in numerical form and suitable for ingestion into ML models. Some classic
quantities include statistics at the flow level like minimum, maximum, and average packet size, or sequences of
numerical values like the length of the first k packets in a flow.

Table 2: Examples of typical features exposed by network monitors that we can consider to create an embedding of
samples.
            Measurement   Description

Quantities  Payload       Raw values of first n bytes of the L7 payload
            Statistics    Min, max, avg, std of packet size per flow
                          Min, max, avg, std of packet inter-arrival time per flow
                          Min, max, avg, std of ports contacted by clients
                          Min, max, avg, std, ... of other metrics
            Sequences     Length of the first k packets per flow
                          Inter-arrival time of the first k packets per flow
                          TCP window size of the first k packets per flow
                          Other quantities over time

Entities    IP address    Sequence of client (server) IP addresses observed over time
            Ports         Sequence of client (server) ports observed over time
            AS            Sequence of DNS names resolved over time
            Other         Sequence of other entities over time

Conversely, entities represent categorical features that are not in an ordered relationship. They are traditionally
represented in ML tasks using simple One-Hot Encoding. Such a representation fails, however, to capture important
properties of the entities. For example, IP addresses or TCP/UDP ports are typical entities used for several traffic
classification problems. However, One-Hot Encoding does not scale. Noticing the similarity between entities and words
in natural language, we promote the use of NLP model pre-training to learn useful representations from sequences of
entities, hence the need for some pre-adaptation layer (see below).
For entities, we consider sequences of observations of samples, let them be IP addresses, ports, DNS names, or others.
The order and context in which such entities co-occur carry valuable information worth feeding as input to the ML
model to increase its chances of succeeding in the task. For instance, a sequence of server names contacted over time by
a client may reveal the application such client is running. Similarly, the sequence of ports a sender checks on a server
may reveal which type of scan it performs.

3.2 Adaptation modules

We propose to use specifically designed DL adaptation modules to project each feature into a common space of size l1 before integrating them using the MAE. In the following, we describe this specific step for each input type. For a more detailed background and formalisation of the architectures used in the adaptation modules, refer to Appendix A.

3.2.1 Sequences of Entities adaptation


We leverage the fact that entities are typically observed as a sequence over time, e.g. the sequence of server IP addresses
contacted by a client or the sequence of ports scanned during a port-scan attack. We here advocate the use of language
model pre-training to learn representations for each entity (e.g. the IP address or the port) by exploiting their co-
occurrences over time as observed in historical data. In particular, we leverage Word2Vec [28, 29], an NLP technique
relying on artificial neural networks to map words occurring in a sentence into a context-aware latent space (i.e. words
belonging to similar contexts appear close in the latent space) in a self-supervised manner.2 By defining the context of a
target word as the set of the n words surrounding it in a sentence, we adopt the skip-gram [28, 29] technique, which
consists of predicting the probability that a word belongs to the context of the target word. n is a user-defined parameter
impacting the granularity of the detected context.
Leveraging Word2Vec, given a sequence {s_1, s_2, s_3, ...} of entity observations, we map each entity s_j → u_j ∈ R^V, where u_j is the embedding of s_j in the V-dimensional space. As said, prior works have shown that Word2Vec can extract a valuable representation from sequences of IP addresses [38, 12, 14, 23, 8]. Here, to represent sender IP addresses, we consider their co-occurrence when targeting the same port or host in a given time window.
Generalising the approach to all entities, we propose to use it systematically to assign a compact yet rich feature vector
to each entity. For instance, we generate port embeddings feeding Word2Vec with the sequence of server ports targeted
by each sender as proposed by authors of [8].
2 Note that other representations, e.g., obtained with GNNs [13], could be used in this step. Including new adaptation modules in our architecture is trivial and we leave the benchmark of their impact for future work.

The dimension l1 of the space is a parameter that impacts the embedding quality. Any choice of l1 ∈ [50, 200] is sufficient to produce valuable embeddings. We investigate its impact in Figure 10.
Given that Word2Vec is a self-trained architecture, we train the embeddings separately from the MAE and feed the results directly as input to the integration module.
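To make the entity adaptation concrete, the following minimal sketch shows how sequences of entities (here, the server ports contacted by each sender) could be embedded with a skip-gram Word2Vec model using the gensim library. The tiny corpus, the window size and the embedding dimension are illustrative assumptions, not the exact settings of our released code.

# Minimal sketch: skip-gram Word2Vec over sequences of entities (illustrative parameters).
from gensim.models import Word2Vec

# Each "sentence" is the sequence of entities observed for one context,
# e.g. the server ports contacted by a single sender within a time window.
corpus = [
    ["443", "80", "8080", "443"],   # sender A
    ["22", "23", "2323", "80"],     # sender B
    ["53", "443", "853"],           # sender C
]

model = Word2Vec(
    sentences=corpus,
    vector_size=128,  # embedding dimension (l1); assumed value
    window=5,         # context size n; assumed value
    sg=1,             # skip-gram formulation
    min_count=1,
    workers=4,
)

u_443 = model.wv["443"]  # embedding u_j of entity "443", a 128-dimensional vector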

3.2.2 Quantities adaptation


Quantities adaptation modules are neural networks tailored to each quantity type and trained with self-supervision to
extract salient features from quantities.

Application payload adaptation Even with the increasing adoption of encrypted protocols (such as QUIC, H3, TLS
1.3, DoH, etc.), network traffic still includes portions that are transported in plain text, such as specific protocol headers.
Indeed, previous works [4, 31] have shown that training a NN directly with raw payload results in models able to
perform e.g. flow classification with high accuracy. Our architecture aims to streamline the traffic analysis problem by
combining multiple types of features. We thus introduce an adaptation module that handles payload directly, leaving to
the Integration Module (described next) the task of learning the salient information from such features (if any).
Given a flow and the first n bytes of payload, we propose adapting it using an embedding layer, which allows mapping
the categorical values taken by each byte into a compact space, followed by an RNN, which allows leveraging of
information carried by consecutive portions of the payload, as proposed in [4, 31]. Here, we use RNNs as they are
widely used for learning sequential data prediction problems, but any sequence-friendly neural network architecture
could fit the purpose.
In our experiments, we use a GRU, a recurrent unit similar to a Long Short-Term Memory (LSTM) but with a simpler gating mechanism that allows the network to forget very old correlations. It has fewer parameters than an LSTM and often converges faster. The essential operation of LSTM or GRU networks is to hold the required information and discard the information that is not helpful for the prediction. The intuition here is to let the GRU exploit the correlation found in nearby bytes in
the payload, e.g. server names found in clear in the Server Name Indicator (SNI) in TLS, or specific values in TCP
Options or IP headers.
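As an illustration of this adaptation module, the sketch below combines a byte embedding layer with a GRU to map the first 32 payload bytes into an l1-dimensional vector. We use PyTorch here purely for illustration; the layer sizes are assumptions and not necessarily those of our released implementation.

# Sketch of a payload adaptation module: byte embedding + GRU (assumed sizes).
import torch
import torch.nn as nn

class PayloadAdapter(nn.Module):
    def __init__(self, emb_dim=16, l1=128):
        super().__init__()
        self.byte_emb = nn.Embedding(256, emb_dim)        # one embedding per byte value
        self.gru = nn.GRU(emb_dim, l1, batch_first=True)  # summarises the byte sequence

    def forward(self, payload_bytes):                      # (batch, n_bytes), integers in [0, 255]
        x = self.byte_emb(payload_bytes)                   # (batch, n_bytes, emb_dim)
        _, h = self.gru(x)                                 # final hidden state: (1, batch, l1)
        return h.squeeze(0)                                # (batch, l1) adapted representation

adapter = PayloadAdapter()
out = adapter(torch.randint(0, 256, (4, 32)))              # first 32 payload bytes of 4 flows -> (4, 128)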

Traffic statistics adaptation   Consider statistics M^i(s) = {m^i_j(s)} extracted by the monitor i for a sample s. Examples of m^i_j(s) include the minimum, maximum, average, standard deviation, histograms, etc. For instance, if s is a TCP or UDP flow, the monitor computes such statistics on all flow packets. Instead, if s is a host, statistics derive from the set of contacted ports, the set of contacted server IP addresses, etc.
Given a set of monitors I, we expect such statistics to be correlated among them (e.g. minimum, mean, median, etc.), and among different measurements (e.g. statistics on the number of packets and the number of bytes, on sent and received traffic, etc.). As such, we encode such statistics using a Fully Connected (FC) layer to reduce such correlation while compressing and adapting the input data to the integration module. For this, as input, we feed the FC with the concatenation of all statistics for a given sample s, i.e. M^I = {M^1(s), M^2(s), ...}. As output, we obtain a projection in a reduced space of size l1.

Sequence of quantities adaptation When the order of measurements is important, it is common to use sequences of
observations of a sample s as input features. For instance, the sequence of the first k packet lengths and inter-arrival
times of a given flow is often used for the flow classification [4, 46, 37].
As commonly done in past works (see Table 1), we encode such sequences using a CNN. We use a one-dimensional
CNN (1D-CNN) that lets us obtain compact information from such correlated measurements and reduce the overall
complexity of the model in terms of the number of parameters. As before, the input of the CNN will be the sequences
of all measurements. The output will be an l1 projection that will be given as one of the inputs to the integration module.
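A minimal sketch of such a sequence adaptation module is shown below, again in PyTorch and with illustrative layer sizes: a small stack of 1D convolutions summarises the first k packet-level series (e.g. lengths, inter-arrival times, TCP window sizes) into an l1-dimensional projection.

# Sketch of a sequence adaptation module: 1D-CNN over the first k packet-level series.
import torch
import torch.nn as nn

class SequenceAdapter(nn.Module):
    def __init__(self, n_series=3, l1=128):
        # n_series: e.g. packet length, inter-arrival time, TCP window size
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_series, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                       # collapse the time dimension
        )
        self.proj = nn.Linear(64, l1)                      # project to the shared l1 space

    def forward(self, x):                                  # (batch, n_series, k)
        return self.proj(self.conv(x).squeeze(-1))         # (batch, l1)

out = SequenceAdapter()(torch.randn(4, 3, 32))             # 4 flows, 3 series, k = 32 -> (4, 128)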

3.3 Integration module

The goal of the integration module is to fuse the different measurements that derive from different monitoring modules and their corresponding adaptation layers. This is particularly important when the number of input measurements is large, as is commonly the case when using DL approaches that consume all possible data the various monitors produce, without performing a prior feature selection or engineering.
To this goal, we propose to use a symmetric autoencoder architecture made of five layers, of l2, l3, l4, l3, l2 neurons with l2 > l3 > l4. As input, it takes the output of each adaptation module. The autoencoder then compresses this information so that the inner layer of l4 neurons produces a compact representation of the input measurements E(s).

Table 3: Downstream tasks used in our validation and summary of the datasets.

Task   Dataset        Sample     Label           # Samples   # Classes   Balance
T1     MIRAGE [2]     Flows      Mobile apps     44k         16          0.94
T2     DARKNET [12]   IP addr.   Coord. groups   14k         13 + 1      0.4
T3     ISCX [1]       Flows      Traffic type    609         5           0.82

To train such a network, the architecture is symmetric so that, starting from the encoded features, a mirrored structure rebuilds the input. The training of the autoencoder is self-supervised and does not require any labels.
The overall system results in a multi-modal autoencoder that includes the quantities adaptation modules as well. Except
for the entities adaptation modules which are pre-trained separately, the training of the MAE includes both the quantities
adaptation modules and the integration module. The overall system is trained to minimise the reconstruction error from
the original measurements to the reconstructed outputs.
Since the MAE is an Autoencoder, whose goal is reconstructing the input data at the output layer, we essentially solve a
regression task. Hence, we use a linear activation function in all the decoding output layers (i.e. quantities adaptation
outputs and entity embeddings) to achieve an unbounded range of output values favouring the input reconstruction.
Additionally, we use the Mean Squared Error (MSE) as the loss function of each adaptation module. To balance the
contribution of each measurement (quantities and embedded entities), we weigh the losses proportionally to the number
of features (embedding size) obtaining the final weighted MSE for the whole MAE.
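The sketch below illustrates one possible implementation of the integration module and of the feature-count-weighted MSE described above, assuming the adaptation outputs are already available as tensors. Layer sizes are assumed to match the values reported later in the paper (l2 = 512, l3 = 256, l4 = 64), and the exact weighting scheme is our interpretation of weighting the losses proportionally to the number of features.

# Sketch of the integration module: a symmetric l2-l3-l4-l3-l2 autoencoder over the
# concatenated adaptation outputs, trained with a feature-count-weighted MSE.
import torch
import torch.nn as nn

class IntegrationModule(nn.Module):
    def __init__(self, in_dims, l2=512, l3=256, l4=64):    # in_dims: size of each modality
        super().__init__()
        d = sum(in_dims)
        self.encoder = nn.Sequential(nn.Linear(d, l2), nn.ReLU(),
                                     nn.Linear(l2, l3), nn.ReLU(),
                                     nn.Linear(l3, l4))
        self.decoder = nn.Sequential(nn.Linear(l4, l3), nn.ReLU(),
                                     nn.Linear(l3, l2), nn.ReLU(),
                                     nn.Linear(l2, d))     # linear output for reconstruction
        self.in_dims = list(in_dims)

    def forward(self, parts):                              # list of (batch, dim_i) tensors
        z = self.encoder(torch.cat(parts, dim=1))          # E(s), the multi-modal embedding
        recon = torch.split(self.decoder(z), self.in_dims, dim=1)
        return z, recon

def weighted_mse(parts, recon):
    # Weigh each modality's MSE by its number of features (embedding size).
    total = sum(p.shape[1] for p in parts)
    return sum(p.shape[1] / total * nn.functional.mse_loss(r, p)
               for p, r in zip(parts, recon))

parts = [torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 64)]   # adaptation outputs
z, recon = IntegrationModule([128, 128, 64])(parts)
loss = weighted_mse(parts, recon)                                        # to be minimised during training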
Notice that each adaptation module is designed specifically for the raw measurement it adapts. For a detailed description of the adaptation modules, refer to Appendix B. In the following, we compare the performance of using different feature sets when facing three downstream traffic classification tasks.

4 Validation tasks and datasets


We here focus our evaluation on supervised traffic classification tasks. These tasks allow us to compare the performance
of our architecture against state-of-the-art solutions objectively. Our architecture is however generic and can be
employed in other downstream tasks too.
To create a solid baseline, we select a set of tasks from previous articles that proposed end-to-end classifiers and
contribute open datasets to the community. We select three traffic classification tasks requiring different sets of features,
whose datasets are well-documented.3 The presence of labels (i.e. a ground truth) allows us to objectively evaluate each
approach, assessing and comparing the quality of the representations used as input.
Table 3 summarises the downstream tasks and datasets we use. It shows the task number, dataset name and reference,
the target samples to be classified, and the type of labels, followed by some statistics about the dataset size, the number
of classes, and the balancing coefficient.4
We select cases with different numbers of classes, class balances, and dataset sizes. Next, we briefly describe the tasks
and provide more details about the datasets.
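For reference, the balance coefficient of footnote 4 (the Shannon entropy of the class distribution divided by the logarithm of the number of classes) can be computed with a few lines of Python; the class counts below are made up purely for illustration.

# Sketch: balance coefficient = Shannon entropy of the class distribution / log(#classes).
import math
from collections import Counter

def balance(labels):
    counts = Counter(labels)
    n = len(labels)
    entropy = -sum(c / n * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))

print(balance(["a"] * 50 + ["b"] * 45 + ["c"] * 55))   # close to 1: well balanced
print(balance(["a"] * 95 + ["b"] * 4 + ["c"] * 1))     # about 0.2: strongly unbalanced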

4.1 T1 : Identification of mobile apps traffic

The identification of the applications generating traffic is one of the classic examples of TC. This problem has been
approached with multiple formulations, but in general, the goal is to identify which application generated particular
traffic flows.
We use as our first task the problem and dataset presented in [2]. The authors focus on classifying TCP and UDP/QUIC
flows as generated by one among multiple mobile apps. They collected two years of traffic from volunteers who
installed a custom monitoring app, from 2017 to 2019, releasing the MIRAGE-2019 dataset (hereafter called MIRAGE).
Each flow in the dataset is labelled with the name of the mobile app that generated it or as background traffic (filtered
out in a post-processing step). There are 16 different applications, from social network apps to e-commerce and news
apps.
3 Ideally, the power of alternative representations could be better compared using a single dataset in multiple downstream tasks. However, the networking community still misses established and generic datasets and benchmarks.
4 The coefficient is defined as the Shannon entropy (with respect to the classes of the problem) divided by the logarithm of the number of classes. The coefficient is closer to 1 when classes are balanced.

The dataset comprises 44k flows (our sample set S), with well-balanced classes (coefficient equal to 0.94). Each flow is
characterised by the sequence of packet inter-arrival times and length, the sequence of the TCP receiver window values,
the application payload of up to 32 bytes, the server IP address, and its /24 subnet.

4.2 T2 : Classification of darknet traffic

Our second task is the classification of darknet traffic [12, 22, 23]. Darknets are networks used to monitor unsolicited
traffic, such as Internet scans and attacks. They are composed of IP addresses announced in routing protocols, but
without hosting services. The identification of interesting events in darknet traffic is hard, as hundreds of thousands of
attackers and scanners continuously reach the darknet IP range. However, previous work has shown that IP addresses
belonging to a limited number of botnets and scanners dominate this traffic.
The task is to classify the sender IP addresses (our sample set S) as belonging to one of the multiple classes of known
scanners and attackers. For this problem, authors have leveraged different measurements, including raw per-sender
traffic statistics and features obtained using embedding techniques such as Word2Vec [12] or feature engineering and
autoencoders [22, 23].
We here rely on the public dataset [12], which we call DARKNET. There are 13 categories of known senders which
include several benign security services, research scanners, and known botnets. The dataset also includes one extra
class with all unknown IP addresses.
In total, DARKNET collects one month of data reaching one /24 darknet. By considering all senders that sent at least five
packets in such period, we have 14k sender IP addresses, among which there are 5k addresses from known classes;
the rest belong to the unknown class. In contrast to the other tasks, this dataset shows strong class unbalance, with a
coefficient equal to 0.4.
As measurements M^i(s), here we consider monitors that expose the sequence of sender IP addresses as they appear in the trace, statistics of the TCP/UDP ports they target, statistics of the packet lengths each sender sent, and the /24 subnet the sender belongs to.

4.3 T3 : Classification of traffic flows

Our last task deals with classifying traffic flows (our sample set S), i.e. video streaming, voice, browsing, etc. This task
is also a classic in TC and is particularly hard because most services rely on encrypted protocols (e.g. TLS). Multiple
authors have proposed methodologies, usually based on supervised classifiers that take packet-level measurements as
input.
We here rely on the ISCXVPN2016 [1] (ISCX for brevity) dataset introduced in [18, 1]. We consider only the non-
tunnelled traffic in the dataset, which includes five traffic categories: VOIP, streaming, chat, mail, and file transfer. The
dataset is very limited, with 609 flows that are well-balanced among the five classes (coefficient 0.82). Such a small
dataset poses a challenge for DL training, and we thus use it to check how the MAE behaves with scarce data.
Here, each TCP or UDP flow is labelled with a category. Each flow s is characterised by several measurements M^i(s): the server IP address and its /24 subnet, the first 32 bytes of application payload, statistics collected from the TCP payload, and the sequences of the first k = 32 receiver window sizes, packet inter-arrival times, and packet lengths.

5 Assessing the MAE Embeddings

5.1 Experimental Setup

Using the three supervised downstream tasks, we evaluate the results obtained with the MAE embeddings and compare
them against traditional approaches. In this section, we provide a high-level summary of the results using our approach
and state-of-the-art alternatives. In the next section, we present an ablation study to assess the impact of the different
components of our method.
Given a set of input measurements M(s), we train a classification model f : f(s) → C(s), where C(s) is the class of s. For each task, we present two experiments: (i) we naively achieve a multi-modal representation of the different inputs by simply concatenating the embedded entities together with the remaining raw data to obtain a single input vector M(s) = concat(M^1(s), M^2(s), ...), and then train a model f : f(M(s)) → C(s); (ii) we consider the representations E(s) obtained with our MAE, and train a model f : f(E(s)) → C(s).

Table 4: Macro and weighted average F1-Scores across tasks. Reference is the macro average F1-Score reported in the original work. For T3, this is missing, since we use only a fraction of the original dataset. The best results are in bold.

            Macro F1-Score               Weighted F1-Score     Trainables
      Reference   Concat.   MAE          Concat.   MAE         Concat.       MAE
T1    0.88        0.89      0.87         0.90      0.88        29 × 10^4     17 × 10^4
T2    0.96        0.98      0.98         0.99      1.00        33 × 10^4     17 × 10^4
T3    –           0.66      0.68         0.74      0.75        28 × 10^4     17 × 10^4

For each M (s), we run 5-fold cross-validation experiments reporting average statistics. When validating models for T2 ,
we ignore the unknown class as done by the original authors of the datasets. For T1 and T3 , we use stratified sampling
to guarantee that flows belonging to the same application session appear either in the training or in testing sets.
To achieve an architecture that avoids excessive computational times and memory requirements, we fix the sizes of the hidden layers of the integration module, l2 and l3, to 512 and 256 neurons, respectively. These numbers have been defined after a coarse grid search discussed in Appendix C.
After training the entities adaptation modules and the MAE, we extract the resulting embedding E(s) of a sample and use it as input to the DL classifier f. We consider the same DL architecture for f in all the experiments. Specifically, we use a fully connected feed-forward network with two hidden layers (of 512 and 256 neurons, respectively), and an output layer with as many neurons as task classes. To prevent overfitting, we use a dropout of 30% for each layer. The activation function of the hidden layers is a standard ReLU, whereas the output layer is activated through the Softmax function. We use class balancing during training to weight the loss function and minimise the categorical cross-entropy using an ordinary Adam optimiser [24].
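The sketch below illustrates a possible implementation of this downstream classifier, again in PyTorch for illustration only. Note that the softmax is folded into the cross-entropy loss, as is customary in this framework; the class weights and the data batch are placeholders.

# Sketch of the downstream classifier f: two hidden layers (512, 256), 30% dropout,
# softmax output (folded into the cross-entropy loss), class-weighted loss, Adam.
import torch
import torch.nn as nn

def make_classifier(in_dim, n_classes):
    return nn.Sequential(
        nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(256, n_classes),                         # logits; softmax applied inside the loss
    )

f = make_classifier(in_dim=64, n_classes=16)               # e.g. E(s) with l4 = 64 and the 16 classes of T1
class_weights = torch.ones(16)                             # placeholder: replace with inverse class frequencies
criterion = nn.CrossEntropyLoss(weight=class_weights)      # categorical cross-entropy
optimizer = torch.optim.Adam(f.parameters())

x, y = torch.randn(32, 64), torch.randint(0, 16, (32,))    # dummy batch of embeddings E(s) and labels
optimizer.zero_grad()
loss = criterion(f(x), y)
loss.backward()
optimizer.step()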

5.2 Classification performance

In Table 4 we report the macro and weighted average F1-Scores5 resulting from the classification tasks, along with the
number of trainable parameters of the classifiers as a proxy of the computational cost of classifier training.

T1 – MIRAGE dataset   Focus on the MIRAGE dataset (T1). The best results are obtained when using concatenation in a single input vector, reaching 0.89 macro average F1-Score. Yet, the classification using the MAE is very similar to the raw concatenation case, losing just 0.02 of macro average F1-Score. Intuitively, the compression performed by the MAE reduces the redundancy of information carried by correlated features with a minimal impact on the final results.
The slightly better results of the raw concatenation come at the expense of downstream model complexity. Indeed, the concatenated input layer has about 300 neurons, resulting in 29 × 10^4 trainable parameters, while the MAE with just l4 = 64 input neurons halves the number of trainable parameters down to 17 × 10^4, achieving comparable classification performance.
These results are also in line with the original paper, where the authors report 88% macro average F1-Score [3] using an end-to-end DL pipeline that relies on the payload and the sequences of packet measurements as input. The MAE features achieve performance in the same range with a more compressed representation.

T2 – DARKNET dataset   Move now to the DARKNET dataset (T2). Firstly, both the naive multi-modal solution and the proposed MAE result in a slight improvement compared to the baseline.
The concatenation of all raw measurements here performs slightly worse than using the MAE (weighted average F1-Score) – specifically, one class gets much worse results. This happens because some measurements bring unimportant or redundant information for this task, polluting results for the concatenation case. This is an expected result given the importance, for ML practitioners, of removing noisy features that may degrade the performance of classifiers. Notice that, along with the slightly improved performance, the MAE also allows a reduction of the number of trainable parameters of about 50% – 17 × 10^4 compared to the 33 × 10^4 of the raw concatenation.

T3 – ISCX dataset Finally, check the results obtained with the ISCX dataset (T3 ). Recall that we have a limited dataset
of about 600 samples and thus we expect this case to be a challenging scenario for any machine learning algorithm, in
particular those based on DL models.
5 Macro average is the arithmetic mean of the individual class scores, while weighted average weights classes by their sample sizes. The latter better represents results in practice. However, for highly unbalanced cases, it provides biased results for the minority classes.

[Figure: violin plots of per-class F1-Scores — (a) T1, MIRAGE dataset; (b) T2, DARKNET dataset; (c) T3, ISCX dataset.]
Figure 4: F1-Score distributions per class for each classification task and different input measurements.

Even though the overall classification performance cannot be considered satisfactory (average F1-Score < 80%), we can still confirm the considerations drawn from the previous tasks: the MAE embeddings allow the DL classifier to achieve performance better than the one obtained with the crude concatenation of raw features, reducing the number of trainable parameters from 28 × 10^4 down to 17 × 10^4.
In a nutshell, also in this case the MAE embeddings capture key characteristics of the data, even with such a small sample, reducing the trainable parameters by ≈ 50%.

Takeaway: The experiments show that for TC problems the same generic MAE architecture can be used to face very different supervised tasks. On the one hand, the adaptation modules allow heterogeneous measurements to be considered as input. On the other hand, the integration module compresses the input data, reducing downstream-model trainable parameters by ≈ 50% without losing performance.

6 Ablation Study

We now delve into an ablation study. Given that the MAE's key advantage is its ability to combine multiple features coming from the multiple modules, we focus on assessing the impact of adding the integration module. We here consider only the best parameters for the MAE architecture, which have been found via a grid search summarised in Appendix C.6
For each downstream task, we compare the classification performance of MAE embeddings against the performance obtained using a single measurement. These experiments show in particular the impact of our multi-modal strategy on performance. We consider separate classifiers f^i trained using only individual measurements, i.e. given M^i(s), we train a model f^i : f^i(M^i(s)) → C(s). We embed entities using Word2Vec, and we feed the other measurements directly as input to each separate classifier.
We summarise the results for the three downstream tasks. Figure 4 shows a summary of the results using violin plots, whereas Table 5 reports the details with macro average F1-Scores. Each violin in Figure 4 represents the distribution of the F1-Scores for all classes in a given classification task. Empty violins represent the distributions for models f^i

6 Another valid ablation study would be the evaluation of the importance of individual layers of a pre-trained MAE on results. We leave this study for future work.

Table 5: Macro average F1-Scores across tasks when using independent features and multi-modal features. The best
results are in bold.
                            T1     T2     T3
Entities      IP address    0.82   0.96   0.47
              Ports         –      0.91   –
              Subnet        0.76   0.70   0.54
Quantities    Payload       0.53   –      0.35
              Statistics    0.78   0.98   0.65
              Sequences     0.59   –      0.59
Multi-modal   MAE           0.87   0.98   0.68

trained with separate sets of measurements. Green violins refer to the results of using the MAE as input to the DL
classifier.

MIRAGE dataset Start from the MIRAGE dataset (T1 ) in Figure 4a. When using a single set of raw measurements (top
white violins), the distribution of the F1-Score across classes is widely spread.
First, the classifier trained with entity (IP) embeddings alone, a feature not used in the original task [3], achieves better performance than those trained with quantities, reaching 0.82 macro average F1-Score. This further motivates our call for the systematic inclusion of entity information in addition to classic quantities. For quantities, the model trained with flow traffic statistics obtains satisfactory results (0.78 macro average F1-Score).
MAE leads to the best results of 0.87 macro average F1-Score compared to the independent measurements. This result
confirms that, for this use case, the multi-modal architecture with all adaptation modules produces a rich and compact
representation, and improves performance over a DL classifier trained on individual measurements.

DARKNET dataset   Consider now the DARKNET (T2) dataset in Figure 4b and Table 5. Here, we see again that classifiers trained with separate sets of measurements deliver performance that varies greatly according to the employed measurement set. Only traffic statistics let the classifier reach solid results (macro average F1-Score of 98%).
The MAE (with all its adaptation modules) shows an important advantage in this case: The methodology transparently
overcomes the presence of uninformative features in these input sets. We see in the green violin plot in Figure 4b that
the distribution of F1-Scores of a DL classifier with MAE encoded features is more compact, achieving excellent results.
More important than searching for performance improvements in this particular downstream task, we stress the fact
that the MAE embeddings are obtained with a generic feature extraction approach, thus streamlining the procedure for
different problems.

ISCX dataset   Finally, focus on the ISCX dataset (T3) results. We see in Figure 4c that the various distributions of F1-Scores are much wider than in previous tasks. Contrary to the previous tasks, IP address embeddings have worse performance compared to quantity features (0.46 macro F1-Score).
Although the overall MAE classification performance remains unsatisfactory (0.68 macro F1-Score), also in this case our proposed architecture can capture key characteristics of the data automatically. Indeed, the MAE preserves the informative content of heterogeneous features, discarding useless information and bringing an improvement compared to the individual measurements (from +3% compared to traffic statistics to +33% compared to L7 payload).

Takeaway: The obtained results underscore the advantage of the MAE in preserving informative content across diverse features, and in effectively discarding irrelevant or uninformative content. The combination of several features results in better performance overall, even in the presence of uninformative features. This is possible thanks to the Integration Module, which compresses information and discards any unnecessary data. The complete MAE thus leads to a more streamlined and enhanced representation of network data across various tasks.

7 Embedding properties
The previous sections show that MAE embeddings are informative for traffic classification problems. These results
raise the question of whether they carry sufficient discriminative power to support other downstream tasks too. Here we
test whether the embeddings would support distance-based algorithms.
[Figure: line plot of macro average pC versus K (K = 1 to 20), comparing MAE (solid) and Concat (dashed) representations for T1, T2, and T3.]
Figure 5: Macro average K-NN class probability.

Recall from Figure 2d that we are searching for rich representations that enable several downstream tasks. To gauge the quality of the embeddings, we investigate the local neighbourhood of samples of known classes, assessing whether their neighbourhood includes samples of the same class.

7.1 Methodology

We study how samples with ground-truth labels are positioned in the latent space. We define a metric that quantifies the pureness of the neighbouring samples as follows. Given a sample s_i of class c_i and the set of its K nearest neighbours N_i^K, we define the K-NN class probability for the sample s_i as:

p_{i,C}(K) = P( C(s_j) = c_i | s_j ∈ N_i^K )

In words, the metric indicates how pure the neighbourhood of s_i is, with p_{i,C} = 1 indicating that all K neighbours of s_i are of the class c_i.
Given K, we compute the macro average of p_C(K) by first computing the average among all samples of a given class and then averaging over classes. We use the cosine distance to find the neighbours, a metric often used when dealing with embeddings [6, 29].
We compare p_C(K) considering (i) the space defined by the concatenation of all raw measurements, and (ii) the MAE latent space.
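A possible implementation of this metric, using a scikit-learn nearest-neighbour search with cosine distance, is sketched below; the embeddings and labels are random placeholders.

# Sketch: K-NN class probability p_{i,C}(K) with cosine distance (random placeholder data).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_class_probability(X, y, K):
    # K + 1 neighbours because each sample is its own nearest neighbour.
    index = NearestNeighbors(n_neighbors=K + 1, metric="cosine").fit(X)
    _, idx = index.kneighbors(X)
    p = (y[idx[:, 1:]] == y[:, None]).mean(axis=1)         # per-sample pureness p_{i,C}(K)
    # Macro average: mean over the samples of each class, then mean over classes.
    return np.mean([p[y == c].mean() for c in np.unique(y)])

X = np.random.rand(500, 64)                                # e.g. MAE embeddings E(s)
y = np.random.randint(0, 5, size=500)                      # ground-truth class labels
print(knn_class_probability(X, y, K=5))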

7.2 Neighbourhood pureness

Figure 5 shows the macro averages of p_{i,C} when varying K for the MAE embedding (solid lines) and the concatenation of raw measurements (dashed lines). Colours distinguish the three use cases.
Focusing on T1, we see that the neighbourhood of samples is very polluted when using the brute concatenation of raw measurements (red dashed line). Looking back at Figure 2c, the t-SNE shows some areas with clearly pure clusters but also two large clouds of samples of quite mixed classes. There, we expect distance-based algorithms to likely group elements of different classes into single clusters.
The MAE produces a latent space that completely reorganises the samples, in this case placing elements of the same class nearby in space (see the red solid line), although it was not trained for this purpose. This is precisely the effect observed in Figure 2d. The values of p_C(5) in Figure 2 reflect the improvement. Here we expect the MAE embedding to present a clear advantage when applying distance-based algorithms, or even shallow learners. We will test the latter case in the next section.
Similar trends emerge for T3 (green lines). In this case, we see a more pronounced decrease of p_C(K) as K increases. This dataset has few samples and the neighbourhood quickly starts including samples of other classes as K grows.
Considering T2 (blue lines), we observe similar results when K ≤ 3, with less polluted neighbourhoods when using MAE embeddings. Interestingly, as K increases, MAE neighbourhoods become more polluted than those resulting from simple concatenation. That is, the concatenation of measurements could be sufficient to map samples of the same class into the same areas of the metric space.

[Figure: horizontal bars of macro average F1-Score in T1 — DNN 0.87, RF 0.86, 7-NN 0.83.]
Figure 6: Classification performance in T1 comparing deep and shallow models with the MAE encoded features.

Here, the compression performed by the MAE slightly confuses the neighbourhoods of samples, which, however, still exhibit high p_C.

Takeaway: Although not trained for this purpose, the embeddings learned by the MAE exhibit nice properties pushing
same-class samples closer to each other in the embedding space. This hints at their usefulness for downstream tasks but
also opens the opportunity for successfully using small and simple models to solve these tasks, which we verify next.

8 Shallow Models

One appealing property of good-quality MAE encoded features is the possibility to use them for multiple downstream tasks, possibly with simple off-the-shelf ML algorithms. In this section, we show that the MAE encoded features allow simple shallow learning models to deliver good classification performance.
We repeat the supervised classification experiments of the previous sections using shallow learners with the MAE
encoded features. We select two simple algorithms, namely (i) a Random Forest (RF) [16] built using the Gini impurity index; and (ii) a K-Nearest-Neighbours (K-NN) classifier. For this experiment, we set K = 7, similarly to [12].
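The following sketch shows how such shallow learners could be run on the MAE embeddings with scikit-learn; the embeddings and labels are random placeholders, and the cross-validation setup is simplified with respect to the stratified procedure described in Section 5.

# Sketch: shallow learners on the MAE embeddings (random placeholder data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

E = np.random.rand(1000, 64)                   # MAE embeddings E(s)
y = np.random.randint(0, 16, size=1000)        # class labels

rf = RandomForestClassifier(criterion="gini")  # Random Forest built with the Gini impurity index
knn = KNeighborsClassifier(n_neighbors=7)      # 7-NN classifier

for name, clf in [("RF", rf), ("7-NN", knn)]:
    scores = cross_val_score(clf, E, y, cv=5, scoring="f1_macro")
    print(name, scores.mean())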
Results for the MIRAGE dataset are summarised in Figure 6. Numbers on the right-hand axis of the figure report the macro average F1-Score. Notice in Figure 6 that the Deep Neural Network (DNN) model, RF, and K-NN results are all comparable. The more powerful DNN model still provides better figures, with 87% macro average F1-Score, against 86% (83%) for the RF (7-NN). Yet, the goodness of the shallow models with the MAE is worth noticing. That is, the compact representation obtained with the MAE allows good performance in this downstream task even with the more lightweight shallow models.

Takeaway: Similar to other fields where representation learning is widely used [42], the condensed embedding extracted
by MAE allows simpler models to be competitive with deep models, compared to the raw concatenation of features.

9 Limitations

Our work has limitations and unexplored questions which we summarise here. First, while we have searched for the
best parameters for the MAE architecture we adopted, multiple other architectures could be employed for the same
purpose. For example, we have listed in Section 2 different alternatives for the adaptation modules (e.g. Graph Neural
Network embeddings). These approaches could enhance the representations learned from network traffic. We have not
tested such alternatives. Equally, finding the best NN architecture to serve as the Integration Module is still an open
problem that we do not tackle in this work.

Second, our work could be seen as a first step towards the creation of a foundation model for networks, i.e., a model trained on vast amounts of network data to solve a wide range of problems and use cases. Given the robust embedding properties generated by the MAE, we believe these representations hold significant promise in this direction, but other approaches are promising too. To arrive at such ambitious goals, we need to systematically extend the methodology to more use cases and datasets, which we do not address in this work.
Third, practical traffic analysis problems are known to suffer from both temporal and spatial generalisation issues
– that is, models learned from a network at a specific time may not apply to other networks or to different time frames
(e.g., due to drift). Moreover, it is known that the traffic mix changes continuously, e.g., with the appearance of new
applications and protocols, which changes the observable entities and quantities. Here we do not investigate the impact
of drift on the MAE.
Finally, we illustrate the application of the MAE using traffic classification problems. Traffic classification is still
possible thanks to plain-text fields present in protocol headers and payloads, as well as to side channels such as
inter-packet times and packet sizes. The deployment of further techniques and protocols to improve online privacy
would affect the MAE representations, as it would any other traffic classification approach.

10 Conclusions
In this paper, we proposed a generic architecture to learn vector representations from network data for traffic analysis.
First, noticing a bi-modality in the collected data, where several entities co-exist with several measured quantities, we
advocate the systematic use of NLP word embedding techniques to extract, in an unsupervised fashion, rich entity features
amenable to machine learning models. We further propose to use one adaptation module per measurement type, trained with
self-supervision to learn salient features from the various quantities, before compressing all entity and quantity features by
leveraging an autoencoder. We objectively evaluate our approach and the quality of the learned embeddings on three
different traffic classification downstream tasks, using open datasets for which we have labels and baseline results. Our
results (i) confirm the usefulness of entity embeddings, often under-used in traffic analysis tasks, and (ii) show that our
compressed multi-modal embeddings do not lose information when fed to downstream tasks, while streamlining the
learning pipeline. We release all our MAE models, together with our source code, in the hope of spurring the community
to search for better models and benchmarks, and to eventually develop foundation models for network applications.

Acknowledgements
The research leading to these results has been partly funded by the Huawei R&D Center (France), by the project
SERICS (SEcurity and RIghts In the CyberSpace - PE00000014) under the MUR National Recovery and Resilience
Plan funded by the European Union, as well as the ACRE (AI-Based Causality and Reasoning for Deceptive Assets
- 2022EP2L7H) and xInternet (eXplainable Internet - 20225CETN9) projects - funded by European Union - Next
Generation EU within the PRIN 2022 program (D.D. 104 - 02/02/2022 Ministero dell’Università e della Ricerca). This
manuscript reflects only the authors’ views and opinions and the Ministry cannot be considered responsible for them.

References
[1] VPN 2016 | Datasets | Research | Canadian Institute for Cybersecurity | UNB. URL https://www.unb.ca/cic/
datasets/vpn.html.
[2] Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, Valerio Persico, and Antonio Pescapé. MIRAGE:
Mobile-app Traffic Capture and Ground-truth Creation. In 2019 4th International Conference on Computing,
Communications and Security (ICCCS), 2019. doi: 10.1109/CCCS.2019.8888137.
[3] Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, and Antonio Pescapè. MIMETIC: Mobile encrypted traffic
classification using multimodal deep learning. Computer Networks, 2019. doi: 10.1016/j.comnet.2019.106944.
[4] Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, and Antonio Pescapé. Mobile Encrypted Traffic Classifi-
cation Using Deep Learning: Experimental Evaluation, Lessons Learned, and Challenges. IEEE Transactions on
Network and Service Management, 2019. doi: 10.1109/TNSM.2019.2899085.
[5] Richard Bellman. Dynamic Programming. Science, 1966. doi: 10.1126/science.153.3731.34.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee-
lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark
Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners. In Advances in
Neural Information Processing Systems, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
[7] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk,
and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine
Translation, 2014.
[8] Dvir Cohen, Yisroel Mirsky, Manuel Kamp, Tobias Martin, Yuval Elovici, Rami Puzis, and Asaf Shabtai. DANTE:
A Framework for Mining and Monitoring Darknet Traffic. In Computer Security – ESORICS 2020, 2020. doi:
10.1007/978-3-030-58951-6_5.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding, 2019.
[10] Mohamed Amine Ferrag, Leandros Maglaras, Sotiris Moschoyiannis, and Helge Janicke. Deep learning for cyber
security intrusion detection: Approaches, datasets, and comparative study. Journal of Information Security and
Applications, 2020. doi: 10.1016/j.jisa.2019.102419.
[11] Jing Gao, Peng Li, Zhikui Chen, and Jianing Zhang. A Survey on Deep Learning for Multimodal Data Fusion.
Neural Computation, 2020. doi: 10.1162/neco_a_01273.
[12] Luca Gioacchini, Luca Vassio, Marco Mellia, Idilio Drago, Zied Ben Houidi, and Dario Rossi. DarkVec: automatic
analysis of darknet traffic with word embeddings. In Proceedings of the 17th International Conference on emerging
Networking EXperiments and Technologies, 2021. doi: 10.1145/3485983.3494863.
[13] Luca Gioacchini, Andrea Cavallo, Marco Mellia, and Luca Vassio. Exploring Temporal GNN Embeddings for
Darknet Traffic Analysis. In Proceedings of the 2nd on Graph Neural Networking Workshop 2023, 2023. doi:
10.1145/3630049.3630175.
[14] Luca Gioacchini, Luca Vassio, Marco Mellia, Idilio Drago, Zied Ben Houidi, and Dario Rossi. i-DarkVec:
Incremental Embeddings for Darknet Traffic Analysis. ACM Transactions on Internet Technology, 2023. doi:
10.1145/3595378.
[15] Roberto Gonzalez, Filipe Manco, Alberto Garcia-Duran, Jose Mendes, Felipe Huici, Saverio Niccolini, and
Mathias Niepert. Net2Vec: Deep Learning for the Network. In Proceedings of the Workshop on Big Data Analytics
and Machine Learning for Data Communication Networks, 2017. doi: 10.1145/3098593.3098596.
[16] Tin Kam Ho. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis
and Recognition, 1995. doi: 10.1109/ICDAR.1995.598994.
[17] Jordan Holland, Paul Schmitt, Nick Feamster, and Prateek Mittal. New Directions in Automated Traffic Analysis.
In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021. doi:
10.1145/3460120.3484758.
[18] Eyal Horowicz, Tal Shapira, and Yuval Shavitt. A few shots traffic classification with mini-FlowPic augmentations.
In Proceedings of the 22nd ACM Internet Measurement Conference, 2022. doi: 10.1145/3517745.3561436.
[19] Zied Ben Houidi, Raphael Azorin, Massimo Gallo, Alessandro Finamore, and Dario Rossi. Towards a systematic
multi-modal representation learning for network data. In Proceedings of the 21st ACM Workshop on Hot Topics in
Networks, 2022. doi: 10.1145/3563766.3564108.
[20] Ting-Li Huoh, Yan Luo, Peilong Li, and Tong Zhang. Flow-Based Encrypted Network Traffic Classification
With Graph Neural Networks. IEEE Transactions on Network and Service Management, 2023. doi: 10.1109/
TNSM.2022.3227500.
[21] Jonas Höchst, Lars Baumgärtner, Matthias Hollick, and Bernd Freisleben. Unsupervised Traffic Flow Classification
Using a Neural Autoencoder. In 2017 IEEE 42nd Conference on Local Computer Networks (LCN), 2017. doi:
10.1109/LCN.2017.57.
[22] Michalis Kallitsis, Vasant Honavar, Rupesh Prajapati, Dinghao Wu, and John Yen. Zooming Into the Darknet:
Characterizing Internet Background Radiation and its Structural Changes, 2021.
[23] Michalis Kallitsis, Rupesh Prajapati, Vasant Honavar, Dinghao Wu, and John Yen. Detecting and Interpreting
Changes in Scanning Behavior in Large Network Telescopes. IEEE Transactions on Information Forensics and
Security, 2022. doi: 10.1109/TIFS.2022.3211644.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization, 2017.
[25] Shengbao Li, Yuchuan Huang, Tianye Gao, Lanqi Yang, Yige Chen, Quanbo Pan, Tianning Zang, and Fei Chen.
FusionTC: Encrypted App Traffic Classification Using Decision-Level Multimodal Fusion Learning of Flow
Sequence. Wireless Communications and Mobile Computing, 2023. doi: 10.1155/2023/9118153.
[26] Xinjie Lin, Gang Xiong, Gaopeng Gou, Zhen Li, Junzheng Shi, and Jing Yu. ET-BERT: A Contextualized
Datagram Representation with Pre-training Transformers for Encrypted Traffic Classification. In Proceedings of
the ACM Web Conference 2022, 2022. doi: 10.1145/3485447.3512217.
[27] Mohammad Lotfollahi, Mahdi Jafari Siavoshani, Ramin Shirali Hossein Zade, and Mohammdsadegh Saberian.
Deep packet: a novel approach for encrypted traffic classification using deep learning. Soft Computing, 2020. doi:
10.1007/s00500-019-04030-2.
[28] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in
Vector Space, 2013.
[29] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Rep-
resentations of Words and Phrases and their Compositionality. In Advances in Neural Infor-
mation Processing Systems, 2013. URL https://proceedings.neurips.cc/paper/2013/hash/
9aa42b31882ec039965f3c4923ce901b-Abstract.html.
[30] Preeti Mishra, Vijay Varadharajan, Uday Tupakula, and Emmanuel S. Pilli. A Detailed Investigation and Analysis
of Using Machine Learning Techniques for Intrusion Detection. IEEE Communications Surveys & Tutorials, 2019.
doi: 10.1109/COMST.2018.2847722.
[31] Alfredo Nascita, Antonio Montieri, Giuseppe Aceto, Domenico Ciuonzo, Valerio Persico, and Antonio Pescapé.
XAI Meets Mobile Traffic Classification: Understanding and Improving Multimodal Deep Learning Architectures.
IEEE Transactions on Network and Service Management, 2021. doi: 10.1109/TNSM.2021.3098157.
[32] Minh Nguyen, Gia H. Ngo, and Nancy F. Chen. Hierarchical Character Embeddings: Learning Phonological and
Semantic Representations in Languages of Logographic Origin Using Recursive Neural Networks. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 2020. doi: 10.1109/TASLP.2019.2955246.
[33] Keiron O’Shea and Ryan Nash. An Introduction to Convolutional Neural Networks, 2015.
[34] Fannia Pacheco, Ernesto Exposito, Mathieu Gineste, Cedric Baudoin, and Jose Aguilar. Towards the Deployment
of Machine Learning Solutions in Network Traffic Classification: A Systematic Survey. IEEE Communications
Surveys & Tutorials, 2019. doi: 10.1109/COMST.2018.2883147.
[35] Bo Pang, Yongquan Fu, Siyuan Ren, Siqi Shen, Ye Wang, Qing Liao, and Yan Jia. A Multi-Modal Approach
For Context-Aware Network Traffic Classification. In ICASSP 2023 - 2023 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2023. doi: 10.1109/ICASSP49357.2023.10095124.
[36] Shahbaz Rezaei and Xin Liu. Deep Learning for Encrypted Traffic Classification: An Overview. IEEE Communi-
cations Magazine, 2019. doi: 10.1109/MCOM.2019.1800819.
[37] Shahbaz Rezaei and Xin Liu. How to Achieve High Classification Accuracy with Just a Few Labels: A Semi-
supervised Approach Using Sampled Packets, 2020.
[38] Markus Ring, Alexander Dallmann, Dieter Landes, and Andreas Hotho. IP2Vec: Learning Similarities Between
IP Addresses. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 2017. doi:
10.1109/ICDMW.2017.93.
[39] Bushra Sabir, Faheem Ullah, M. Ali Babar, and Raj Gaire. Machine Learning for Detecting Data Exfiltration: A
Review. ACM Computing Surveys, 2021. doi: 10.1145/3442181.
[40] Amin Shahraki, Mahmoud Abbasi, Amir Taherkordi, and Mohammed Kaosar. Internet Traffic Classification Using
an Ensemble of Deep Convolutional Neural Networks. In Proceedings of the 4th FlexNets Workshop on Flexible
Networks Artificial Intelligence Supported Network Flexibility and Agility, 2021. doi: 10.1145/3472735.3473386.
[41] Keras Team. Keras documentation: Keras layers API. URL https://keras.io/api/layers/.
[42] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B. Tenenbaum, and Phillip Isola. Rethinking Few-Shot
Image Classification: A Good Embedding is All You Need? In Computer Vision – ECCV 2020, 2020. doi:
10.1007/978-3-030-58568-6_16.
[43] Laurens Van der Maaten and Geoffrey Hinton. Visualizing Data Using t-SNE. Journal of Machine Learning
Research, 2008.
[44] Ly Vu, Cong Thanh Bui, and Quang Uy Nguyen. A Deep Learning Based Method for Handling Imbalanced
Problem in Network Traffic Classification. In Proceedings of the 8th International Symposium on Information and
Communication Technology, 2017. doi: 10.1145/3155133.3155175.
[45] Wei Wang, Ming Zhu, Jinlin Wang, Xuewen Zeng, and Zhongzhen Yang. End-to-end encrypted traffic classification
with one-dimensional convolution neural networks. In 2017 IEEE International Conference on Intelligence and
Security Informatics (ISI), 2017. doi: 10.1109/ISI.2017.8004872.
[46] Wei Wang, Yiqiang Sheng, Jinlin Wang, Xuewen Zeng, Xiaozhou Ye, Yongzhong Huang, and Ming Zhu.
HAST-IDS: Learning Hierarchical Spatial-Temporal Features Using Deep Neural Networks to Improve Intrusion
Detection. IEEE Access, 2018. doi: 10.1109/ACCESS.2017.2780250.
[47] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A Comprehensive
Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 2021. doi:
10.1109/TNNLS.2020.2978386.
[48] Yang Yang, Yu Yan, Zhipeng Gao, Lanlan Rui, Rui Lyu, Bowen Gao, and Peng Yu. A Network Traffic
Classification Method Based on Dual-Mode Feature Extraction and Hybrid Neural Networks. IEEE Transactions
on Network and Service Management, 2023. doi: 10.1109/TNSM.2023.3262246.
[49] Ruijie Zhao, Mingwei Zhan, Xianwen Deng, Yanhao Wang, Yijun Wang, Guan Gui, and Zhi Xue. Yet Another
Traffic Classifier: A Masked Autoencoder Based Traffic Transformer with Multi-Level Flow Representation.
Proceedings of the AAAI Conference on Artificial Intelligence, 2023. doi: 10.1609/aaai.v37i4.25674.

A Background on Adaptation Modules

In this section, we provide some background about the architectures adopted as adaptation modules. Notice that we
describe the architectures in each subsection independently, thus repeating the mathematical notation for clarity within
its context.

A.1 Convolutional Neural Network (CNN)

A Convolutional Neural Network [33] operates through a series of layers designed to detect spatial hierarchies of
features within input data. Mathematically, the operations conducted within a CNN involve convolutions, nonlinear
activations, pooling, and fully connected layers, each contributing uniquely to the network’s ability to understand
intricate patterns and representations.
Consider a 2-D input sample (e.g., an image) with R rows and C columns, X ∈ R^{R×C}. The core operation in a CNN is
the convolution, which relies on a kernel K ∈ R^{n1×n2}.
The first stage of the convolution consists of obtaining a feature map F ∈ R^{(R−n1)×(C−n2)} by applying the kernel K on
the input sample X. Hence, the ij-th entry of the feature map is obtained as:

F[i, j] = (X ⊙ K)[i, j]
After the kernel has been applied to the full input sample, the convolution operation is completed by applying an
activation function ϕa. Hence, the output of the convolution is obtained as χ = ϕa(F).
After convolution, CNNs usually employ pooling layers, which downsample the feature map, reducing its dimension
while retaining important information. Max pooling, for instance, selects the maximum value within a defined window.
Pooling relies on two parameters: p, indicating the height and width of the pooling window, and the stride s, indicating
the number of values the pooling window moves at each step. Hence, the ij-th component of the pooled sample P ∈ R^{p1×p2}
is defined as P[i, j] = max(χ[i : i + p, j : j + p]), where p1 = (R − n1 − p)/s + 1 and p2 = (C − n2 − p)/s + 1.
After multiple passes of convolution and pooling, the resulting output sample is then flattened and processed through
one (or multiple) fully connected layers. In Figure 7 we provide an overview of a single convolution-pooling step for
one input sample.

Figure 7: Convolutional Neural Network overview.
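To make the operations above concrete, the following NumPy sketch implements a single convolution-pooling step following the formulas in this section; the input size, kernel values, ReLU activation, and stride are illustrative choices, not parameters taken from our models.

```python
# Minimal sketch (assumption): one convolution + max-pooling step, following the
# formulas above, with ReLU as the activation function phi_a and stride s.
import numpy as np

def conv2d(X, K):
    R, C = X.shape
    n1, n2 = K.shape
    F = np.empty((R - n1, C - n2))
    for i in range(R - n1):
        for j in range(C - n2):
            F[i, j] = np.sum(X[i:i + n1, j:j + n2] * K)   # (X ⊙ K)[i, j]
    return F

def max_pool(chi, p, s):
    R, C = chi.shape
    p1, p2 = (R - p) // s + 1, (C - p) // s + 1
    P = np.empty((p1, p2))
    for i in range(p1):
        for j in range(p2):
            P[i, j] = chi[i * s:i * s + p, j * s:j * s + p].max()
    return P

X = np.random.rand(28, 28)          # toy input sample
K = np.random.rand(3, 3)            # toy kernel
chi = np.maximum(conv2d(X, K), 0)   # chi = phi_a(F), here with ReLU
P = max_pool(chi, p=2, s=2)
print(chi.shape, P.shape)
```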

A.2 Word2Vec

Word2Vec [28, 29] is an NLP technique based on artificial neural networks. It maps the words (tokens) of text
sentences (corpora) into a latent space as real-valued arrays (the embeddings), such that words belonging to similar
contexts have similar embeddings.
The core element of the Word2Vec model is the context. It is defined as the sequence of words surrounding the one for
which the embedding must be generated. The number of words to consider in the context is specified by the context
window size c. For example, by considering the sentence ’Chicago is a great city’, if the word ’a’ is the target one and
c = 2, the context for ’a’ is the list of the 2 previous and 2 following words of ’a’:

’a’ → (’Chicago’, ’is’, ’great’, ’city’)

Figure 8: Word2Vec skip-gram architecture.

To generate the embedding, Word2Vec relies on two possible architectures: skip-grams and Continuous Bag Of Words
(CBOW). Since we work with the skip-gram architecture, for the sake of simplicity we omit the description of CBOW.
By considering a corpus with N distinct words, the model aims to predict the probability of finding each one of the N
words within the context window of a given target word. In Figure 8 we report an overview of the skip-gram architecture.
Each word of the sentences is fed as input to the model through a one-hot-encoded input layer. The E-dimensional
hidden layer links all the 2c context words to the target one. After training, the embedding is obtained from the weight
matrix W ∈ R^{N×E}: each of the i ∈ {1, . . . , N} rows of W is the embedding in R^{1×E} associated with the
i-th word.
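As an example of how such entity embeddings can be obtained in practice, the sketch below trains a skip-gram Word2Vec model with gensim on toy sequences of tokens; the corpus, context window, and embedding size are placeholders and do not reflect the exact settings used in the paper.

```python
# Minimal sketch (assumption): skip-gram Word2Vec on toy token sequences, where
# entities such as IP addresses observed together play the role of words in a sentence.
from gensim.models import Word2Vec

corpus = [
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    ["10.0.0.2", "10.0.0.3", "192.168.0.1"],
    ["10.0.0.1", "192.168.0.1", "10.0.0.3"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=64,   # embedding size E (placeholder)
    window=2,         # context window size c (placeholder)
    sg=1,             # 1 = skip-gram architecture
    min_count=1,
    epochs=20,
)

print(model.wv["10.0.0.1"].shape)          # (64,) embedding of one token
print(model.wv.most_similar("10.0.0.1"))   # tokens appearing in similar contexts
```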

A.3 Gated Recurrent Unit (GRU)

Gated Recurrent Units (GRUs) [7] are a particular type of Recurrent Neural Network (RNN), a family of neural networks
designed for handling sequences of data. Unlike traditional models, GRUs have built-in mechanisms that help them
decide what information to remember or forget, making them well suited for capturing patterns in sequences such as text,
audio, or time series data.
The GRU maintains a hidden state to keep track of past information and merges it with new incoming data producing a
latent representation of the input data. Formally, let’s consider a sequence of data extracted from a dynamic system
{X_t}_{t=0}^{T}, of length T. At each time interval t, the GRU receives as input a sample x_t ∈ R^F, where F is the number of
features,7 and the previous hidden state h_{t−1} ∈ R^E, where E is the size of the output vector provided by the GRU.

Figure 9: Gated Recurrent Unit architecture.
It then transforms the two inputs into the new hidden state h_t ∈ R^E through two gates, the update and reset gate.
The update gate (output c_t ∈ R^E) is used to balance the influence of the past hidden state and the newly acquired
information on the output. In this way, the GRU provides a direct connection between initial input and output, which
helps modelling long-term dependencies and improves gradient flow. The reset gate r_t ∈ R^E modulates the contribution
of the past hidden state to the newly added information.
Formally,

c_t = σ(W_c x_t + U_c h_{t−1})
r_t = σ(W_r x_t + U_r h_{t−1})
h′_t = tanh(W x_t + U (r_t ⊙ h_{t−1}))
h_t = c_t ⊙ h_{t−1} + (1 − c_t) ⊙ h′_t

where W_c, W_r, W ∈ R^{E×F} and U_c, U_r, U ∈ R^{E×E} are learnable matrices, ⊙ is the element-wise product, and σ is
the sigmoid function. For the sake of completeness, we report in Figure 9 the architecture of a single GRU cell.
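To connect the equations to code, here is a NumPy sketch of a single GRU step following the four equations above; the random initialisation and the toy sizes are placeholders, not trained values.

```python
# Minimal sketch (assumption): one GRU step implementing the update/reset-gate
# equations above, with randomly initialised (untrained) weight matrices.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wc, Uc, Wr, Ur, W, U):
    c_t = sigmoid(Wc @ x_t + Uc @ h_prev)           # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)           # reset gate
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate hidden state h'_t
    return c_t * h_prev + (1.0 - c_t) * h_cand      # new hidden state h_t

F, E = 4, 8                                         # toy input and hidden sizes
rng = np.random.default_rng(0)
Wc, Wr, W = (rng.normal(size=(E, F)) for _ in range(3))
Uc, Ur, U = (rng.normal(size=(E, E)) for _ in range(3))

h = np.zeros(E)
for x in rng.normal(size=(5, F)):                   # a toy sequence of 5 samples
    h = gru_step(x, h, Wc, Uc, Wr, Ur, W, U)
print(h.shape)                                      # (8,)
```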

B Architectures of the Adaptation Modules


We now complement the description of the adaptation modules with details on their architectures and implementation.
In Table 6 we report the implementation of each adaptation module. For the sake of simplicity, we report the layer names
as provided by the Keras API [41]. We use ReLU as the activation function for all hidden layers.
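As an illustration of how these specifications translate into code, the sketch below builds the encoder of the Sequences adaptation module with the Keras layers listed in Table 6; the input shape (32 packets × 4 per-packet features) is our assumption for the example and may differ from the actual datasets.

```python
# Minimal sketch (assumption): encoder of the "Sequences" adaptation module, following
# the Keras layers listed in Table 6 (Conv1D -> MaxPooling1D -> Flatten -> Dense l1).
from tensorflow import keras
from tensorflow.keras import layers

l1 = 32                       # adaptation space size (default value in Table 6)
seq_len, n_feats = 32, 4      # assumed input shape: 32 packets x 4 features

encoder = keras.Sequential([
    layers.Input(shape=(seq_len, n_feats)),
    layers.Conv1D(filters=32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(l1, activation="relu"),
])
encoder.summary()
```

Incidentally, with this assumed input shape the Flatten layer yields 480 values, which is consistent with the Dense 480 layer listed for the decoder in Table 6.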

C MAE parameter tuning


We have seen in Section 3 that the MAE is defined by multiple parameters (e.g., l1, . . . , l4), which are summarised
in Table 6. When designing the MAE, we investigate the impact of these parameters using a classic grid-search
methodology. Here we provide a summary of the results.
Figure 10 summarises the impact of the adaptation space size l1 and the bottleneck size l4 for T2. Figure 10a shows how
the macro average F1-Score is impacted by these parameters, while Figure 10b provides the size of the obtained
autoencoder, defined as the number of internal weights and biases (i.e., trainable parameters). We see that varying l1
and l4 has a minor impact on the average F1-Score. The best result (l1 = 32 and l4 = 64) is 87% for T2, which is
only marginally better than the worst-tested cases. Figure 10b instead shows that these parameters have a large impact
on the size of the autoencoder, thus calling for moderation when picking these dimensions.8
Overall, we have not identified a major impact on F1-Scores when varying these parameters in all three downstream
tasks. As our goal is to produce a representation that is generic and reusable for multiple problems, we argue that the
MAE should not be over-optimised for a single downstream task. Indeed, generic parameters that fit multiple tasks
well, as above, should be preferred.
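For illustration, the sketch below enumerates (l1, l4) combinations and reports the number of trainable parameters of a toy integration layer, mirroring the kind of sweep behind Figure 10b; the number of modalities and the layer structure are simplified placeholders, not the full MAE.

```python
# Minimal sketch (assumption): sweep over (l1, l4) and count trainable parameters of
# a toy integration layer; the real MAE has more layers, so absolute numbers differ.
from itertools import product
from tensorflow import keras
from tensorflow.keras import layers

def toy_integration(n_modalities: int, l1: int, l4: int) -> keras.Model:
    # Each modality contributes an l1-dimensional adapted vector; the integration
    # module compresses their concatenation into the l4-dimensional bottleneck.
    inp = layers.Input(shape=(n_modalities * l1,))
    out = layers.Dense(l4, activation="relu")(inp)
    return keras.Model(inp, out)

for l1, l4 in product([16, 32, 64, 128], [16, 32, 64, 128]):
    model = toy_integration(n_modalities=5, l1=l1, l4=l4)   # 5 modalities is a placeholder
    print(f"l1={l1:<4} l4={l4:<4} trainables={model.count_params()}")
```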

7 Notice that the GRU can process a matrix sized N × F, where N is the number of samples. For the sake of simplicity, we here
report the formulation for only one sample. The generalisation is straightforward.
8 Training and validating one model with the DARKNET dataset (44k samples) takes around 5 minutes, using 150 epochs on a single
Nvidia Tesla V100 GPU with 16GB of memory.

Table 6: Architectures of the adaptors and encoders

Feature                          Stage      Layer Type     Units   Note
Subnet                           Encoding   GRU            32      Return sequences = True
                                            GRU            32      Return sequences = False
                                            Dense          64
                                            Dense          l1      By default l1 = 32
                                 Decoding   Dense          l1      By default l1 = 32
                                            Dense          64
                                            RepeatVector   –       4 times
                                            GRU            32      Return sequences = True
                                            Dense          1       TimeDistributed
Payload                          Encoding   Embedding      64      Vocabulary size: 257
                                            Masking        –       Padding value: 0
                                            GRU            64      Return sequences = True
                                            GRU            32      Return sequences = False
                                            Dense          64
                                            Dense          l1      By default l1 = 32
                                 Decoding   Dense          l1      By default l1 = 32
                                            Dense          64
                                            RepeatVector   –       32 times
                                            GRU            64      Return sequences = True
                                            Dense          1       TimeDistributed
Sequences                        Encoding   Conv1D         –       Filters: 32; Kernel size: 3
                                            MaxPooling1D   –       Pool size: 2
                                            Flatten        –
                                            Dense          l1      By default l1 = 32
                                 Decoding   Dense          l1      By default l1 = 32
                                            Dense          480
                                            Conv1D         –       Filters: 32; Kernel size: 3
                                            UpSampling1D   –       Size = 2
                                            Conv1D         –       Filters: 4; Kernel size: 3
                                            UpSampling1D   –       Size = 2
                                            Dense          32x4
IP address, Ports,               Encoding,  Dense          l1      By default l1 = 32
Statistics                       Decoding

(a) Macro F1-Score. (b) Size of the neural network (encoder), in number of trainable parameters.

Figure 10: Impact of parameter choices, as a function of the adaptation space size l1 and the bottleneck size l4.
