
ITC-Net-Blend-60: A Comprehensive Dataset for Robust Network Traffic Classification in Diverse Environments

Authors

Marziyeh Bayat1, Javad Garshasbi1, Mozhgan Mehdizadeh1, Neda Nozari1, Abolghasem Rezaei Khesal1,
Maryam Dokaei1, Mehdi Teimouri1

Affiliations

1. Information Theory and Coding Laboratory, University of Tehran, Tehran, Iran

Corresponding author(s)

Mehdi Teimouri (mehditeimouri@ut.ac.ir)


Abstract

Recognizing mobile applications in encrypted network traffic has important implications for network
administration, security, and digital marketing. Building network traffic classifiers that adapt to
dynamic and unpredictable real-world settings is a major challenge. Currently available datasets contain
traffic from a single network environment only, which limits their usefulness for evaluating the
resilience and adaptability of a model. To overcome this limitation, we gathered a network traffic
dataset from over 50 Android applications in five network scenarios. The dataset includes 1,163 PCAP
traces containing 37 GB of network traffic data. It is more representative of real-world network traffic
and can serve as a valuable resource for developing classifiers that remain robust in real-world
network environments.

Keywords

Network Traffic Analysis; Traffic Classification; Application Identification; Mobile-app Fingerprinting;
Encrypted Traffic; Android Applications; Robustness; Various Network Environments; Raw Labeled Dataset

This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4661706
Specifications Table

Subject: Computer Networks and Communications

Specific subject area: Network Traffic Analysis

Type of data: Packet Capture (Pcap)

How data were acquired:
Hardware:
- Laptop with dual-band network card
- Smartphone
Software:
- Wireshark
- PCAPdroid
- Proton VPN

Data format: Raw

Parameters for data collection: We conducted the data collection process under five different scenarios.
While keeping the applications the same across all scenarios, each scenario varied in terms of ISP,
location, device (vendor, model, OS version), application version, and user.

Description of data collection: To collect the traffic data, several experiments were conducted in which
participants used a single app on a specific smartphone for 3 to 15 minutes. The network traffic
generated during each experiment was captured using Wireshark and PCAPdroid simultaneously, and the
background traffic was then filtered out.

Data source location:
City/Town/Region: Tehran, Karaj, Qom
Country: Iran
Data accessibility:

Repository name: ITC-Net-Blend-60- Scenario A
Data identification number: 10.17632/ssv23kfcgs.2
Direct URL to data: https://data.mendeley.com/datasets/ssv23kfcgs/2

Repository name: ITC-Net-Blend-60- Scenario B
Data identification number: 10.17632/3zggb53m4x.2
Direct URL to data: https://data.mendeley.com/datasets/3zggb53m4x/2

Repository name: ITC-Net-Blend-60- Scenario C
Data identification number: 10.17632/gp8r347j38.2
Direct URL to data: https://data.mendeley.com/datasets/gp8r347j38/2

Repository name: ITC-Net-Blend-60- Scenario D
Data identification number: 10.17632/mcmf627yh5.2
Direct URL to data: https://data.mendeley.com/datasets/mcmf627yh5/2

Repository name: ITC-Net-Blend-60- Scenario E
Data identification number: 10.17632/gdtnnfyr7s.2
Direct URL to data: https://data.mendeley.com/datasets/gdtnnfyr7s/2

Repository name: ITC-Net-Blend-60- Supplementary Materials
Data identification number: 10.17632/4sgt9tjs4w.3
Direct URL to data: https://data.mendeley.com/datasets/4sgt9tjs4w/3

Related research article: A. R. Khesal and M. Teimouri, “The Effect of Network Environment on Traffic
Classification,” in Proc. ICCKE 2022, 17-18 Nov. 2022, Mashhad, pp. 59-64.
DOI: 10.1109/ICCKE57176.2022.9960138.

Value of the Data


- Why are these data useful?

One of the key challenges in this field is developing robust models that perform well in real network
environments [1, 2]. As indicated in Table A1 of Appendix A, existing datasets are mainly obtained from
a single, invariant network environment, which makes them unsuitable for evaluating model robustness.
Our dataset was captured across five different network scenarios that vary in factors known to affect
network traffic behavior [3, 4]. Furthermore, the data were generated through real human interactions
on actual smartphones and captured without rooting the devices. This makes the data more representative
and suitable for mobile app traffic analysis.

- Who can benefit from these data?

These data can provide valuable insights to those interested in improving network performance and
understanding mobile app users’ behavior. Potential beneficiaries include Internet service providers
(ISPs), network administrators, marketing companies, and security agencies [1, 3, 5].

- How can these data be used for further insights and development of experiments?

This large and diverse dataset can be used to develop and evaluate a wide range of network traffic
analysis tools. Because the data are released in raw format (PCAP files), researchers are free to design
models based on any traffic object (e.g., packet, flow, or bag of flows), feature, method, or even
entirely novel approaches.

Importantly, the network scenarios in the dataset can be used to evaluate model robustness and
generalization. We recommend training models on four scenarios and testing them on the remaining one,
repeating this process so that each scenario serves once as the test set, and reporting the results
using the formula described in reference [6]. This cross-validation process will yield valuable
insights into how well models generalize to different network conditions.
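
The recommended train-on-four, test-on-one protocol can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `leave_one_scenario_out` and the example file names are assumptions made here, and the aggregation formula from reference [6] is deliberately not reproduced.

```python
from typing import Dict, Iterator, List, Tuple

SCENARIOS = ["A", "B", "C", "D", "E"]


def leave_one_scenario_out(
    traces_by_scenario: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_scenario, train_traces, test_traces), one fold per scenario."""
    for held_out in sorted(traces_by_scenario):
        # Training set: every trace from the other scenarios.
        train = [
            trace
            for scenario, traces in sorted(traces_by_scenario.items())
            if scenario != held_out
            for trace in traces
        ]
        # Test set: only traces from the held-out scenario.
        test = list(traces_by_scenario[held_out])
        yield held_out, train, test


# Hypothetical trace names that follow the dataset's naming convention.
example = {
    s: [f"Telegram_{s}_1_Final.pcap", f"Telegram_{s}_2_Final.pcap"]
    for s in SCENARIOS
}
folds = list(leave_one_scenario_out(example))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

For a robustness study, the input would presumably be restricted to the applications shared by all five scenarios, so that every class seen at test time was also seen during training.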

Data

This dataset is organized into separate repositories, one per scenario. Each repository includes a
dedicated compressed file for every application, and each compressed file contains the corresponding
PCAP files. The dataset structure is shown in Figure 1. File naming is consistent across the entire
dataset. All PCAP files are named using the following format:
{Application Name}_{Scenario ID}_{#Trace}_Final.pcap

Dataset:
|- Scenario A
|-- App 1.rar
|--- App1_A_1_Final.pcap
|--- App1_A_2_Final.pcap
|--- ...
|-- App 2.rar
|-- ...
|- Scenario B
|- Scenario C
|- Scenario D
|- Scenario E

Figure 1. Structure of the dataset
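
Because every trace follows the same naming format, the application, scenario, and trace number can be recovered from the file name alone. The sketch below assumes that scenario IDs are the single letters A to E and that trace numbers are decimal integers; `TraceName` and `parse_trace_name` are hypothetical helper names, not part of the published supplementary code.

```python
import re
from typing import NamedTuple, Optional


class TraceName(NamedTuple):
    app: str
    scenario: str
    trace: int


# {Application Name}_{Scenario ID}_{#Trace}_Final.pcap
# The greedy first group lets application names contain underscores or spaces.
_PATTERN = re.compile(
    r"^(?P<app>.+)_(?P<scenario>[A-E])_(?P<trace>\d+)_Final\.pcap$"
)


def parse_trace_name(filename: str) -> Optional[TraceName]:
    """Return the parsed fields, or None if the name does not match the format."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    return TraceName(
        match.group("app"), match.group("scenario"), int(match.group("trace"))
    )


parsed = parse_trace_name("App1_A_2_Final.pcap")
print(parsed)
```

Names that do not match (for example, files with a different suffix) return `None`, which makes it easy to skip unrelated files when walking an extracted archive.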

The whole dataset consists of 1,163 PCAP traces capturing 37 GB of network traffic data. Table 1 provides
more details about the dataset and each scenario (the code and documents are available in the
supplementary material).

Scenario ID | No. Apps | No. Pcap Files | No. Packets | No. Bi-Flows* | Sum Capture Duration (h) | Deficiency
A | 59 | 233 | 13,823,338 | 108,370 | 27.94 | Whatsapp Business
B | 60 | 236 | 5,862,397 | 72,279 | 19.42 | -
C | 59 | 235 | 13,689,836 | 141,957 | 38.36 | Adobe Connect
D | 59 | 248 | 12,133,832 | 106,652 | 38.53 | Whatsapp Business
E | 52 | 207 | 2,726,719 | 47,044 | 15.06 | Discord, Dropbox, LinkedIn, Microsoft Outlook, Telewebion, Twitter, Waze, Whatsapp Business
Total | 60 | 1159 | 48,236,122 | 476,302 | 139.31 |
* The threshold of flows is set to one second.

Table 1. The dataset specifications
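
One possible reading of the bi-flow count and its one-second threshold is sketched below. It assumes that a bi-flow groups packets sharing the same protocol and endpoint pair regardless of direction, and that a new bi-flow starts whenever the gap since the previous packet on that pair exceeds the threshold. This interpretation, and the names `Packet`, `biflow_key`, and `count_biflows`, are assumptions made for illustration; the authors' exact definition is in the supplementary material.

```python
from typing import Dict, Iterable, NamedTuple, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class Packet(NamedTuple):
    ts: float        # capture timestamp in seconds
    proto: str       # e.g. "TCP" or "UDP"
    src: Endpoint
    dst: Endpoint


def biflow_key(p: Packet) -> Tuple[str, Endpoint, Endpoint]:
    """Direction-independent key: both directions map to the same tuple."""
    a, b = sorted([p.src, p.dst])
    return (p.proto, a, b)


def count_biflows(packets: Iterable[Packet], timeout: float = 1.0) -> int:
    """Count bi-flows, starting a new one after an idle gap longer than timeout."""
    last_seen: Dict[Tuple[str, Endpoint, Endpoint], float] = {}
    flows = 0
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        key = biflow_key(p)
        previous = last_seen.get(key)
        if previous is None or p.ts - previous > timeout:
            flows += 1
        last_seen[key] = p.ts
    return flows


# Hypothetical packets: two directions of one connection, an idle gap, and a DNS query.
example = [
    Packet(0.0, "TCP", ("10.0.0.2", 40000), ("1.1.1.1", 443)),
    Packet(0.5, "TCP", ("1.1.1.1", 443), ("10.0.0.2", 40000)),
    Packet(2.0, "TCP", ("10.0.0.2", 40000), ("1.1.1.1", 443)),
    Packet(0.1, "UDP", ("10.0.0.2", 40001), ("8.8.8.8", 53)),
]
n_flows = count_biflows(example)
print(n_flows)
```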

It is important to note that human errors during the data collection process led to the corruption of
some application files. As a result, only 51 of the initial 60 applications remained consistent across
all scenarios. We decided to preserve and publish the non-shared applications because they remain useful
for purposes other than robustness evaluation. These applications can be found in a folder named
"non-shared" within the repositories.
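
The figure of 51 shared applications follows directly from the Deficiency column of Table 1, as the short check below shows. The variable names are chosen here for illustration only.

```python
TOTAL_APPS = 60

# Applications missing from each scenario, copied from the Deficiency column of Table 1.
missing = {
    "A": {"Whatsapp Business"},
    "B": set(),
    "C": {"Adobe Connect"},
    "D": {"Whatsapp Business"},
    "E": {
        "Discord", "Dropbox", "LinkedIn", "Microsoft Outlook",
        "Telewebion", "Twitter", "Waze", "Whatsapp Business",
    },
}

# Applications available in each scenario.
apps_per_scenario = {s: TOTAL_APPS - len(m) for s, m in missing.items()}

# An application is non-shared if it is missing from at least one scenario.
non_shared = set().union(*missing.values())
shared_count = TOTAL_APPS - len(non_shared)

print(apps_per_scenario)
print(len(non_shared), shared_count)
```

Nine distinct applications are missing from at least one scenario (the eight missing from Scenario E plus Adobe Connect), which leaves 60 - 9 = 51 applications common to all five scenarios.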

Experimental Design, Materials, and Methods

The methodology employed for collecting the dataset comprised three main stages: Application Selection,
Traffic Capture Setup, and Traffic Generation. In the application selection phase, we chose the
applications to monitor. Next, in the traffic capture setup phase, we set up the framework to capture
network traffic data. Finally, in the traffic generation phase, we generated the actual network traffic
data from the selected applications. Each phase is described in the following subsections.

Application Selection

Since it is not feasible to capture the traffic of all applications, we considered the top 300 free
Android apps listed in October 2021 in the Google Play Store and two major Iranian Android app markets,
Cafe Bazaar¹ and Myket². From these, we selected a subset of 60 applications based on two criteria:

1. The application's main activity relied on an Internet connection
2. The application generated traffic through user interactions

The chosen applications spanned 16 different categories. The complete list of the selected 60
applications and additional information can be found in Table A2 in Appendix A.

Traffic Capture Setup


As depicted in Figure 2, our traffic capture setup consisted of a smartphone and a laptop. The laptop ran
Windows 10, had an internal dual-band network card, and had Wireshark installed. It was connected to the
Internet and shared its connection with the smartphone via a hotspot. We then configured Wireshark to
capture traffic on the "Local Area Connection" interface. This setup enabled the smartphone to access
the Internet through the laptop's connection, allowing Wireshark to capture the smartphone's network
traffic.

1 Cafe Bazaar. Available online: https://cafebazaar.ir/app/ (accessed on 21 May 2023).
2 Myket. Available online: https://myket.ir/ (accessed on 21 May 2023).


Figure 2. Traffic capture setup



The traffic captured by Wireshark contained significant background traffic [7]. To isolate the target
application's network traffic, we installed PCAPdroid on the smartphone. Since root access could modify
application behavior, we used PCAPdroid in non-root mode. In this mode, PCAPdroid does not use a remote
VPN server; instead, it simulates a VPN to capture the network traffic and processes data locally on
the device.

While PCAPdroid can capture an individual application's traffic, it modifies layers 3 and 4 of the
packets, so its capture cannot be used on its own. Therefore, we used Wireshark and PCAPdroid
simultaneously to record the target application's traffic. Wireshark provided an unaltered packet
capture, while PCAPdroid isolated the traffic originating from the application. The two tools
complemented each other to provide an accurate capture of the traffic.

Unfortunately, several applications restrict access to their servers from Iran, so we had to use a VPN
connection to run them properly. These applications are identified in Table A2. For this purpose, we
installed the free version of ProtonVPN on the laptop, set its protocol to OpenVPN-UDP, and set the VPN
network driver to the TAP adapter. To ensure that the recorded packets were not modified by the VPN, we
captured traffic on the "Local Area Connection" interface, before the packets entered the VPN
connection (Figure 3). As a result, all application traffic was accurately recorded, including traffic
generated by applications that require a VPN connection.


Figure 3. Capture Setup with VPN



After collecting the traffic data, we separated the target application traffic from the background
traffic through a pairwise comparison of the IP addresses and ports captured by Wireshark and PCAPdroid.
Specifically, for each trace, we compared each IP address and port pair captured by Wireshark with all
pairs captured by PCAPdroid. Any pair in the Wireshark capture that did not match a PCAPdroid pair was
identified as background traffic and removed from the Wireshark data. In this way, we eliminated
irrelevant traffic and obtained ground truth without requiring root privileges on the mobile phones.

We implemented this method in Python 3 using the Scapy library. The code for this implementation is
available in the Supplementary material.
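
The matching step can be illustrated with a simplified sketch. It is not the authors' implementation: it assumes that the pair being compared is the remote (server-side) IP address and port, since PCAPdroid rewrites the device-side headers, and it works on plain tuples rather than on Scapy packets. The names `Packet`, `remote_endpoints`, and `filter_background`, as well as all addresses, are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def remote_endpoint(p: Packet, local_ip: str) -> Optional[Endpoint]:
    """Return the non-local (IP, port) of a packet, or None if local_ip is not involved."""
    if p.src_ip == local_ip:
        return (p.dst_ip, p.dst_port)
    if p.dst_ip == local_ip:
        return (p.src_ip, p.src_port)
    return None


def remote_endpoints(packets: Iterable[Packet], local_ip: str) -> Set[Endpoint]:
    """Collect every remote endpoint seen in the per-app (PCAPdroid) capture."""
    found = (remote_endpoint(p, local_ip) for p in packets)
    return {e for e in found if e is not None}


def filter_background(
    wireshark_pkts: Iterable[Packet], phone_ip: str, app_endpoints: Set[Endpoint]
) -> List[Packet]:
    """Keep only Wireshark packets whose remote endpoint also appears in PCAPdroid."""
    return [
        p for p in wireshark_pkts
        if remote_endpoint(p, phone_ip) in app_endpoints
    ]


# Hypothetical captures: the phone has one address on the hotspot, another inside PCAPdroid.
wireshark = [
    Packet("192.168.137.5", 50000, "149.154.167.50", 443),  # target app
    Packet("149.154.167.50", 443, "192.168.137.5", 50000),  # reply to the app
    Packet("192.168.137.5", 50001, "8.8.8.8", 53),           # background
]
pcapdroid = [Packet("10.215.173.1", 50000, "149.154.167.50", 443)]

app_endpoints = remote_endpoints(pcapdroid, local_ip="10.215.173.1")
kept = filter_background(wireshark, "192.168.137.5", app_endpoints)
print(len(kept))
```

With real traces, the tuples would be built from the two PCAP files (for instance by reading them with Scapy's `rdpcap`), and the retained packets would be written back to a new PCAP file.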

Traffic Generation

The traffic generation process was carried out by five volunteers from ITCLAB over six weeks, from
October to December 2021. Each volunteer collected traffic in a different network scenario, as outlined
in Table 2.

Scenario ID | User | Device Vendor | Device Model | Android version | Location* | ISP*
A | U1 | Xiaomi | Note10 Pro | 11 | L1, L2 | N1, N2
B | U2 | Samsung | A50 | 11 | L1, L3 | N1, N3
C | U3 | Samsung | A31 | 11 | L4 | N2, N4
C | U3 | Samsung | Tab A7 Lite | 11 | L4 | N2, N4
D | U4 | Samsung | J7 Prime 2 | 9 | L1, L2, L5 | N1, N2, N5
E | U5 | Samsung | J7 | 6.0.1 | L6 | N2, N6
E | U5 | Samsung | A12 | 11 | L6 | N2, N6

* L1 = ITC Lab; L2 = District 5, Tehran; L3 = District 11, Tehran; L4 = Qom; L5 = Karaj; L6 = District 8, Tehran
  N1 = University of Tehran; N2 = TCI; N3 = AsiaTech; N4 = NTC; N5 = Shatel; N6 = MCI

Table 2. Scenarios Specifications

Ethical principles were strictly adhered to: the volunteers were well informed about the objectives of
the traffic capture and the public release of the data in Pcap format. Additionally, they were allowed
to use their personal devices and had complete discretion over sharing their data.

Before commencing the data collection process, the volunteers received training on how to collect
traffic. Each volunteer was required to conduct at least three experiments for every application, with
each experiment consisting of interacting with a single app on a specific smartphone for 3 to 15
minutes.

The volunteers were instructed to use the application as they normally would and to explore its
functionalities. For applications that required login, they were given the option to either create new
accounts or use their personal ones.



Acknowledgments

We wish to express our deepest gratitude to Mohammad Reza Tajzad for his invaluable insights and
expertise, which significantly contributed to the success of our research. We also thank Fatemeh Delroba
for her assistance in the data collection process. Additionally, we would like to extend our heartfelt
thanks to Parastoo Soori for her helpful comments and suggestions on an earlier version of this
manuscript, which greatly improved it. Finally, we sincerely thank all the participants in this study
for their time and willingness to share their experiences. Without their contributions, our study would
not have been possible.

References

[1] T. van Ede et al., "FlowPrint: Semi-supervised mobile-app fingerprinting on encrypted network
traffic," in Network and Distributed System Security Symposium (NDSS), 2020, vol. 27.

[2] [dataset] W. Li, X.-Y. Zhang, H. Bao, H. Shi, and Q. Wang, "ProGraph: Robust Network Traffic
Identification With Graph Propagation," IEEE/ACM Transactions on Networking, 2022.

[3] [dataset] V. F. Taylor, R. Spolaor, M. Conti, and I. Martinovic, "Robust smartphone app
identification via encrypted network traffic analysis," IEEE Transactions on Information Forensics
and Security, vol. 13, no. 1, pp. 63-78, 2017.

[4] H. F. Alan and J. Kaur, "Can Android applications be identified using only TCP/IP headers of their
launch time traffic?," in Proceedings of the 9th ACM conference on security & privacy in wireless
and mobile networks, 2016, pp. 61-66.

[5] G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, "Mobile encrypted traffic classification using
deep learning: Experimental evaluation, lessons learned, and challenges," IEEE Transactions on
Network and Service Management, vol. 16, no. 2, pp. 445-458, 2019.

[6] X. Gui, Y. Cao, I. You, L. Ji, Y. Luo, and Z. Luo, "A Survey of techniques for fine-grained web
traffic identification and classification," Mathematical Biosciences and Engineering, vol. 19, no. 3,
pp. 2996-3021, 2022.

[7] T. Stöber, M. Frank, J. Schmitt, and I. Martinovic, "Who do you sync you are? smartphone
fingerprinting via application behaviour," in Proceedings of the sixth ACM conference on Security
and privacy in wireless and mobile networks, 2013, pp. 7-12.

[8] [dataset] J. S. Rojas, Á. R. Gallón, and J. C. Corrales, "Personalized service degradation policies on
OTT applications based on the consumption behavior of users," in Computational Science and Its
Applications–ICCSA 2018: 18th International Conference, Melbourne, VIC, Australia, July 2–5, 2018,
Proceedings, Part III 18, 2018: Springer, pp. 543-557.

[9] [dataset] R. Wang, Z. Liu, Y. Cai, D. Tang, J. Yang, and Z. Yang, "Benchmark data for mobile app
traffic research," in Proceedings of the 15th EAI International Conference on Mobile and
Ubiquitous Systems: Computing, Networking and Services, 2018, pp. 402-411.

[10] [dataset] M. Lindorfer, M. Neugschwandtner, L. Weichselbaum, Y. Fratantonio, V. Van Der Veen,
and C. Platzer, "Andrubis--1,000,000 apps later: A view on current Android malware behaviors,"
in 2014 third international workshop on building analysis datasets and gathering experience
returns for security (BADGERS), 2014: IEEE, pp. 3-17.

[11] [dataset] G. Aceto, D. Ciuonzo, A. Montieri, V. Persico, and A. Pescapé, "MIRAGE: Mobile-app
traffic capture and ground-truth creation," in 2019 4th International Conference on Computing,
Communications and Security (ICCCS), 2019: IEEE, pp. 1-8.

[12] [dataset] J. Ren, D. Dubois, and D. Choffnes, "An International View of Privacy Risks for Mobile
Apps," ed, 2019.

[13] [dataset] Y. Heng, V. Chandrasekhar, and J. G. Andrews, "UTMobileNetTraffic2021: A Labeled
Public Network Traffic Dataset," IEEE Networking Letters, vol. 3, no. 3, pp. 156-160, 2021.

[14] PCAPdroid. Available online: https://github.com/emanuele-f/PCAPdroid/ (accessed on 21 May 2023).

[15] ProtonVPN. Available online: https://protonvpn.com/ (accessed on 21 May 2023).

Appendix A

This appendix includes two tables: Table A1 compares the available datasets, and Table A2 lists the
selected applications.

Mobile No. Real Capture No. Network No. Shared
No. No. Released
Dataset Apps Human Capture Span Session Environments Apps across
Apps Device Data
Traffic Users Duration
Traffic
Unicauca [8] 78 Six days in 2017 1
features
Traffic
Mobilegt [9]  12 10 October, 2016 - March, 2017 16 min 1
features

- Raw (Pcap
Andrubis [10]  1M 2012.06.13 - 2016.03.25 1
(simulation) files)
5 - 10 Traffic
Mirage [11]  40 280 1 May 2017 - May 2019 1
min features
Raw (Pcap
Cross Market [12]  400 2017.08.28 - 2017.11.20 3 10
files)

Raw (Pcap
CrossNet2021 [2] 20 2 20
files)
- Traffic
Appscanner [3]  110 2 30 min 8 65

UTMobileNet2021[13]

ITC-Net-Blend-60


16

60
(simulation)
-
(simulation)
5
3

7
October - December 2021
3 - 15
min
features
Raw (Pcap
files)
Raw (Pcap
files)
3
(not itemized)
5
16

52
* Cells are left blank where the authors of a dataset did not provide the corresponding information.

Table A1. Summary of Available Datasets



Category | Application Name | Package Name | VPN Requirement | Metadata
Books & Reference | Fidibo | com.fidibo.app |  | Link
Books & Reference | Taghche | ir.mservices.mybook |  | Link
Books & Reference | Goodreads | com.goodreads |  | Link
Business | Google Meet | com.google.android.apps.meetings |  | Link
Communication | Gmail | com.google.android.gm |  | Link
Communication | Microsoft Outlook | com.microsoft.office.outlook |  | Link
Communication | Skype | com.skype.raider |  | Link
Communication | Google Chrome | com.android.chrome |  | Link
Communication | Firefox Browser | org.mozilla.firefox |  | Link
Communication | Whatsapp Messenger | com.whatsapp |  | Link
Communication | Telegram | org.telegram.messenger |  | Link
Communication | Whatsapp Business | com.whatsapp.w4b |  | Link
Communication | iGap | net.iGap |  | Link
Communication | Soroush Plus Messenger | mobi.mmdt.ottplus |  | Link
Communication | Eitaa | ir.eitaa.messenger |  | Link
Education | Adobe Connect | air.com.adobe.connectpro |  | Link
Education | Duolingo | com.duolingo |  | Link
Education | Memrise | com.memrise.android.memrisecompanion |  | Link
Education | Coursera | org.coursera.android |  | Link
Entertainment | Telewebion | net.telewebion |  | Link
Entertainment | Youtube | com.google.android.youtube |  | Link
Entertainment | Aparat | com.aparat |  | Link
Entertainment | Filimo | com.aparat.filimo |  | Link
Finance | Ewano | com.ebcom.ewano |  | Link
Finance | AP - Asan Pardakht | com.sibche.aspardproject.app |  | Link
Game | Clash of Clans | com.supercell.clashofclans |  | Link
Game | Football Strike | com.miniclip.footballstrike |  | Link
Game | Mencherz | com.incyteltech.mencherz |  | Link
Game | Quiz Of Kings | co.palang.QuizOfKingss |  | Link
Game | Mafioso | com.herocraft.game.mafioso.gangster.paradise.pvp |  | Link
Lifestyle | Pinterest | com.pinterest |  | Link
Maps & navigation | Snapp | cab.snapp.passenger.play |  | Link
Maps & navigation | Balad | com.baladmaps |  | Link
Maps & navigation | Neshan | org.rajman.neshan.traffic.tehran.navigator |  | Link
Maps & navigation | Tapsi | taxi.tapsi.passenger |  | Link
Maps & navigation | Google Maps | com.google.android.apps.maps |  | Link
Maps & navigation | Waze | com.waze |  | Link
Music & Audio | Radio Javan | com.radiojavan.androidradio |  | Link
Music & Audio | Castbox | fm.castbox.audiobook.radio.podcast |  | Link
Music & Audio | Spotify | com.spotify.music |  | Link
Photography | Facelab | com.lyrebirdstudio.facelab |  | Link
Photography | ToonMe | com.vicman.toonmeapp |  | Link
Productivity | Dropbox | com.dropbox.android |  | Link
Productivity | OneDrive | com.microsoft.skydrive |  | Link
Shopping | Divar | ir.divar |  | Link
Shopping | Digikala | com.digikala.diagon |  | Link
Shopping | Shaypur | com.sheypoor.mobile |  | Link
Shopping | Torob | ir.torob |  | Link
Social | LinkedIn | com.linkedin.android |  | Link
Social | Snapchat | com.snapchat.android |  | Link
Social | Instagram | com.instagram.android |  | Link
Social | Likee | video.like |  | Link
Social | Facebook lite | com.facebook.lite |  | Link
Social | Clubhouse | com.clubhouse.app |  | Link
Social | Discord | com.discord |  | Link
Social | Twitter | com.twitter.android |  | Link
Android app market | Google Play Store | com.android.vending |  | Link
Android app market | Myket | ir.mservices.market |  | Link
Android app market | Bazaar | com.farsitel.bazaar |  | Link
Tools | InSave | instagram.status.hd.images.video.downloader |  | Link

Table A2. List of Applications
