SSRN Id4661706
Authors
Marziyeh Bayat1, Javad Garshasbi1, Mozhgan Mehdizadeh1, Neda Nozari1, Abolghasem Rezaei Khesal1,
Maryam Dokaei1, Mehdi Teimouri1
Affiliations
Corresponding author(s)
Recognizing mobile applications within encrypted network traffic has considerable implications
across multiple domains, including network administration, security, and digital marketing.
Building network traffic classifiers that can adapt to dynamic and unpredictable real-world
settings is a major challenge. Currently available datasets contain traffic data
obtained from a single network environment, which restricts their utility in evaluating the resilience
and adaptability of a given model. To overcome this limitation, we have gathered a network traffic
dataset from over 50 Android applications in five network scenarios. The dataset includes 1,163 PCAP traces
containing 37 GB of network traffic data. It is more representative of real-world network traffic and can
serve as a valuable resource for developing classifiers that are robust and compatible with real-world
network environments.
Keywords
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4661706
Specifications Table
Subject Computer Networks and Communications
Specific subject area Network Traffic Analysis
How data were acquired Hardware:
- Laptop with dual-band network card
- Smartphone
Software:
- Wireshark
- PCAPdroid
- Proton VPN
Data format Raw
Parameters for data collection We conducted the data collection process under five different scenarios.
While keeping the applications the same across all scenarios, each
scenario varied in terms of ISP, location, device (vendor, model, OS
Data accessibility Repository name: ITC-Net-Blend-60- Scenario A
Data identification number: 10.17632/ssv23kfcgs.2
Direct URL to data: https://data.mendeley.com/datasets/ssv23kfcgs/2
Repository name: ITC-Net-Blend-60- Scenario B
Data identification number: 10.17632/3zggb53m4x.2
Direct URL to data: https://data.mendeley.com/datasets/3zggb53m4x/2
Direct URL to data: https://data.mendeley.com/datasets/gp8r347j38/2
Data identification number: 10.17632/mcmf627yh5.2
Direct URL to data: https://data.mendeley.com/datasets/mcmf627yh5/2
Repository name: ITC-Net-Blend-60- Scenario E
Data identification number: 10.17632/gdtnnfyr7s.2
Direct URL to data: https://data.mendeley.com/datasets/gdtnnfyr7s/2
Related research article A. R. Khesal and M. Teimouri, “The Effect of Network Environment on
Traffic Classification,” in Proc. ICCKE 2022, 17-18 Nov. 2022, Mashhad, pp.
59-64. DOI: 10.1109/ICCKE57176.2022.9960138.
One of the key challenges in this field is developing robust models that function appropriately in
real network environments [1, 2]. As indicated in Table A1 of Appendix A, existing datasets are
mainly obtained from a single invariant network environment and therefore cannot be used to
properly evaluate model robustness. Our dataset was captured across five different network
scenarios that vary in factors known to affect network traffic behavior [3, 4]. Furthermore, the data
were generated through real human interactions on actual smartphones and captured using a
non-rooted method. This makes the data more representative and suitable for mobile app traffic analysis.
Who can benefit from these data?
These data can provide valuable insights to those interested in improving network performance and
understanding mobile app users’ behavior. Some of the potential beneficiaries are Internet
service providers (ISPs), network administrators, marketing companies, and security agencies [1,
3, 5].
How can these data be used for further insights and development of experiments?
This large and diverse dataset can be used to develop and evaluate a wide range of network traffic
analysis tools. Given that the data are released in raw format (PCAP files), researchers have the
Importantly, the various network scenarios presented in the dataset can be used to evaluate
model robustness and generalization. We recommend training models on four scenarios and
testing them on the remaining one, repeating this process so that each scenario serves once as the
test set, and reporting the results using the formula described in reference [6]. This
cross-validation process will yield valuable insights into how well models generalize to different
network conditions.
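The recommended protocol amounts to a leave-one-scenario-out split. The following is a minimal sketch of how the five train/test splits could be enumerated; the function name and the list of scenario labels are illustrative assumptions, and the aggregation formula from reference [6] is not reproduced here.

```python
# Scenario labels as used in the dataset repositories (A to E).
SCENARIOS = ["A", "B", "C", "D", "E"]


def leave_one_scenario_out(scenarios):
    """Yield (train_scenarios, test_scenario) with each scenario held out once.

    Hypothetical helper for illustration; not part of the published code.
    """
    for held_out in scenarios:
        train = [s for s in scenarios if s != held_out]
        yield train, held_out


splits = list(leave_one_scenario_out(SCENARIOS))
for train, test in splits:
    print(f"train on {train}, test on {test}")
```

Each split trains on four scenarios and tests on the fifth, so a model is always evaluated on a network environment it has not seen during training.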
Data
This dataset is organized into separate repositories for each scenario. Each repository includes a dedicated
compressed file for every application. These compressed files contain the corresponding PCAP files. A
visual representation of the dataset structure is shown in Figure 1. Consistency in file naming conventions
has been maintained across the entire dataset. All PCAP files have been named using the following format:
{Application Name}_{Scenario ID}_{#Trace}_Final.pcap.
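Because the naming convention is uniform, the label of each trace can be recovered from its file name. The sketch below shows one way to do this with a regular expression. It assumes that the scenario ID is a single letter from A to E and that the trace number is a decimal integer, as in the example name App1_A_2_Final.pcap from Figure 1; the names `TraceInfo` and `parse_trace_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: {Application Name}_{Scenario ID}_{#Trace}_Final.pcap
# The greedy application-name group tolerates underscores inside the name.
TRACE_NAME = re.compile(
    r"^(?P<app>.+)_(?P<scenario>[A-E])_(?P<trace>\d+)_Final\.pcap$"
)


class TraceInfo(NamedTuple):
    app: str
    scenario: str
    trace: int


def parse_trace_name(filename: str) -> Optional[TraceInfo]:
    """Return the parsed fields, or None if the name does not fit the pattern."""
    match = TRACE_NAME.match(filename)
    if match is None:
        return None
    return TraceInfo(match["app"], match["scenario"], int(match["trace"]))


info = parse_trace_name("App1_A_2_Final.pcap")
print(info)
```

Returning None for non-matching names makes it easy to skip unrelated files when walking an extracted archive.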
Dataset:
|- Scenario A
|--- App1_A_2_Final.pcap
|--- ...
+-- ...
+- Scenario B
+- Scenario C
+- Scenario D
+- Scenario E
Figure 1. Structure of the dataset
The whole dataset consists of 1,163 PCAP traces capturing 37 GB of network traffic data. Table 1 provides
more details about the dataset and each scenario (the code and documents are available in the
supplementary material).
ID  Number  Files  Packets     Flows*   Duration (h)  Deficiency
A   59      233    13,823,338  108,370  27.94         Whatsapp Business
It is important to note that human errors during the data collection process led to the corruption of some
application files. As a result, only 51 of the initial 60 applications remained consistent across all scenarios.
We have decided to preserve and publish the non-shared applications because they remain useful for
purposes other than robustness evaluation. These applications can be found in a folder
named "non-shared" within the repositories.
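The distinction between shared and non-shared applications is a set intersection across scenarios. The following sketch illustrates the idea on made-up application names; the function `split_shared` and the example data are assumptions for illustration and do not reflect the actual contents of the repositories.

```python
from typing import Dict, Set, Tuple


def split_shared(
    apps_by_scenario: Dict[str, Set[str]]
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return the apps present in every scenario and the leftovers per scenario.

    Expects at least one scenario in the mapping.
    """
    shared = set.intersection(*apps_by_scenario.values())
    non_shared = {s: apps - shared for s, apps in apps_by_scenario.items()}
    return shared, non_shared


# Made-up example: App3 and App4 are each missing from some scenario.
example = {
    "A": {"App1", "App2", "App3"},
    "B": {"App1", "App2"},
    "C": {"App1", "App2", "App4"},
}
shared, non_shared = split_shared(example)
print(sorted(shared))
```

Only the shared set is suitable for cross-scenario robustness evaluation, since every scenario must contain traces for each class.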
Experimental Design, Materials, and Methods
The methodology employed for collecting the dataset comprised three main stages: Application Selection,
Traffic Capture Setup, and Traffic Generation. In the application selection phase, we chose the applications
to monitor. Next, in the traffic capture setup phase, we set up the framework to capture network traffic
data. Finally, in the traffic generation phase, we generated the actual network traffic data from the
selected applications. Details on each phase are provided in the following sections.
Application Selection
Since it is not feasible to capture the traffic of all applications, we considered the top 300 free Android
apps listed in October 2021 in the Google Play Store and in two major Iranian Android app markets, Cafe
Bazaar and Myket. From these, we selected a subset of 60 applications based on two criteria:
The chosen applications spanned 16 different categories. The complete list of the selected 60 applications
and additional information can be found in Table A2 in Appendix A.
Traffic Capture Setup
As depicted in Figure 2, our traffic capture setup included a smartphone and a laptop. We used a laptop
running Windows 10 with an internal dual-band network card and installed Wireshark software on it. The
laptop was connected to the internet and shared its connection with the smartphone via a hotspot. Then
we configured Wireshark to capture traffic through the "Local Area Connection" interface. This setup
enabled the smartphone to access the internet through the laptop's connection, allowing Wireshark to
capture the smartphone's network traffic.
The traffic captured by Wireshark contained significant background traffic [7]. To isolate the target
application's network traffic, we installed PCAPdroid on the smartphone. Since root access could modify
application behavior, we used PCAPdroid in non-root mode. In this mode, PCAPdroid does not use a
remote VPN server; instead, it simulates a VPN to capture the network traffic and processes data locally
on the device.
While PCAPdroid can capture an individual application's traffic, it modifies the layer 3 and layer 4
headers of packets, so its captures cannot be used on their own. Therefore, we used Wireshark and
PCAPdroid simultaneously to record the target application's traffic. Wireshark provided an unaltered
packet capture, while PCAPdroid isolated traffic originating from the application. The two tools
complemented each other to provide an accurate capture of the traffic.
Unfortunately, several applications restricted access to their servers from Iran, which required us
to use a VPN connection to run them properly. These applications are identified in Table A2. For this
purpose, we installed the free version of Proton VPN on the laptop, set its protocol to
OpenVPN-UDP, and set the VPN network driver to the TAP adapter. To ensure that recorded packets
were not modified by the VPN, we captured traffic on the "Local Area Connection" interface, before
it entered the VPN connection (Figure 3). As a result, all application traffic was accurately recorded,
including traffic generated by applications that require a VPN connection.
After collecting traffic data, we separated the target application traffic from the background traffic
through a pairwise comparison of IP addresses and ports captured by Wireshark and PCAPdroid.
Specifically, for each trace, we compared each pair of IP addresses and ports captured by Wireshark with
all pairs captured by PCAPdroid. Any pairs in Wireshark that did not match a PCAPdroid pair were
identified as background traffic and were removed from the Wireshark data. In this way, we eliminated
irrelevant traffic and obtained ground truth without requiring root privileges on mobile phones.
We implemented this method in Python 3 using the Scapy library. The code for this implementation is
available in the supplementary material.
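The filtering idea can be illustrated without reproducing the authors' Scapy implementation, which is in the supplementary material. The sketch below is a simplified illustration under stated assumptions: packets are represented as plain (src_ip, src_port, dst_ip, dst_port) tuples rather than Scapy objects, the match key is assumed to be the remote endpoint (IP address and port), and a set lookup replaces the explicit pairwise loop. All function names and addresses are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Packet = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)
Endpoint = Tuple[str, int]          # (ip, port)


def remote_endpoints(packets: Iterable[Packet], local_ips: Set[str]) -> Set[Endpoint]:
    """Collect every (ip, port) endpoint that is not on the capturing device."""
    endpoints: Set[Endpoint] = set()
    for src_ip, src_port, dst_ip, dst_port in packets:
        if src_ip not in local_ips:
            endpoints.add((src_ip, src_port))
        if dst_ip not in local_ips:
            endpoints.add((dst_ip, dst_port))
    return endpoints


def keep_app_packets(
    wireshark: Iterable[Packet],
    pcapdroid: Iterable[Packet],
    pcapdroid_local_ips: Set[str],
) -> List[Packet]:
    """Keep Wireshark packets whose endpoint also appears in the PCAPdroid capture."""
    app_endpoints = remote_endpoints(pcapdroid, pcapdroid_local_ips)
    kept = []
    for pkt in wireshark:
        src_ip, src_port, dst_ip, dst_port = pkt
        if (src_ip, src_port) in app_endpoints or (dst_ip, dst_port) in app_endpoints:
            kept.append(pkt)
    return kept


# Hypothetical captures: 10.0.0.2 is the phone as seen inside PCAPdroid,
# 192.168.137.5 is the phone as seen by Wireshark on the laptop hotspot.
pcapdroid_pkts = [("10.0.0.2", 40000, "203.0.113.10", 443)]
wireshark_pkts = [
    ("192.168.137.5", 50000, "203.0.113.10", 443),  # target app request
    ("203.0.113.10", 443, "192.168.137.5", 50000),  # target app response
    ("192.168.137.5", 50001, "198.51.100.7", 53),   # background traffic
]
kept = keep_app_packets(wireshark_pkts, pcapdroid_pkts, {"10.0.0.2"})
print(kept)
```

In this example the third Wireshark packet has no counterpart in the PCAPdroid capture, so it is treated as background traffic and dropped.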
Traffic Generation
The traffic generation process was carried out by five volunteers from ITCLAB over six weeks, from
October to December 2021. Each volunteer collected traffic from a different network scenario, as
outlined in Table 2.
Scenario ID  User  Device vendor  Device model  Android version  Location*   ISP*
A            U1    Xiaomi         Note10 Pro    11               L1, L2      N1, N2
B            U2    Samsung        A50           11               L1, L3      N1, N3
C            U3    Samsung        A31           11               L4          N2, N4
C            U3    Samsung        Tab A7 Lite   11               L4          N2, N4
D            U4    Samsung        J7 Prime 2    9                L1, L2, L5  N1, N2, N5
E            U5    Samsung        J7            6.0.1            L6          N2, N6
E            U5    Samsung        A12           11               L6          N2, N6
* L1 = ITC Lab; L2 = District 5, Tehran; L3 = District 11, Tehran; L4 = Qom; L5 = Karaj; L6 = District 8, Tehran
  N1 = University of Tehran; N2 = TCI; N3 = AsiaTech; N4 = NTC; N5 = Shatel; N6 = MCI
Table 2. Scenarios Specifications
Ethical principles were strictly followed, and the volunteers were well informed about the
objectives of the traffic capture and the public release of the data in PCAP format. Additionally, they
were allowed to use their personal devices and had complete discretion over sharing their data.
Before commencing the data collection process, the volunteers received training on how to collect traffic.
Each volunteer was required to conduct at least three experiments for every application, with each
experiment consisting of interacting with a single app on a specific smartphone for 3 to 15 minutes.
The volunteers were instructed to use the application as they normally would, to explore its
functionalities. For applications that required login, they were given the option to either create new
Acknowledgments
We wish to express our deepest gratitude to Mohammad Reza Tajzad for his invaluable insights and
expertise, which significantly contributed to the success of our research. We also thank Fatemeh Delroba
for her assistance in the data collection process. Additionally, we would like to extend our heartfelt thanks
to Parastoo Soori for her helpful comments and suggestions on an earlier version of this manuscript, which
greatly improved it. Finally, we sincerely thank all the participants in this study for their time and
willingness to share their experiences. Without their contributions, our study would not have been
possible.
References
[1] T. van Ede et al., "FlowPrint: Semi-supervised mobile-app fingerprinting on encrypted network
traffic," in Network and Distributed System Security Symposium (NDSS), 2020, vol. 27.
[2] [dataset] W. Li, X.-Y. Zhang, H. Bao, H. Shi, and Q. Wang, "ProGraph: Robust Network Traffic
Identification With Graph Propagation," IEEE/ACM Transactions on Networking, 2022.
[3] [dataset] V. F. Taylor, R. Spolaor, M. Conti, and I. Martinovic, "Robust smartphone app
identification via encrypted network traffic analysis," IEEE Transactions on Information Forensics
and Security, vol. 13, no. 1, pp. 63-78, 2017.
[4] H. F. Alan and J. Kaur, "Can Android applications be identified using only TCP/IP headers of their
launch time traffic?," in Proceedings of the 9th ACM conference on security & privacy in wireless
and mobile networks, 2016, pp. 61-66.
[5] G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, "Mobile encrypted traffic classification using
deep learning: Experimental evaluation, lessons learned, and challenges," IEEE Transactions on
Network and Service Management, vol. 16, no. 2, pp. 445-458, 2019.
[6] X. Gui, Y. Cao, I. You, L. Ji, Y. Luo, and Z. Luo, "A Survey of techniques for fine-grained web traffic
identification and classification," Mathematical Biosciences and Engineering, vol. 19, no. 3, pp.
2996-3021, 2022.
[7] T. Stöber, M. Frank, J. Schmitt, and I. Martinovic, "Who do you sync you are? smartphone
fingerprinting via application behaviour," in Proceedings of the sixth ACM conference on Security
and privacy in wireless and mobile networks, 2013, pp. 7-12.
[8] [dataset] J. S. Rojas, Á. R. Gallón, and J. C. Corrales, "Personalized service degradation policies on
OTT applications based on the consumption behavior of users," in Computational Science and Its
Applications–ICCSA 2018: 18th International Conference, Melbourne, VIC, Australia, July 2–5, 2018,
Proceedings, Part III 18, 2018: Springer, pp. 543-557.
[9] [dataset] R. Wang, Z. Liu, Y. Cai, D. Tang, J. Yang, and Z. Yang, "Benchmark data for mobile app
traffic research," in Proceedings of the 15th EAI International Conference on Mobile and
Ubiquitous Systems: Computing, Networking and Services, 2018, pp. 402-411.
[11] [dataset] G. Aceto, D. Ciuonzo, A. Montieri, V. Persico, and A. Pescapé, "MIRAGE: Mobile-app
traffic capture and ground-truth creation," in 2019 4th International Conference on Computing,
Communications and Security (ICCCS), 2019: IEEE, pp. 1-8.
[12] [dataset] J. Ren, D. Dubois, and D. Choffnes, "An International View of Privacy Risks for Mobile
Apps," ed, 2019.
[13] [dataset] Y. Heng, V. Chandrasekhar, and J. G. Andrews, "UTMobileNetTraffic2021: A Labeled
Public Network Traffic Dataset," IEEE Networking Letters, vol. 3, no. 3, pp. 156-160, 2021.
[15] ProtonVPN. Available online: https://protonvpn.com/ (accessed on 21 May 2023).
Appendix A
This appendix includes two tables, A1 and A2, which compare the available datasets and list the
selected applications, respectively.
Mobile No. Real Capture No. Network No. Shared
No. No. Released
Dataset Apps Human Capture Span Session Environments Apps across
Apps Device Data
Traffic Users Duration
Traffic
Unicauca [8] 78 Six days in 2017 1
features
Traffic
Mobilegt [9] 12 10 October, 2016 - March, 2017 16 min 1
features
- Raw (Pcap
Andrubis [10] 1M 2012.06.13 - 2016.03.25 1
(simulation) files)
5 - 10 Traffic
Mirage [11] 40 280 1 May 2017 - May 2019 1
min features
Raw (Pcap
Cross Market [12] 400 2017.08.28 - 2017.11.20 3 10
files)
Raw (Pcap
CrossNet2021 [2] 20 2 20
files)
- Traffic
Appscanner [3] 110 2 30 min 8 65
UTMobileNet2021[13]
ITC-Net-Blend-60
16
60
(simulation)
-
(simulation)
5
3
7
October - December 2021
3 - 15
min
features
Raw (Pcap
files)
Raw (Pcap
files)
3
(not itemized)
5
16
52
*The cells that are left blank are due to the authors not providing any information in those specific areas.
Applications
Category           Name                Package Name                                      Metadata  VPN Requirement
                   Duolingo            com.duolingo                                      Link
                   Memrise             com.memrise.android.memrisecompanion              Link
                   Coursera            org.coursera.android                              Link
Entertainment      Telewebion          net.telewebion                                    Link
Entertainment      Youtube             com.google.android.youtube                        Link
Entertainment      Aparat              com.aparat                                        Link
Entertainment      Filimo              com.aparat.filimo                                 Link
Finance            Ewano               com.ebcom.ewano                                   Link
Finance            AP - Asan Pardakht  com.sibche.aspardproject.app                      Link
Game               Clash of Clans      com.supercell.clashofclans                        Link
Game               Football Strike     com.miniclip.footballstrike                       Link
Game               Mencherz            com.incyteltech.mencherz                          Link
Game               Quiz Of Kings       co.palang.QuizOfKingss                            Link
Game               Mafioso             com.herocraft.game.mafioso.gangster.paradise.pvp  Link
Lifestyle          Pinterest           com.pinterest                                     Link
Maps & navigation  Snapp               cab.snapp.passenger.play                          Link
Maps & navigation  Balad               com.baladmaps                                     Link
Maps & navigation  Neshan              org.rajman.neshan.traffic.tehran.navigator        Link
Maps & navigation  Tapsi               taxi.tapsi.passenger                              Link
Maps & navigation  Google Maps         com.google.android.apps.maps                      Link
Maps & navigation  Waze                com.waze                                          Link
Music & Audio      Radio Javan         com.radiojavan.androidradio                       Link
Music & Audio      Castbox             fm.castbox.audiobook.radio.podcast                Link
Music & Audio      Spotify             com.spotify.music                                 Link
Photography        Facelab             com.lyrebirdstudio.facelab                        Link
Photography        ToonMe              com.vicman.toonmeapp                              Link
Productivity       Dropbox             com.dropbox.android                               Link
Productivity       OneDrive            com.microsoft.skydrive                            Link
Shopping           Divar               ir.divar                                          Link
Shopping           Digikala            com.digikala.diagon                               Link
Shopping                               com.sheypoor.mobile