Challenges in Using ML for Networking Research: How to Label If You Must

Authors: Ramakrishnan Durairajan, Walter Willinger

NetAI '20: Proceedings of the Workshop on Network Meets AI & ML

Pages 21 - 27

https://doi.org/10.1145/3405671.3405812

Published: 10 August 2020

Abstract

Leveraging innovations in Machine Learning (ML) research is of great current interest to researchers across the sciences, including networking research. However, using ML for networking poses challenging new problems that have slowed the pace of innovation and the adoption of ML in the networking domain. Among the main problems are a well-known lack of data in general and of representative data in particular, an overall inability to label data at scale, unknown data quality due to differences in data collection strategies, and data privacy issues that are unique to network data. Motivated by these challenges, we describe the design of Emerge, a novel framework to support efforts to dEmocratize the use of ML for nEtwoRkinG rEsearch. In particular, Emerge focuses on the problem of providing a low-cost, scalable, and high-quality methodology for labeling networking data. To illustrate the benefits of Emerge, we use publicly available network measurement datasets from CAIDA's Ark project and create and evaluate data labels for them in a programmable fashion.
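
To make "labeling in a programmable fashion" concrete, the following is a minimal sketch of the general idea of programmatic labeling: several small heuristic functions each vote on a sample, and the votes are combined by simple majority. It is not Emerge's implementation. The label names, the thresholds, the function names, and the representation of a measurement as a non-empty list of round-trip times in milliseconds are all assumptions made for illustration.

```python
from collections import Counter
from typing import List

# Label values (assumed for illustration); ABSTAIN means "no opinion".
ABSTAIN, NORMAL, NOISY = -1, 0, 1

# Hypothetical record: non-empty list of RTT samples (ms) for one path.
Sample = List[float]


def lf_high_spread(rtts: Sample) -> int:
    """Vote NOISY when the largest RTT exceeds 3x the smallest (assumed threshold)."""
    return NOISY if max(rtts) > 3 * min(rtts) else ABSTAIN


def lf_stable(rtts: Sample) -> int:
    """Vote NORMAL when every RTT is within 10% of the smallest (assumed threshold)."""
    return NORMAL if max(rtts) <= 1.1 * min(rtts) else ABSTAIN


def lf_outlier(rtts: Sample) -> int:
    """Vote NOISY when any RTT exceeds twice the middle sorted value."""
    middle = sorted(rtts)[len(rtts) // 2]
    return NOISY if any(r > 2 * middle for r in rtts) else ABSTAIN


LABELING_FUNCTIONS = [lf_high_spread, lf_stable, lf_outlier]


def majority_label(rtts: Sample) -> int:
    """Combine the non-abstaining votes by majority; abstain on no votes or a tie."""
    votes = [lf(rtts) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


if __name__ == "__main__":
    for sample in ([10.0, 10.5, 10.2], [10.0, 11.0, 50.0], [10.0, 12.0, 14.0]):
        print(sample, majority_label(sample))
```

Majority voting is the simplest way to combine such heuristics. Weak-supervision systems typically replace it with a model that estimates how accurate each heuristic is, which this sketch does not attempt.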

Published In

NetAI '20: Proceedings of the Workshop on Network Meets AI & ML

August 2020

66 pages

ISBN: 9781450380430

DOI: 10.1145/3405671

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

SIGCOMM '20

Sponsor:

SIGCOMM

SIGCOMM '20: Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication

August 10 - 14, 2020

Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 13 of 38 submissions, 34%
