DOI: 10.1145/3405671.3405812

Challenges in Using ML for Networking Research: How to Label If You Must

Published: 10 August 2020

Abstract

Leveraging innovations in Machine Learning (ML) research is of great current interest to researchers across the sciences, including networking research. However, using ML for networking poses challenging new problems that have slowed the pace of innovation and the adoption of ML in the networking domain. Among the main problems are a well-known lack of data in general and of representative data in particular, an overall inability to label data at scale, unknown data quality due to differences in data collection strategies, and data privacy issues that are unique to network data. Motivated by these challenges, we describe the design of Emerge, a novel framework to support efforts to dEmocratize the use of ML for nEtwoRkinG rEsearch. In particular, Emerge focuses on the problem of providing a low-cost, scalable, and high-quality methodology for labeling networking data. To illustrate the benefits of Emerge, we use publicly available network measurement datasets from CAIDA's Ark project and create and evaluate data labels for them in a programmable fashion.
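
As a rough illustration of what labeling "in a programmable fashion" can look like, the sketch below writes a few heuristic labeling functions over latency measurements and combines their votes. It is not Emerge's implementation: the record fields (`rtt_ms`, `loss_rate`, `hop_count`), the thresholds, and the simple majority vote are all assumptions made for this example. Weak-supervision systems in the data-programming style typically learn how much to trust each function instead of counting votes equally.

```python
from collections import Counter
from typing import Callable, Dict, List

# Label values used by every labeling function (hypothetical encoding).
ABSTAIN, NORMAL, ANOMALOUS = -1, 0, 1

# A measurement record; the field names are assumptions for this sketch.
Record = Dict[str, float]


def lf_high_rtt(r: Record) -> int:
    # Very high round-trip time suggests an anomalous measurement.
    return ANOMALOUS if r["rtt_ms"] > 300.0 else ABSTAIN


def lf_low_rtt(r: Record) -> int:
    # Low round-trip time suggests a normal measurement.
    return NORMAL if r["rtt_ms"] < 100.0 else ABSTAIN


def lf_high_loss(r: Record) -> int:
    # Heavy probe loss suggests an anomalous measurement.
    return ANOMALOUS if r["loss_rate"] > 0.2 else ABSTAIN


def lf_short_path(r: Record) -> int:
    # Short paths are treated as weak evidence of normal behavior.
    return NORMAL if r["hop_count"] <= 15 else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[Record], int]] = [
    lf_high_rtt,
    lf_low_rtt,
    lf_high_loss,
    lf_short_path,
]


def majority_label(record: Record,
                   lfs: List[Callable[[Record], int]] = LABELING_FUNCTIONS) -> int:
    """Combine labeling-function votes; abstain on no votes or on a tie."""
    votes = [v for v in (lf(record) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence with equal support
    return ranked[0][0]


if __name__ == "__main__":
    sample = {"rtt_ms": 350.0, "loss_rate": 0.3, "hop_count": 20}
    print(majority_label(sample))
```

Records on which the functions disagree evenly, or on which every function abstains, stay unlabeled in this sketch; they can be inspected by hand or used to write further labeling functions.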

      Published In

      NetAI '20: Proceedings of the Workshop on Network Meets AI & ML
      August 2020
      66 pages
      ISBN:9781450380430
      DOI:10.1145/3405671
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Labeling network data at scale
      2. Weak supervision

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      SIGCOMM '20
      Acceptance Rates

      Overall Acceptance Rate 13 of 38 submissions, 34%

      Cited By

      • (2024) Leveraging Prefix Structure to Detect Volumetric DDoS Attack Signatures with Programmable Switches. 2024 IEEE Symposium on Security and Privacy (SP) (4535-4553). DOI: 10.1109/SP54263.2024.00267. Online publication date: 19-May-2024
      • (2022) Learning Model Generalisation for Bot Detection. Proceedings of the 2022 European Interdisciplinary Cybersecurity Conference (57-63). DOI: 10.1145/3528580.3532841. Online publication date: 15-Jun-2022
      • (2022) ARISE: A Multitask Weak Supervision Framework for Network Measurements. IEEE Journal on Selected Areas in Communications 40:8 (2456-2473). DOI: 10.1109/JSAC.2022.3180783. Online publication date: Aug-2022
      • (2022) Zero Touch Management: A Survey of Network Automation Solutions for 5G and 6G Networks. IEEE Communications Surveys & Tutorials 24:4 (2535-2578). DOI: 10.1109/COMST.2022.3212586. Online publication date: Dec-2023
