
ip2text: A Reasoning-Aware Dataset for Text Generation of Devices on the Internet

  • Conference paper
Database Systems for Advanced Applications. DASFAA 2023 International Workshops (DASFAA 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13922))

Abstract

Internet of Things (IoT) search engines are increasingly popular among users who want to explore devices on the Internet. Table-to-text generation of devices helps users understand the search results these engines return. However, such generation is not yet available, and good text descriptions of devices are difficult to obtain because quality data for the task is lacking. Moreover, the relevant content is hidden across multiple attributes of the devices, and mining it well and directly takes effort. This paper therefore introduces ip2text, a challenging dataset for reasoning-aware table-to-text generation of devices on the Internet. The input data in ip2text are tables containing many attributes of devices collected from the Internet, and the output data are the corresponding descriptions. Creating descriptions of devices is costly and time-consuming, and it does not scale to Internet data. To tackle this problem, the paper designs an annotation method based on active learning that follows the characteristics of devices, and it studies the performance of typical state-of-the-art table-to-text generation models on ip2text. The automatic evaluation shows that existing pre-trained baselines struggle to perform satisfactorily on ip2text, with almost all BLEU scores below 1. Further, the human evaluation shows that BART and T5 are prone to hallucination when reasoning, with Hallucination scores above 0.10. It is therefore not easy to achieve satisfactory performance on the reasoning-aware ip2text with existing mainstream seq2seq models, and both the models and the datasets for table-to-text generation of devices on the Internet urgently need continued improvement.
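
To make the task setup in the abstract concrete, the following is a minimal sketch of how a device attribute table might be flattened into a single source string before being passed to a seq2seq model such as BART or T5. It is an illustration under stated assumptions, not the preprocessing used for ip2text: the function name, the `<attr>` and `<value>` markers, and the example attribute names and values are all hypothetical and do not come from the paper.

```python
from typing import Mapping


def linearize_device_table(attributes: Mapping[str, str]) -> str:
    """Flatten attribute/value pairs into one string for a seq2seq encoder.

    The <attr>/<value> markers are an assumed convention for this sketch,
    not a format defined by the ip2text dataset.
    """
    cells = []
    for name, value in attributes.items():
        value = str(value).strip()
        if not value:
            continue  # skip empty cells so they add no noise to the input
        cells.append(f"<attr> {name} <value> {value}")
    return " ".join(cells)


# Hypothetical device record; field names and values are illustrative only.
device = {
    "ip": "192.0.2.10",
    "port": "554",
    "protocol": "rtsp",
    "banner": "",
    "organization": "Example ISP",
}

source_text = linearize_device_table(device)
print(source_text)
```

A string like `source_text` would serve as the model input, with the human-written device description as the target. The reasoning the abstract refers to, which combines several attributes into one statement, is left to the model in this sketch.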

Supported by the National Key Research and Development Projects (No. 2020YFB2103803) and the National Natural Science Foundation of China (No. U1766215, No. 61931019).

Author information

Corresponding author

Correspondence to Zhi Li.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ren, Y. et al. (2023). ip2text: A Reasoning-Aware Dataset for Text Generation of Devices on the Internet. In: El Abbadi, A., et al. Database Systems for Advanced Applications. DASFAA 2023 International Workshops. DASFAA 2023. Lecture Notes in Computer Science, vol 13922. Springer, Cham. https://doi.org/10.1007/978-3-031-35415-1_2

  • DOI: https://doi.org/10.1007/978-3-031-35415-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35414-4

  • Online ISBN: 978-3-031-35415-1

  • eBook Packages: Computer Science, Computer Science (R0)
