Abstract
Internet of Things (IoT) search engines are increasingly popular for exploring devices on the Internet. Table-to-text generation for devices helps users understand the search results these engines return. However, such generation is not yet available, and good text descriptions of devices are hard to obtain because quality data for the task is lacking. Moreover, the relevant content is spread across multiple device attributes and is difficult to mine directly. This paper therefore introduces ip2text, a challenging dataset for reasoning-aware table-to-text generation of devices on the Internet. The inputs in ip2text are tables containing many attributes of devices collected from the Internet, and the outputs are the corresponding descriptions. Annotating descriptions of devices is costly and time-consuming, and it does not scale to Internet data. To tackle this problem, this paper designs an annotation method based on active learning that is tailored to the characteristics of devices, and studies the performance of typical state-of-the-art table-to-text generation models on ip2text. Automatic evaluation shows that existing pre-trained baselines struggle to perform satisfactorily on ip2text, with almost all BLEU scores below 1. Human evaluation further shows that BART and T5 are prone to hallucination when reasoning, with Hallucination scores above 0.10. Existing mainstream seq2seq models therefore do not readily achieve satisfactory performance on the reasoning-aware ip2text, and both the models and the datasets for table-to-text generation of devices on the Internet need continued improvement.
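To make the input and output format concrete, the following is a minimal sketch of how a device attribute table might be flattened into a single source string for a seq2seq model such as BART or T5. The attribute names, the `name: value | ...` layout, and the `linearize_table` helper are illustrative assumptions and are not taken from ip2text or from the paper's method.

```python
from typing import Dict


def linearize_table(attributes: Dict[str, str]) -> str:
    """Flatten an attribute table into one string a seq2seq model can read.

    Assumed layout (hypothetical): "name: value" cells joined by " | ".
    """
    # Skip empty cells and keep the table's original attribute order.
    cells = [f"{name}: {value}" for name, value in attributes.items() if value]
    return " | ".join(cells)


# Hypothetical device record for illustration only; not a row from ip2text.
device = {
    "ip": "192.0.2.10",
    "port": "554",
    "protocol": "rtsp",
    "banner": "",
    "organization": "Example ISP",
}

source_text = linearize_table(device)
print(source_text)
```

A model fine-tuned on such pairs would take `source_text` as input and be trained to emit the reference description as the target sequence.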
Supported by National Key Research and Development Projects (No. 2020YFB2103803) and National Natural Science Foundation of China (No. U1766215, No. 61931019).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ren, Y. et al. (2023). ip2text: A Reasoning-Aware Dataset for Text Generation of Devices on the Internet. In: El Abbadi, A., et al. Database Systems for Advanced Applications. DASFAA 2023 International Workshops. DASFAA 2023. Lecture Notes in Computer Science, vol 13922. Springer, Cham. https://doi.org/10.1007/978-3-031-35415-1_2
DOI: https://doi.org/10.1007/978-3-031-35415-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35414-4
Online ISBN: 978-3-031-35415-1
eBook Packages: Computer Science, Computer Science (R0)