DOI: 10.1145/3357384.3357827

Industry Specific Word Embedding and its Application in Log Classification

Published: 03 November 2019

Abstract

Word, sentence, and document embeddings have become the cornerstone of most natural language processing-based solutions. Training an effective embedding depends on a large corpus of relevant documents. However, such a corpus is not always available, especially for specialized heavy industries such as oil, mining, or steel. To address this problem, this paper proposes a semi-supervised learning framework that creates a document corpus and embedding starting from an industry taxonomy, along with a very limited set of relevant positive and negative documents. Our solution organizes candidate documents into a graph and adopts different explore-and-exploit strategies to iteratively create the corpus and its embedding. At each iteration, two metrics, called Coverage and Context Similarity, are used as proxies to measure the quality of the results. Our experiments demonstrate that an embedding created by our solution is more effective than one created by processing thousands of industry-specific document pages. We also explore using our embedding in downstream tasks, such as building an industry-specific classification model given labeled training data, as well as classifying unlabeled documents according to industry taxonomy terms.
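The explore-and-exploit loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the graph layout, the token-overlap (Jaccard) similarity standing in for the embedding-based Context Similarity metric, and all function and parameter names (`build_corpus`, `epsilon`, `steps`) are assumptions made for the sketch.

```python
import random

def coverage(corpus_terms, taxonomy_terms):
    """Fraction of taxonomy terms that appear in the corpus so far."""
    return len(taxonomy_terms & corpus_terms) / len(taxonomy_terms)

def context_similarity(doc_tokens, seed_tokens):
    """Jaccard overlap between token sets -- a simple stand-in for the
    embedding-based Context Similarity used in the paper."""
    a, b = set(doc_tokens), set(seed_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_corpus(graph, seed_id, taxonomy, epsilon=0.3, steps=10, rng=None):
    """Grow a corpus over a document graph with an explore/exploit loop.

    graph maps doc_id -> (tokens, neighbor_ids). At each step, with
    probability epsilon we explore (pick a random frontier document);
    otherwise we exploit (pick the frontier document most similar to
    the seed). Returns the selected corpus and its taxonomy coverage.
    """
    rng = rng or random.Random(0)
    corpus = {seed_id}
    for _ in range(steps):
        # Frontier: graph neighbors of the corpus not yet selected.
        frontier = {n for d in corpus for n in graph[d][1]} - corpus
        if not frontier:
            break
        if rng.random() < epsilon:  # explore
            pick = rng.choice(sorted(frontier))
        else:                       # exploit
            pick = max(frontier,
                       key=lambda d: context_similarity(graph[d][0],
                                                        graph[seed_id][0]))
        corpus.add(pick)
    seen = {t for d in corpus for t in graph[d][0]}
    return corpus, coverage(seen, taxonomy)
```

In the actual framework, Coverage and Context Similarity serve as stopping and quality proxies over iterations; here coverage is simply reported after the loop to keep the sketch short.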



Published In

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
November 2019
3373 pages
ISBN:9781450369763
DOI:10.1145/3357384
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher: Association for Computing Machinery, New York, NY, United States



Author Tags

  1. natural language processing
  2. text classification
  3. word embeddings

Qualifiers

  • Research-article

Conference

CIKM '19

Acceptance Rates

CIKM '19 paper acceptance rate: 202 of 1,031 submissions (20%).
Overall acceptance rate: 1,861 of 8,427 submissions (22%).


Cited By

  • (2024) "Natural Language Processing in Knowledge-Based Support for Operator Assistance", Applied Sciences 14(7), 2766. DOI: 10.3390/app14072766. 26 Mar 2024.
  • (2023) "SiWare", Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 7115–7118. DOI: 10.24963/ijcai.2023/829. 19 Aug 2023.
  • (2023) "SEmHuS: a semantically embedded humanitarian space", Journal of International Humanitarian Action 8(1). DOI: 10.1186/s41018-023-00135-4. 7 Mar 2023.
  • (2022) "A Custom Word Embedding Model for Clustering of Maintenance Records", IEEE Transactions on Industrial Informatics 18(2), 816–826. DOI: 10.1109/TII.2021.3079521. Feb 2022.
  • (2021) "Fast Extraction of Word Embedding from Q-contexts", Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 873–882. DOI: 10.1145/3459637.3482343. 26 Oct 2021.
  • (2021) "Software Log Classification in Telecommunication Industry", 2021 6th International Conference on Computer Science and Engineering (UBMK), 348–353. DOI: 10.1109/UBMK52708.2021.9558985. 15 Sep 2021.
  • (2021) "Inferring Multilingual Domain-Specific Word Embeddings From Large Document Corpora", IEEE Access 9, 137309–137321. DOI: 10.1109/ACCESS.2021.3118093. 2021.
  • (2021) "Technological troubleshooting based on sentence embedding with deep transformers", Journal of Intelligent Manufacturing 32(6), 1699–1710. DOI: 10.1007/s10845-021-01797-w. 7 Jun 2021.
  • (2021) "Machine Learning and Natural Language Processing in Domain Classification of Scientific Knowledge Objects: A Review", Advances in Information and Communication, 773–784. DOI: 10.1007/978-3-030-73103-8_55. 16 Apr 2021.
  • (2020) "An Ensemble Learning based Hierarchical Multi-label Classification Approach to Identify Impacts of Engineering Changes", 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), 1260–1267. DOI: 10.1109/ICTAI50040.2020.00190. Nov 2020.
