DOI: 10.1145/3357384.3357827

Industry Specific Word Embedding and its Application in Log Classification

Published: 03 November 2019

Abstract

Word, sentence, and document embeddings have become the cornerstone of most natural language processing-based solutions. Training an effective embedding depends on a large corpus of relevant documents. However, such a corpus is not always available, especially for specialized heavy industries such as oil, mining, or steel. To address this problem, this paper proposes a semi-supervised learning framework that creates a document corpus and embedding starting from an industry taxonomy, along with a very limited set of relevant positive and negative documents. Our solution organizes candidate documents into a graph and adopts different explore-and-exploit strategies to iteratively create the corpus and its embedding. At each iteration, two metrics, called Coverage and Context Similarity, are used as proxies to measure the quality of the results. Our experiments demonstrate that an embedding created by our solution is more effective than one created by processing thousands of industry-specific document pages. We also explore using our embedding in downstream tasks, such as building an industry-specific classification model given labeled training data, as well as classifying unlabeled documents according to industry taxonomy terms.
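The explore-and-exploit loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the graph layout, the token-overlap (Jaccard) similarity standing in for the embedding-based Context Similarity metric, and all function and parameter names (`build_corpus`, `epsilon`, `steps`) are assumptions made for the sketch.

```python
import random

def coverage(corpus_terms, taxonomy_terms):
    """Fraction of taxonomy terms that appear in the corpus so far."""
    return len(taxonomy_terms & corpus_terms) / len(taxonomy_terms)

def context_similarity(doc_tokens, seed_tokens):
    """Jaccard overlap between token sets -- a simple stand-in for the
    embedding-based Context Similarity used in the paper."""
    a, b = set(doc_tokens), set(seed_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_corpus(graph, seed_id, taxonomy, epsilon=0.3, steps=10, rng=None):
    """Grow a corpus over a document graph with an explore/exploit loop.

    graph maps doc_id -> (tokens, neighbor_ids). At each step, with
    probability epsilon we explore (pick a random frontier document);
    otherwise we exploit (pick the frontier document most similar to
    the seed). Returns the selected corpus and its taxonomy coverage.
    """
    rng = rng or random.Random(0)
    corpus = {seed_id}
    for _ in range(steps):
        # Frontier: graph neighbors of the corpus not yet selected.
        frontier = {n for d in corpus for n in graph[d][1]} - corpus
        if not frontier:
            break
        if rng.random() < epsilon:  # explore
            pick = rng.choice(sorted(frontier))
        else:                       # exploit
            pick = max(frontier,
                       key=lambda d: context_similarity(graph[d][0],
                                                        graph[seed_id][0]))
        corpus.add(pick)
    seen = {t for d in corpus for t in graph[d][0]}
    return corpus, coverage(seen, taxonomy)
```

In the actual framework, Coverage and Context Similarity serve as stopping and quality proxies over iterations; here coverage is simply reported after the loop to keep the sketch short.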



Published In

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
November 2019
3373 pages
ISBN:9781450369763
DOI:10.1145/3357384
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher: Association for Computing Machinery, New York, NY, United States



Author Tags

  1. natural language processing
  2. text classification
  3. word embeddings

Qualifiers

  • Research-article

Conference

CIKM '19

Acceptance Rates

CIKM '19 paper acceptance rate: 202 of 1,031 submissions (20%).
Overall acceptance rate: 1,861 of 8,427 submissions (22%).


Cited By

  • (2024) "Natural Language Processing in Knowledge-Based Support for Operator Assistance", Applied Sciences 14(7), 2766. DOI: 10.3390/app14072766. 26 Mar 2024.
  • (2023) "SiWare", Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 7115–7118. DOI: 10.24963/ijcai.2023/829. 19 Aug 2023.
  • (2023) "SEmHuS: a semantically embedded humanitarian space", Journal of International Humanitarian Action 8(1). DOI: 10.1186/s41018-023-00135-4. 7 Mar 2023.
  • (2022) "A Custom Word Embedding Model for Clustering of Maintenance Records", IEEE Transactions on Industrial Informatics 18(2), 816–826. DOI: 10.1109/TII.2021.3079521. Feb 2022.
  • (2021) "Fast Extraction of Word Embedding from Q-contexts", Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 873–882. DOI: 10.1145/3459637.3482343. 26 Oct 2021.
  • (2021) "Software Log Classification in Telecommunication Industry", 2021 6th International Conference on Computer Science and Engineering (UBMK), 348–353. DOI: 10.1109/UBMK52708.2021.9558985. 15 Sep 2021.
  • (2021) "Inferring Multilingual Domain-Specific Word Embeddings From Large Document Corpora", IEEE Access 9, 137309–137321. DOI: 10.1109/ACCESS.2021.3118093. 2021.
  • (2021) "Technological troubleshooting based on sentence embedding with deep transformers", Journal of Intelligent Manufacturing 32(6), 1699–1710. DOI: 10.1007/s10845-021-01797-w. 7 Jun 2021.
  • (2021) "Machine Learning and Natural Language Processing in Domain Classification of Scientific Knowledge Objects: A Review", Advances in Information and Communication, 773–784. DOI: 10.1007/978-3-030-73103-8_55. 16 Apr 2021.
  • (2020) "An Ensemble Learning based Hierarchical Multi-label Classification Approach to Identify Impacts of Engineering Changes", 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), 1260–1267. DOI: 10.1109/ICTAI50040.2020.00190. Nov 2020.
