Abstract
In this paper, we investigate the influence of distance metrics on the results of open-set subject classification of text documents. We utilize the Local Outlier Factor (LOF) algorithm to extend a closed-set classifier (i.e. multilayer perceptron) with an additional class that identifies outliers. The analyzed text documents are represented by averaged word embeddings calculated using the fastText method on training data. Conducting the experiment on two different text corpora we show how the distance metric chosen for LOF (Euclidean or cosine) and a transformation of the feature space (vector representation of documents) both influence the open-set classification results. The general conclusion seems to be that the cosine distance outperforms the Euclidean distance in terms of performance of open-set classification of text documents.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: Identifying density-based local outliers. SIGMOD Rec. 29(2), 93–104 (2000). https://doi.org/10.1145/335191.335388
Doan, T., Kalita, J.: Overcoming the challenge for text classification in the open world. In: 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC), pp. 1–7. IEEE (2017)
Fei, G., Liu, B.: Breaking the closed world assumption in text classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 506–514 (2016)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. SSS. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7. Autres impressions : 2011 (corr.), 2013 (7e corr.)
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics (2017)
Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful? In: ICDT 1999 Proceedings of the 7th International Conference on Database Theory, pp. 217–235 (1999)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013). http://arxiv.org/abs/1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Młynarczyk, K., Piasecki, M.: Wiki train - 34 categories, CLARIN-PL digital repository (2015). http://hdl.handle.net/11321/222
Pandey, N.: Density based clustering for cricket world cup tweets using cosine similarity and time parameter. In: 2015 Annual IEEE India Conference (INDICON), pp. 1–6 (2015). https://doi.org/10.1109/INDICON.2015.7443520
Prakhya, S., Venkataram, V., Kalita, J.: Open set text classification using convolutional neural networks. In: Proceedings of the 14th International Conference on Natural Language Processing, pp. 466–475. NLP Association of India, Kolkata (2017)
Qian, G., Sural, S., Gu, Y., Pramanik, S.: Similarity between Euclidean and cosine angle distance for nearest neighbor queries. In: Proceedings of the 2004 ACM Symposium on Applied Computing, SAC 2004, pp. 1232–1237. ACM, New York (2004). https://doi.org/10.1145/967900.968151
Walkowiak, T., Datko, S., Maciejewski, H.: Algorithm based on modified angle-based outlier factor for open-set classification of text documents. Appl. Stochast. Models Bus. Ind. 34(5), 718–729 (2018)
Walkowiak, T., Malak, P.: Polish texts topic classification evaluation. In: Proceedings of the 10th International Conference on Agents and Artificial Intelligence - ICAART, vol. 2, pp. 515–522. INSTICC, SciTePress (2018)
Acknowledgement
This work was sponsored by National Science Centre, Poland (grant 2016/21/B/ST6/02159).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Walkowiak, T., Datko, S., Maciejewski, H. (2019). Distance Metrics in Open-Set Classification of Text Documents by Local Outlier Factor and Doc2Vec. In: Wotawa, F., Friedrich, G., Pill, I., Koitz-Hristov, R., Ali, M. (eds) Advances and Trends in Artificial Intelligence. From Theory to Practice. IEA/AIE 2019. Lecture Notes in Computer Science(), vol 11606. Springer, Cham. https://doi.org/10.1007/978-3-030-22999-3_10
Download citation
DOI: https://doi.org/10.1007/978-3-030-22999-3_10
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22998-6
Online ISBN: 978-3-030-22999-3
eBook Packages: Computer ScienceComputer Science (R0)