Authors:
Samuel Zammit; Fiona Sammut and David Suda
Affiliation:
Department of Statistics & Operations Research, University of Malta, Msida, Malta
Keyword(s):
Natural Language Processing, Word Embeddings, Word2Vec, FastText, Doc2Vec, k-means Clustering.
Abstract:
This paper aims to identify common topics in a dataset of comments posted on the Times of Malta online news portal between April 2008 and January 2017. Word embeddings are obtained for each unique word in the dataset using Word2Vec by way of the FastText algorithm. Document vectors are also obtained for each comment, such that similar comments are assigned similar representations. The resulting word and document embeddings are then clustered using k-means clustering to identify common topic clusters. The results indicate that the majority of comments follow a political theme related to party politics, foreign politics, corruption, issues of an ideological nature, or other issues. Comments on themes such as sports, arts and culture were not common, except around years with major events. Additionally, a number of topics were found to be more prevalent during some time periods than others. These include the Maltese divorce referendum in 2011, the Maltese citizenship scheme in 2013, Russia’s annexation of Crimea in 2014, Brexit in 2015, and corruption and the Panama Papers in 2016.
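As a rough illustration of the pipeline summarised in the abstract (embed words, derive one vector per comment, then cluster with k-means), the sketch below may help. It rests on several assumptions that are not taken from the paper: the two-dimensional embeddings in `TOY_EMBEDDINGS` are hand-made stand-ins for learned FastText/Word2Vec vectors, comment vectors are formed by simple averaging rather than by Doc2Vec, and k-means is a plain Lloyd's iteration seeded with the first k points. All names are hypothetical.

```python
import math

# Hypothetical 2-D word embeddings (assumption). Real FastText/Word2Vec
# vectors would be learned from the comment corpus and be much larger.
TOY_EMBEDDINGS = {
    "election": (0.9, 0.1),
    "party": (0.8, 0.2),
    "minister": (0.85, 0.15),
    "football": (0.1, 0.9),
    "goal": (0.2, 0.8),
    "match": (0.15, 0.85),
}


def document_vector(tokens, embeddings):
    """Average the vectors of known tokens.

    Averaging is a simple substitute for Doc2Vec, used here only to keep
    the sketch self-contained.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in document")
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy "comments", already tokenised (assumption).
comments = [
    ["election", "party"],
    ["football", "goal"],
    ["minister", "party"],
    ["match", "goal"],
]
doc_vectors = [document_vector(c, TOY_EMBEDDINGS) for c in comments]
labels, centroids = kmeans(doc_vectors, k=2)
```

In this toy setting the two politics-flavoured comments share one cluster label and the two sports-flavoured comments share the other. On real data the number of clusters and the initialisation would need far more care than shown here.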