AlbNews: A Corpus of Headlines For Topic Modeling in Albanian
ing tasks. This paper introduces AlbNews, a collection of 600 topically labeled news headlines and 2600 unlabeled ones in Albanian. The data can be freely used for conducting topic modeling research. We report the initial classification scores of some traditional machine learning classifiers trained with the AlbNews samples. These results show that basic models outrun the ensemble learning ones and can serve as a baseline for future experiments.

1 Introduction

The AI (Artificial Intelligence) and LLM (Large Language Model) developments of today are creating a revolution in the way people solve many language-related tasks. Two pillars that paved the path to these developments were the Transformer neural network architecture (Vaswani et al., 2017) and pretrained LLMs like BERT (Devlin et al., 2019). The internal mechanisms that drive the behavior and performance of LLMs are not fully known, but they resemble human cognition mechanisms (Thoma et al., 2023).

Other important factors that have boosted AI applications are the increase of computing capacities (Gropp et al., 2020), the development of high-level software frameworks (Paszke et al., 2019), and the creation of large text corpora. These computing and software resources make it possible to solve tasks like Text Summarization, Sentiment Analysis, Topic Recognition, etc., in any natural language. However, the language of the required text corpora must usually match the natural language under study. Most of the research corpora available today are in English. For underrepresented or low-resource languages like Albanian, few such corpora have been created, and they are usually small in size. As pointed out by various studies (Blaschke et al., […]

[…] labeled news headlines and 2600 unlabeled ones in Albanian.1 The headlines were collected from online news portals, and the labeled part is annotated as pol for "politics", cul for "culture", eco for "economy", and spo for "sport". The unlabeled part of the corpus consists of headline texts only. The main purpose for creating and releasing this corpus is to foster research in topic modeling and text classification of Albanian texts. We performed some preliminary experiments on the labeled part of the corpus, trying a few traditional machine learning models.2 The results we present can be used as comparison baselines for future experiments.

1 Download from: http://hdl.handle.net/11234/1-5411
2 Code at: https://github.com/erionc/AlbNews

2 Related Work

Topic Modeling is a common NLP research direction which develops AI techniques that can use unlabeled text samples for conducting topic analysis on text corpora. Text or topic classification, on the other hand, is slightly different and utilizes labeled text samples for training machine learning algorithms, which are then used for predicting the topic of other text units or documents. In the context of this paper, we address both tasks, since AlbNews data can be used for both of them.

Some of the earliest works on Topic Modeling, such as Latent Semantic Indexing (Papadimitriou et al., 1998), Probabilistic Latent Semantic Indexing (Hofmann, 1999), and Latent Dirichlet Allocation (Blei et al., 2003), came out in the late 90s and early 00s. They assess the semantic similarity of documents based on word usage statistics. The most recent developments, such as Top2Vec (Angelov, 2020) and BERTopic (Grootendorst, 2022), are based on LLMs and document embeddings.
On the other hand, text classification has been traditionally conceived as a multiclass classification problem, given that the topic of a text unit can be one of a few predefined categories. It has been solved using traditional machine learning classifiers such as Support Vector Machine (Basu et al., 2003). Nevertheless, there are also recent studies that solve text classification tasks using BERT or other LLMs (Sun et al., 2019).

As for the resources, one of the most popular corpora is the 20 Newsgroups collection.3 It contains English texts of 20 different predefined topic categories such as politics, religion, medicine, baseball, etc. Many resources in English are based on copyleft texts of scientific articles (Meng et al., 2017; Çano and Roth, 2022; Nikola Nikolov and Hahnloser, 2018; Çano and Bojar, 2019). They are large in size, but the samples are usually not annotated.

In the context of low-resource languages, and more specifically Albanian, we are not aware of any corpus specifically created for Topic Modeling. There have been some similar attempts to create resources in Albanian for related NLP tasks. AlbMoRe is a recent corpus that contains 800 movie reviews in Albanian and the respective positive or negative sentiment annotations (Çano, 2023a). It can be used to conduct research on sentiment analysis or opinion mining. Another similar resource is the collection of 10132 social media comments which were manually annotated and presented by Kadriu et al. (2022).

There have also been resources for other NLP tasks such as Named Entity Recognition. One of them is AlbNER, a collection of 900 sentences harvested from Albanian Wikipedia (Çano, 2023b). Each of its tokens has been manually annotated with the respective named entity tag. Finally, there are also resources like the Shaj corpus, which is designed for hate speech detection (Nurce et al., 2021). It contains user comments from various social media, annotated using the OffensEval schema for hate speech detection.4

3 http://qwone.com/~jason/20Newsgroups/
4 https://sites.google.com/site/offensevalsharedtask

3 AlbNews Corpus

Driven by the increases in advertising revenue, news websites and news pages in social media are becoming the biggest sources of everyday information nowadays.5 They provide information on various topics such as politics, economy, sport, fashion, culture, technology, etc. The essence of the news comes from the headlines, which are typically 1 to 3 sentences long.

5 https://www.pewresearch.org/journalism/fact-sheet/digital-news/

AlbNews was created by collecting such headlines of Albanian news articles. Each article was published during the period February 2022 - December 2023. Initially, about 6000 headlines were crawled. Some of them were dropped, since they were very short or had badly-formatted content. Only headlines consisting of at least one full and properly formatted sentence were kept. In the end, a total of 3200 headlines was reached. The length statistics in characters and tokens for these final 3200 headlines are shown in Table 1.

            Characters   Tokens
Minimum         38          5
Maximum        187         33
Average        88.13      13.98

Table 1: Headline length statistics.

Because of the limited manpower, only 600 of the 3200 headlines were randomly selected for annotation. The first and the second author worked separately and labeled each headline as pol for "politics", cul for "culture", eco for "economy", and spo for "sport". In most of the cases, the two labels from each annotator matched. The few cases of mismatches were resolved through discussion. Table 2 presents four labeled samples, one per category.

Headline                                                              Topic
Zgjedhjet vendore: rreth 230 vëzhgues të huaj në Shqipëri             pol
("Local elections: around 230 foreign observers in Albania")
Lahuta në UNESCO, komuniteti i bartësve propozon masat mbrojtëse      cul
("The lahuta in UNESCO, the community of bearers proposes
protective measures")
Ulet me shtatë lekë nafta! Çmim më i lirë dhe për benzinën e gazin    eco
("Diesel down by seven lekë! Cheaper price for petrol and gas too")
Spanja godet Italinë në minutën e fundit dhe shkon në finale          spo
("Spain strikes Italy in the last minute and goes to the final")

Table 2: Illustration of four data samples, with English glosses in parentheses.

4 Preliminary Experimental Results

In this section, we present some preliminary results obtained using the AlbNews corpus. We trained
a few traditional machine learning algorithms on the labeled part of the corpus and observed their performance on the topic classification task.

4.1 Preprocessing and Vectorization

We performed some preprocessing steps on each headline text. The headlines were first tokenized, and punctuation and special symbols were removed. White-space symbols like '\n' or '\t' were also removed. The texts were also lowercased to decrease the vocabulary size (the number of unique words). Finally, TF-IDF was applied to vectorize the text words (Zhang et al., 2011). These preprocessing operations are not related to the semantics of the words and do not usually have any influence on topic classification performance.

4.2 Classification Algorithms

One of the most successful algorithms created in the 90s is SVM (Support Vector Machine). It has revealed itself to be fruitful in both classification and regression tasks (Cortes and Vapnik, 1995). SVM is based on the notion of hard and soft class separation margins and was subsequently improved by adding the kernel concept, which makes it possible to separate data that are not linearly separable by means of feature space transformations (Kocsor and Tóth, 2004).

Another simple but effective classifier is Logistic Regression, which offers good performance on a wide range of applications. Logistic Regression utilizes the logistic function for determining the probability of samples belonging to certain classes. Moreover, it runs quite fast. Decision trees utilize hierarchical tree structures to analyze the data features (appearing in tree branches) and make decisions (tree nodes) accordingly. They have been around for many years and have shown strong classification performance when applied to data of different types (Quinlan, 1986).

Other successful classification models are those which combine several basic algorithms to create stronger ones. This concept is known as ensemble learning (Brown, 2010). One family of ensemble learners is based on the boosting concept (Schapire, 2003). These methods try to find prediction "rules of thumb" using basic models, improving the process by repeatedly feeding different training samples to each model instance. After a certain number of iterations, the boosting algorithm combines the learned rules into a single prediction rule that is usually more accurate. Two implementations of this method are Gradient Boosting (Friedman, 2001) and XGBoost (Chen and Guestrin, 2016).

A different approach to ensemble learning is called bagging, and it tries to generate multiple versions of a model (Breiman, 1996). The predictions of the multiple predictors are aggregated to provide the final predictions. One implementation of this idea is the Random Forest method (Ho, 1995). It aggregates predictions obtained from several decision trees.

4.3 Discussion

We trained the classifiers mentioned above with their default parameters on 480 labeled samples and tested them on the remaining 120 samples. The accuracy results that were reached are shown in Table 3.

Model                     Accuracy
Logistic Regression         0.85
Support Vector Machine      0.841
Decision Trees              0.5
Gradient Boosting           0.683
XGBoost                     0.541
Random Forest               0.641

Table 3: Topic classification results.

As we can see, Logistic Regression gives an accuracy score of 0.85, which is the highest. SVM follows, reaching up to 0.84. Decision trees are significantly weaker, reaching only 0.5 (note that random guessing on four categories yields an accuracy score of 0.25). Random Forest performs slightly better, with an accuracy score of 0.64. Even Gradient Boosting and XGBoost, the two boosting implementations, do not perform well. They reach only up to 0.68 and 0.54, respectively.

The results indicate that the simpler methods outrun the more advanced ensemble learning ones. One reason for this could be overfitting, which often happens when the data are small. A possible remedy would be to grid-search and optimize some of the parameters of the classifiers. This could somewhat boost their scores, but it is beyond the scope of this work and could be a future work extension. Another possible future direction could be experimenting with LLMs such as BERT or RoBERTa (Liu et al., 2019). However, there are still no such LLMs pretrained on Albanian texts. Knowledge transfer between English and Albanian leads to poor results (Çano, 2023b), highlighting the need for developing LLMs pretrained with Albanian texts.
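The pipeline described above can be sketched as follows, assuming scikit-learn. The synthetic English "headlines" are illustrative stand-ins for the real Albanian data; only the label set and the 80/20 split ratio (mirroring the 480/120 partition) follow the paper, and XGBoost is omitted since it lives in a separate package.

```python
# Hypothetical, minimal reproduction of the experimental setup:
# TF-IDF vectorization followed by several classical classifiers.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Build a tiny synthetic corpus: ten pseudo-headlines per topic label.
topic_words = {
    "pol": "parliament vote election government minister reform",
    "cul": "festival museum theater concert heritage exhibition",
    "eco": "inflation price market economy bank trade",
    "spo": "match goal final league championship derby",
}
texts, labels = [], []
for topic, words in topic_words.items():
    ws = words.split()
    for i in range(10):
        # Rotate the word list to vary the surface form of each sample.
        texts.append(" ".join(ws[i % len(ws):] + ws[:i % len(ws)]))
        labels.append(topic)

# Lowercasing and punctuation-free tokenization happen inside the
# vectorizer; TF-IDF then maps each headline to a sparse feature vector.
vectorizer = TfidfVectorizer(lowercase=True)
features = vectorizer.fit_transform(texts)

# 80/20 split mirrors the paper's 480/120 train/test partition.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=7, stratify=labels)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": LinearSVC(),
    "Decision Trees": DecisionTreeClassifier(random_state=7),
    "Gradient Boosting": GradientBoostingClassifier(random_state=7),
    "Random Forest": RandomForestClassifier(random_state=7),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

On such trivially separable toy data, all the classifiers score near-perfectly; the contrasts reported in Table 3 only emerge on the real, noisier corpus.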
5 Conclusions

Experimenting on NLP tasks such as Topic Modeling demands the creation of unsupervised, semi-supervised or supervised corpora, which in the case of low-resource languages like Albanian are unavailable, scarce or small. This work introduces AlbNews, a collection of news headlines in Albanian. It consists of 600 labeled headlines and 2600 unlabeled headlines and aims to foster research on Topic Modeling of Albanian texts. A set of preliminary experiments with traditional machine learning methods indicates that the simple ones perform better than those based on ensemble learning.

References

Dimo Angelov. 2020. Top2Vec: Distributed representations of topics. CoRR, abs/2008.09470.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA. Association for Computing Machinery.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.