AlbNews: A Corpus of Headlines for Topic Modeling in Albanian

Erion Çano
Digital Philology
Data Mining and Machine Learning
University of Vienna, Austria
erion.cano@univie.ac.at

Dario Lamaj
Cognitive Science
Department of Applied Informatics
Comenius University Bratislava
lamaj1@uniba.sk

Abstract

The scarcity of available text corpora for low-resource languages like Albanian is a serious hurdle for research in natural language processing tasks. This paper introduces AlbNews, a collection of 600 topically labeled news headlines and 2600 unlabeled ones in Albanian. The data can be freely used for conducting topic modeling research. We report the initial classification scores of some traditional machine learning classifiers trained with the AlbNews samples. These results show that basic models outrun the ensemble learning ones and can serve as a baseline for future experiments.

1 Introduction

The AI (Artificial Intelligence) and LLM (Large Language Model) developments of today are creating a revolution in the way people solve many language-related tasks. Two pillars that paved the path to these developments were the Transformer neural network architecture (Vaswani et al., 2017) and pretrained LLMs like BERT (Devlin et al., 2019). The internal mechanisms that drive the behavior and performance of LLMs are not fully known, but they resemble human cognition mechanisms (Thoma et al., 2023).

Other important factors that have boosted AI applications are the increase of computing capacities (Gropp et al., 2020), the development of high-level software frameworks (Paszke et al., 2019), and the creation of large text corpora. It is possible to use computing and software resources for solving tasks like Text Summarization, Sentiment Analysis, Topic Recognition, etc., in any natural language. However, the language of the required text corpora must usually match the natural language under study. Most of the research corpora available today are in English. For underrepresented or low-resource languages like Albanian, few such corpora have been created and they are usually small in size. As pointed out by various studies (Blaschke et al., 2023), this is a severe limitation which hinders the performance and the variety of tasks that can be solved for low-resource languages.

This paper presents AlbNews, a corpus of 600 labeled news headlines and 2600 unlabeled ones in Albanian (download from: http://hdl.handle.net/11234/1-5411). The headlines were collected from online news portals, and the labeled part is annotated as pol for “politics”, cul for “culture”, eco for “economy”, and spo for “sport”. The unlabeled part of the corpus consists of headline texts only. The main purpose for creating and releasing this corpus is to foster research in topic modeling and text classification of Albanian texts. We performed some preliminary experiments on the labeled part of the corpus, trying a few traditional machine learning models (code at: https://github.com/erionc/AlbNews). The results we present can be used as comparison baselines for future experiments.

2 Related Work

Topic Modeling is a common NLP research direction which develops AI techniques that can use unlabeled text samples for conducting topic analysis on text corpora. Text or topic classification, on the other hand, is slightly different and utilizes labeled text samples for training machine learning algorithms which are then used for predicting the topic of other text units or documents. In the context of this paper, we address both tasks, since AlbNews data can be used for both of them.

Some of the earliest works on Topic Modeling, such as Latent Semantic Indexing (Papadimitriou et al., 1998), Probabilistic Latent Semantic Indexing (Hofmann, 1999), and Latent Dirichlet Allocation (Blei et al., 2003), came out in the late 90s and early 00s. They assess the semantic similarity of documents based on word usage statistics. The most recent developments, such as Top2Vec (Angelov, 2020) and BERTopic (Grootendorst, 2022), are based on LLMs and document embeddings.
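To make the unsupervised side of this task concrete, the following minimal sketch (ours, not from the paper) fits an LDA model on the unlabeled AlbNews headlines with scikit-learn. The input file name and the choice of four topics are assumptions.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical input file: the 2600 unlabeled headlines, one per line
    with open("albnews_unlabeled.txt", encoding="utf-8") as f:
        headlines = [line.strip() for line in f if line.strip()]

    # Bag-of-words counts, then a four-topic LDA model (four topics is an
    # assumption, mirroring the categories of the labeled part)
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(headlines)
    lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(counts)

    # Show the five most probable words of each discovered topic
    vocab = vectorizer.get_feature_names_out()
    for i, weights in enumerate(lda.components_):
        top_words = [vocab[j] for j in weights.argsort()[-5:][::-1]]
        print(f"Topic {i}: {', '.join(top_words)}")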
On the other hand, text classification has been traditionally conceived as a multiclass classification problem, given that the topic of a text unit can be one from a few predefined categories. It has been solved using traditional machine learning classifiers such as Support Vector Machine (Basu et al., 2003). Nevertheless, there are also recent studies that solve text classification tasks using BERT or other LLMs (Sun et al., 2019).

As for the resources, one of the most popular corpora is the 20 Newsgroups collection (http://qwone.com/~jason/20Newsgroups/). It contains English texts of 20 different predefined topic categories such as politics, religion, medicine, baseball, etc. Many resources in English are based on copyleft texts of scientific articles (Meng et al., 2017; Çano and Roth, 2022; Nikolov et al., 2018; Çano and Bojar, 2019). They are large in size, but the samples are usually not annotated.

In the context of low-resource languages, and more specifically Albanian, we are not aware of any corpus specifically created for Topic Modeling. There have been some similar attempts to create resources in Albanian for related NLP tasks. AlbMoRe is a recent corpus that contains 800 movie reviews in Albanian with the respective positive or negative sentiment annotations (Çano, 2023a). It can be used to conduct research on sentiment analysis or opinion mining. Another similar resource is the collection of 10132 social media comments which were manually annotated and presented by Kadriu et al. (2022).

There have also been resources for other NLP tasks such as Named Entity Recognition. One of them is AlbNER, a collection of 900 sentences harvested from Albanian Wikipedia (Çano, 2023b). Each of its tokens has been manually annotated with the respective named entity tag. Finally, there are also resources like the Shaj corpus, which is designed for hate speech detection (Nurce et al., 2021). It contains user comments from various social media, annotated using the OffensEval schema for hate speech detection (https://sites.google.com/site/offensevalsharedtask).

3 AlbNews Corpus

Driven by the increases in advertising revenue, news websites and news pages in social media are becoming the biggest sources of everyday information nowadays (https://www.pewresearch.org/journalism/fact-sheet/digital-news/). They provide information on various topics such as politics, economy, sport, fashion, culture, technology, etc. The essence of the news comes from the headlines, which are typically 1 to 3 sentences long.

AlbNews was created by collecting such headlines of Albanian news articles. Each article was published during the period February 2022 - December 2023. Initially, about 6000 headlines were crawled. Some of them were dropped, since they were very short or had badly-formatted content. Only headlines consisting of at least one full and properly formatted sentence were kept. At the end, a total of 3200 headlines was reached. The length statistics in characters and tokens for these final 3200 headlines are shown in Table 1.

              Characters    Tokens
    Minimum       38            5
    Maximum      187           33
    Average       88.13        13.98

Table 1: Headline length statistics.

Because of the limited manpower, only 600 of the 3200 headlines were randomly selected for annotation. The first and the second author worked separately and labeled each headline as pol for “politics”, cul for “culture”, eco for “economy”, and spo for “sport”. In most of the cases, the two labels from each annotator matched. The few cases of mismatches were resolved through discussion. Table 2 presents four labeled samples, one per category.

    Headline                                                             Topic
    Zgjedhjet vendore: rreth 230 vëzhgues të huaj në Shqipëri            pol
    Lahuta në UNESCO, komuniteti i bartësve propozon masat mbrojtëse     cul
    Ulet me shtatë lekë nafta! Çmim më i lirë dhe për benzinën e gazin   eco
    Spanja godet Italinë në minutën e fundit dhe shkon në finale         spo

Table 2: Illustration of four data samples.
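To make the filtering step and the Table 1 statistics concrete, here is a minimal sketch of how such a cleaning pass could look. The input file name and the exact filtering heuristic are assumptions; the paper does not publish its crawling code.

    # Hypothetical input file: one crawled headline per line
    with open("raw_headlines.txt", encoding="utf-8") as f:
        raw = [line.strip() for line in f]

    # One plausible reading of the filtering rule: drop very short or
    # badly-formatted entries, keep headlines that look like at least
    # one full, properly formatted sentence
    kept = [h for h in raw if len(h.split()) >= 5 and h[:1].isupper()]

    # Length statistics as reported in Table 1
    for name, lengths in [("Characters", [len(h) for h in kept]),
                          ("Tokens", [len(h.split()) for h in kept])]:
        avg = sum(lengths) / len(lengths)
        print(f"{name}: min={min(lengths)} max={max(lengths)} avg={avg:.2f}")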
4 Preliminary Experimental Results

In this section, we present some preliminary results obtained using the AlbNews corpus. We trained a few traditional machine learning algorithms on the labeled part of the corpus and observed their performance on the topic classification task.

4.1 Preprocessing and Vectorization

We performed some preprocessing steps on each headline text. The texts were first tokenized, and punctuation or special symbols were removed. White-space symbols like ‘\n’ or ‘\t’ were also removed. The texts were then lowercased to reduce the vocabulary of unique words. Finally, TF-IDF was applied to vectorize the text words (Zhang et al., 2011). These preprocessing operations are not related to the semantics of the words and do not usually have any influence on topic classification performance.
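A minimal sketch of these steps with scikit-learn follows. It mirrors the description above rather than the authors' exact implementation.

    import re

    from sklearn.feature_extraction.text import TfidfVectorizer

    def preprocess(text: str) -> str:
        """Lowercase, drop punctuation and special symbols, collapse white space."""
        text = text.lower()
        text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation/special symbols
        return " ".join(text.split())         # collapse '\n', '\t' and extra spaces

    headlines = ["Zgjedhjet vendore: rreth 230 vëzhgues të huaj në Shqipëri"]
    cleaned = [preprocess(h) for h in headlines]

    # TF-IDF vectorization of the cleaned headline texts
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(cleaned)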
4.2 Classification Algorithms

One of the most successful algorithms created in the 90s is SVM (Support Vector Machine). It has proven fruitful in both classification and regression tasks (Cortes and Vapnik, 1995). SVM is based on the notion of hard and soft class separation margins and was subsequently improved by adding the kernel concept, which makes it possible to separate data that are not linearly separable by means of feature space transformations (Kocsor and Tóth, 2004).

Another simple but effective classifier is Logistic Regression, which offers good performance on a wide range of applications. Logistic Regression utilizes the logistic function for determining the probability of samples belonging to certain classes. Moreover, it runs quite fast. Decision trees utilize hierarchical tree structures to analyze the data features (appearing in tree branches) and make decisions (tree nodes) accordingly. They have been around for many years and have shown strong classification performance when applied on data of different types (Quinlan, 1986).

Other successful classification models are those which combine several basic algorithms to create stronger ones. This concept is known as ensemble learning (Brown, 2010). One family of ensemble learners is based on the boosting concept (Schapire, 2003). These methods find prediction “rules of thumb” using basic models and improve the process by repeatedly feeding different training samples to each model instance. After a certain number of iterations, the boosting algorithm combines the learned rules into a single prediction rule that is usually more accurate. Two implementations of this method are Gradient Boosting (Friedman, 2001) and XGBoost (Chen and Guestrin, 2016).

A different approach to ensemble learning is called bagging, which tries to generate multiple versions of a model (Breiman, 1996). The predictions of the multiple predictors are aggregated to provide the final predictions. One implementation of this idea is the Random Forest method (Ho, 1995). It aggregates predictions obtained from several decision trees.

4.3 Discussion

We trained the classifiers mentioned above with their default parameters on 480 labeled samples and tested them on the remaining 120 samples. The accuracy results that were reached are shown in Table 3.

    Model                     Accuracy
    Logistic Regression         0.85
    Support Vector Machine      0.841
    Decision Trees              0.5
    Gradient Boosting           0.683
    XGBoost                     0.541
    Random Forest               0.641

Table 3: Topic classification results.

As we can see, Logistic Regression gives an accuracy score of 0.85, which is the highest. SVM follows, reaching up to 0.84. Decision trees are significantly weaker, reaching only 0.5 (note that random guessing on four categories yields an accuracy score of 0.25). Random Forest performs slightly better, with an accuracy score of 0.64. Even Gradient Boosting and XGBoost, the two boosting implementations, do not perform well. They reach up to 0.68 and 0.54 only.

The results indicate that the simpler methods outrun the more advanced ensemble learning ones. One reason for this could be overfitting, which happens often when the data are small. Another possibility would be to grid-search and optimize some of the parameters of the classifiers. This could somewhat boost their scores, but it is beyond the scope of this work and could be a future work extension. Another possible future work direction could be experimenting with LLMs such as BERT or RoBERTa (Liu et al., 2019). However, there are still no such LLMs pretrained on Albanian texts. Knowledge transfer between English and Albanian leads to poor results (Çano, 2023b), highlighting the need for developing LLMs pretrained with Albanian texts.
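The following sketch reproduces this evaluation protocol with scikit-learn and the xgboost package. The corpus file name and layout, the split seed, and the absence of stratification are assumptions; the authors' actual code is in the repository linked in Section 1.

    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    # Hypothetical corpus layout: tab-separated "headline<TAB>label" lines
    with open("albnews_labeled.tsv", encoding="utf-8") as f:
        texts, labels = zip(*(line.rstrip("\n").split("\t") for line in f))

    X = TfidfVectorizer().fit_transform(texts)
    y = LabelEncoder().fit_transform(labels)  # XGBoost needs integer class ids

    # 480 training and 120 test samples, as described in Section 4.3
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=120, random_state=0)

    # The six classifiers of Table 3, all with default parameters
    classifiers = {
        "Logistic Regression": LogisticRegression(),
        "Support Vector Machine": SVC(),
        "Decision Trees": DecisionTreeClassifier(),
        "Gradient Boosting": GradientBoostingClassifier(),
        "XGBoost": XGBClassifier(),
        "Random Forest": RandomForestClassifier(),
    }
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        print(f"{name}: {accuracy_score(y_test, clf.predict(X_test)):.3f}")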
5 Conclusions

Experimenting on NLP tasks such as Topic Modeling demands the creation of unsupervised, semi-supervised or supervised corpora which, in the case of low-resource languages like Albanian, are unavailable, scarce or small. This work introduces AlbNews, a collection of news headlines in Albanian. It consists of 600 labeled headlines and 2600 unlabeled headlines and aims to foster research on Topic Modeling of Albanian texts. A set of preliminary experiments with traditional machine learning methods indicates that the simple ones perform better than those based on ensemble learning.

References

Dimo Angelov. 2020. Top2Vec: Distributed representations of topics. CoRR, abs/2008.09470.

A. Basu, C. Walters, and M. Shepherd. 2003. Support vector machines for text categorization. In Proceedings of the 36th Annual Hawaii International Conference on System Sciences, 7 pp.

Verena Blaschke, Hinrich Schuetze, and Barbara Plank. 2023. A survey of corpora for Germanic low-resource languages and dialects. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 392–414, Tórshavn, Faroe Islands. University of Tartu Library.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Leo Breiman. 1996. Bagging predictors. Machine Learning, 24(2):123–140.

Gavin Brown. 2010. Ensemble Learning, pages 312–320. Springer US, Boston, MA.

Erion Çano. 2023a. AlbMoRe: A corpus of movie reviews for sentiment analysis in Albanian. CoRR, abs/2306.08526.

Erion Çano. 2023b. AlbNER: A corpus for named entity recognition in Albanian. CoRR, abs/2309.08741.

Erion Çano and Ondřej Bojar. 2019. Efficiency metrics for data-driven models: A text summarization case study. In Proceedings of the 12th International Conference on Natural Language Generation, pages 229–239, Tokyo, Japan. Association for Computational Linguistics.

Erion Çano and Benjamin Roth. 2022. Topic segmentation of research article collections. CoRR, abs/2205.11249.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA. Association for Computing Machinery.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. CoRR, abs/2203.05794.

William Gropp, Sujata Banerjee, and Ian T. Foster. 2020. Infrastructure for artificial intelligence, quantum and high performance computing. CoRR, abs/2012.09303.

Tin Kam Ho. 1995. Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1), ICDAR '95, pages 278–282, Washington, DC, USA. IEEE Computer Society.

Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50–57, New York, NY, USA. Association for Computing Machinery.

Fatbardh Kadriu, Doruntina Murtezaj, Fatbardh Gashi, Lule Ahmedi, Arianit Kurti, and Zenun Kastrati. 2022. Human-annotated dataset for social media sentiment analysis for Albanian language. Data in Brief, 43:108436.

András Kocsor and László Tóth. 2004. Application of kernel-based feature space transformations and learning methods to phoneme classification. Applied Intelligence, 21(2):129–142.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 582–592, Vancouver, Canada. Association for Computational Linguistics.
Nikola Nikolov, Michael Pfeiffer, and Richard Hahnloser. 2018. Data-driven summarization of scientific articles. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).

Erida Nurce, Jorgel Keci, and Leon Derczynski. 2021. Detecting abusive Albanian. CoRR, abs/2107.13592.

Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. 1998. Latent semantic indexing: a probabilistic analysis. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS '98, pages 159–168, New York, NY, USA. Association for Computing Machinery.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: an imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA.

J. R. Quinlan. 1986. Induction of decision trees. Machine Learning, 1:81–106.

Robert E. Schapire. 2003. The Boosting Approach to Machine Learning: An Overview, pages 149–171. Springer New York, New York, NY.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? CoRR, abs/1905.05583.
Lukas Thoma, Ivonne Weyers, Erion Çano, Stefan Schweter, Jutta L. Mueller, and Benjamin Roth. 2023. CogMemLM: Human-like memory mechanisms improve performance and cognitive plausibility of LLMs. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 180–185, Singapore. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Wen Zhang, Taketoshi Yoshida, and Xijin Tang. 2011. A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Systems with Applications, 38(3):2758–2765.
