Abstract
Recent studies have demonstrated that training on diverse datasets enhances the overall knowledge and generalization capabilities of large-scale language models, especially in cross-domain scenarios. In line with this, we introduce Bella Turca: a comprehensive Turkish text corpus, totaling 265GB, specifically curated for training language models. Bella Turca encompasses 25 distinct subsets spanning 4 genres, carefully chosen to ensure diversity and high quality. Although Turkish is spoken widely across three continents, it suffers from a dearth of robust data resources for language modeling. Existing transformers and language models have primarily relied on repetitive corpora such as OSCAR and/or Wikipedia, which lack the desired diversity. Our work aims to break free from this monotony by introducing a fresh perspective to Turkish corpus resources. To the best of our knowledge, this release marks the first instance of such a vast and diverse dataset tailored for the Turkish language. Additionally, we contribute to the community by providing the code used in the dataset’s construction and cleaning, fostering collaboration and knowledge sharing.
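As an illustration of the kind of construction and cleaning code released with the corpus, the sketch below shows a minimal document-level filter and exact-deduplication pass for Turkish web text in Python. It is not the actual Bella Turca pipeline: the length and digit-ratio thresholds, the Turkish-character heuristic, and the raw_docs iterable are assumptions made for the example.

    import hashlib
    import re

    def is_reasonable_document(text, min_chars=200, max_digit_ratio=0.3):
        # Drop very short documents and documents dominated by digits,
        # a common first pass before any model-based quality filtering.
        if len(text) < min_chars:
            return False
        digits = sum(ch.isdigit() for ch in text)
        if digits / len(text) > max_digit_ratio:
            return False
        # Require at least one Turkish-specific character as a cheap language cue.
        return bool(re.search(r"[çğıöşüÇĞİÖŞÜ]", text))

    def deduplicate(documents):
        # Exact deduplication via content hashes; near-duplicate detection
        # (e.g. MinHash) would be a separate, heavier step.
        seen, unique = set(), []
        for doc in documents:
            digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    # Hypothetical usage, where raw_docs is an iterable of crawled documents:
    # cleaned = deduplicate(d for d in raw_docs if is_reasonable_document(d))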
Notes
- 1. The dataset’s name, “Bella Turca,” blends Latin and Italian and signifies “Beautiful Turkish Language.” Originally, our intention was to have a name entirely in Latin. However, we opted to take “Bella” from Italian, a Romance language, to emphasize the elegance, richness, and aesthetic appeal of the Turkish language. For the second word, we sought a term that would harmonize with “Bella.” We decided against “lingua Turcica” to avoid excessive length and instead let “language” stand in for “person,” which posed no harm, leaving “Turca” or “Turcus,” denoting a Turkish person in the feminine or masculine form, respectively. We chose “Turca” for two reasons: it rhymes with “Bella,” and the author, being female, favored the feminine form. In essence, we believe that only a language as rich as Latin can truly extol the richness and beauty of the Turkish language.
- 5. The Council of Higher Education, https://acikerisim.yok.gov.tr/acik-erisim.
- 6. Offered by TÜBİTAK, the Scientific and Technological Research Council of Türkiye, https://dergipark.org.tr.
- 15. The term “quality filter,” although commonly used in the literature, may not accurately depict the result of filtering a dataset. “Quality” can be read as a judgment on the informativeness, comprehensiveness, or other subjective characteristics valued by humans. However, it is important to note that the filters employed in Bella Turca and other language modelling efforts are based on criteria that inherently carry ideological implications [13].
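For concreteness, a perplexity-based quality filter of the kind commonly used in corpus construction can be sketched with KenLM [14]. The model file, the threshold, and the per-token perplexity formula below are illustrative assumptions, not the criteria actually applied in Bella Turca.

    import kenlm  # Python bindings for the KenLM toolkit [14]

    # Assumed artifact: an n-gram model trained on text deemed "clean".
    model = kenlm.Model("tr_5gram.binary")

    def perplexity(text):
        # KenLM returns a log10 probability; convert it to per-token perplexity.
        n_tokens = len(text.split()) + 1  # +1 for the end-of-sentence token
        log10_prob = model.score(text, bos=True, eos=True)
        return 10.0 ** (-log10_prob / n_tokens)

    def passes_quality_filter(text, threshold=1000.0):
        # Documents that look too unlike the "clean" reference text are dropped.
        # The choice of reference corpus and threshold is exactly where the
        # ideological judgment discussed in this note enters [13].
        return perplexity(text) <= threshold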
References
Abadji, J., Ortiz Suarez, P., Romary, L., Sagot, B.: Towards a Cleaner Document-Oriented Multilingual Crawled Corpus, January 2022. arXiv e-prints arXiv:2201.06642
Barbaresi, A.: Trafilatura: a web scraping library and command-line tool for text discovery and extraction. In: Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 122–131. Association for Computational Linguistics (2021), https://aclanthology.org/2021.acl-demo.15
BigScience Workshop et al.: BLOOM: a 176B-parameter open-access multilingual language model (2023)
Black, S., et al.: GPT-NeoX-20B: an open-source autoregressive language model (2022)
Brown, T.B., et al.: Language models are few-shot learners (2020)
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., Zhang, C.: Quantifying memorization across neural language models (2023)
Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-trained BERT model and evaluation data (2023)
Together Computer: RedPajama: an open source recipe to reproduce LLaMA training dataset, April 2023. https://github.com/togethercomputer/RedPajama-Data
DeepMind: Scaling language models: methods, analysis & insights from training Gopher (2022)
Elazar, Y., et al.: What’s in my big data? (2024)
Gao, L., et al.: The Pile: an 800GB dataset of diverse text for language modeling (2020)
Gunasekar, S., et al.: Textbooks are all you need (2023)
Gururangan, S., et al.: Whose language counts as high quality? measuring language ideologies in text data selection. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.) Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2562–2580. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, December 2022. https://doi.org/10.18653/v1/2022.emnlp-main.165, https://aclanthology.org/2022.emnlp-main.165
Heafield, K.: KenLM: Faster and smaller language model queries. In: Callison-Burch, C., Koehn, P., Monz, C., Zaidan, O.F. (eds.) Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197. Association for Computational Linguistics, Edinburgh, Scotland, July 2011. https://aclanthology.org/W11-2123
Hoffmann, J., Borgeaud, S., et al.: Training compute-optimal large language models (2022)
Laurençon, H., et al.: The BigScience ROOTS corpus: a 1.6TB composite multilingual dataset (2023)
Touvron, H., et al.: Llama 2: Open foundation and fine-tuned chat models (2023)
Kaplan, J., et al.: Scaling laws for neural language models (2020)
Kesgin, H.T., Yuce, M.K., Amasyali, M.F.: Developing and evaluating tiny to medium-sized Turkish BERT models (2023)
Le, H., et al.: FlauBERT: Unsupervised language model pre-training for French. In: Calzolari, N., et al. (eds.) Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 2479–2490. European Language Resources Association, Marseille, France, May 2020. https://aclanthology.org/2020.lrec-1.302
Lepikhin, D., et al.: GShard: scaling giant models with conditional computation and automatic sharding (2020)
Lieber, O., Sharir, O., Lenz, B., Shoham, Y.: Jurassic-1: Technical details and evaluation. Tech. rep., AI21 Labs, August 2021
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019)
Penedo, G., et al.: The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only (2023)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019). https://api.semanticscholar.org/CorpusID:160025533
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer (2023)
Rosset, C.: Turing-NLG: a 17-billion-parameter language model by Microsoft (2019)
Schweter, S.: BERTurk - BERT models for Turkish, April 2020. https://doi.org/10.5281/zenodo.3770924
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-LM: training multi-billion parameter language models using model parallelism (2020)
Soldaini, L., et al.: Dolma: an open corpus of three trillion tokens for language model pretraining research (2024)
Souza, F., Nogueira, R., Lotufo, R.: BERT models for Brazilian Portuguese: pretraining, evaluation and tokenization analysis. Appl. Soft Comput. 149, 110901 (2023)
Tas, N.: RoBERTurk: adjusting RoBERTa for Turkish (2024)
PaLM Team: PaLM 2 technical report (2023)
Qwen Team: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA), Istanbul, Turkey, May 2012
Touvron, H., et al.: LLaMA: open and efficient foundation language models (2023)
Virtanen, A., et al.: Multilingual is not enough: BERT for Finnish (2019)
Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: Calzolari, N., et al.: (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, May 2018. https://aclanthology.org/L18-1686
Wang, B., Komatsuzaki, A.: GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, May 2021. https://github.com/kingoflolz/mesh-transformer-jax
Zhang, S., et al.: OPT: open pre-trained transformer language models (2022)
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article. The authors did not receive support from any organization for the submitted work.
A Data Samples
Below are two random, unbiased samples from each of Bella Turca’s collections, drawn from the train split. To fit within the page limit, some samples, such as articles, were cropped. A sketch of this sampling procedure follows the list of collections below.
A.1 AcademiaCrawl
A.2 Books
A.3 CleanOSCAR and CleanMC4
A.4 CraftedCrawl
A.5 CustomerTrends
A.6 ForumChats
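The sketch below illustrates how such per-subset samples can be drawn; the load_train_split helper and the fixed random seed are hypothetical stand-ins for the released loading code.

    import random

    SUBSETS = ["AcademiaCrawl", "Books", "CleanOSCAR", "CleanMC4",
               "CraftedCrawl", "CustomerTrends", "ForumChats"]

    def load_train_split(subset_name):
        # Hypothetical placeholder so the sketch runs end to end; the actual
        # release ships its own loading code for each collection.
        return [f"{subset_name} document {i}" for i in range(10)]

    def draw_samples(per_subset=2, seed=42):
        # Uniform sampling without replacement from each collection's train
        # split, matching the two-samples-per-collection setup above.
        rng = random.Random(seed)
        return {name: rng.sample(load_train_split(name), per_subset)
                for name in SUBSETS}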
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Altinok, D. (2024). Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science(), vol 15048. Springer, Cham. https://doi.org/10.1007/978-3-031-70563-2_16
DOI: https://doi.org/10.1007/978-3-031-70563-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70562-5
Online ISBN: 978-3-031-70563-2
eBook Packages: Computer Science (R0)