Abstract
Recent studies have demonstrated that training on diverse datasets enhances the overall knowledge and generalization capabilities of large-scale language models, especially in cross-domain scenarios. In line with this, we introduce Bella Turca: a comprehensive Turkish text corpus, totaling 265GB, specifically curated for training language models. Bella Turca encompasses 25 distinct subsets spanning 4 genres, carefully chosen to ensure diversity and high quality. Although Turkish is spoken widely across three continents, it suffers from a dearth of robust data resources for language modeling. Existing transformers and language models have primarily relied on repetitive corpora such as OSCAR and/or Wikipedia, which lack the desired diversity. Our work aims to break free from this monotony by introducing a fresh perspective to Turkish corpus resources. To the best of our knowledge, this release marks the first instance of such a vast and diverse dataset tailored for the Turkish language. Additionally, we contribute to the community by providing the code used in the dataset’s construction and cleaning, fostering collaboration and knowledge sharing.
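As an illustration of the kind of construction and cleaning code released with the corpus, the sketch below shows a minimal document-level filter and exact-deduplication pass for Turkish web text in Python. It is not the actual Bella Turca pipeline: the length and digit-ratio thresholds, the Turkish-character heuristic, and the raw_docs iterable are assumptions made for the example.

    import hashlib
    import re

    def is_reasonable_document(text, min_chars=200, max_digit_ratio=0.3):
        # Drop very short documents and documents dominated by digits,
        # a common first pass before any model-based quality filtering.
        if len(text) < min_chars:
            return False
        digits = sum(ch.isdigit() for ch in text)
        if digits / len(text) > max_digit_ratio:
            return False
        # Require at least one Turkish-specific character as a cheap language cue.
        return bool(re.search(r"[çğıöşüÇĞİÖŞÜ]", text))

    def deduplicate(documents):
        # Exact deduplication via content hashes; near-duplicate detection
        # (e.g. MinHash) would be a separate, heavier step.
        seen, unique = set(), []
        for doc in documents:
            digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    # Hypothetical usage, where raw_docs is an iterable of crawled documents:
    # cleaned = deduplicate(d for d in raw_docs if is_reasonable_document(d))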
Notes
- 1. The dataset’s name, “Bella Turca,” blends Latin and Italian and signifies “Beautiful Turkish Language.” Originally, our intention was to have a name entirely in Latin. However, we opted to take “Bella” from Italian, a Romance language, to emphasize the elegance, richness, and aesthetic appeal of the Turkish language. For the second word, we sought a term that would harmonize with “Bella.” We decided against “lingua Turcica” to avoid excessive length and instead let “language” stand in for “person,” which posed no harm, leaving “Turca” or “Turcus,” denoting a Turkish person in the feminine or masculine form, respectively. We chose “Turca” for two reasons: it rhymes with “Bella,” and the author, being female, favored the feminine form. In essence, we believe that only a language as rich as Latin can truly extol the richness and beauty of the Turkish language.
- 5. The Council of Higher Education, https://acikerisim.yok.gov.tr/acik-erisim.
- 6. Offered by TÜBİTAK, the Scientific and Technological Research Council of Türkiye, https://dergipark.org.tr.
- 15. The term “quality filter,” although commonly used in the literature, may not accurately depict the result of filtering a dataset. “Quality” can be read as a judgment on the informativeness, comprehensiveness, or other subjective characteristics valued by humans. However, it is important to note that the filters employed in Bella Turca and other language modelling efforts are based on criteria that inherently carry ideological implications [13].
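For concreteness, a perplexity-based quality filter of the kind commonly used in corpus construction can be sketched with KenLM [14]. The model file, the threshold, and the per-token perplexity formula below are illustrative assumptions, not the criteria actually applied in Bella Turca.

    import kenlm  # Python bindings for the KenLM toolkit [14]

    # Assumed artifact: an n-gram model trained on text deemed "clean".
    model = kenlm.Model("tr_5gram.binary")

    def perplexity(text):
        # KenLM returns a log10 probability; convert it to per-token perplexity.
        n_tokens = len(text.split()) + 1  # +1 for the end-of-sentence token
        log10_prob = model.score(text, bos=True, eos=True)
        return 10.0 ** (-log10_prob / n_tokens)

    def passes_quality_filter(text, threshold=1000.0):
        # Documents that look too unlike the "clean" reference text are dropped.
        # The choice of reference corpus and threshold is exactly where the
        # ideological judgment discussed in this note enters [13].
        return perplexity(text) <= threshold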
References
Abadji, J., Ortiz Suarez, P., Romary, L., Sagot, B.: Towards a Cleaner Document-Oriented Multilingual Crawled Corpus, January 2022. arXiv e-prints arXiv:2201.06642
Barbaresi, A.: Trafilatura: a web scraping library and command-line tool for text discovery and extraction. In: Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 122–131. Association for Computational Linguistics (2021), https://aclanthology.org/2021.acl-demo.15
BigScience Workshop et al.: BLOOM: a 176B-parameter open-access multilingual language model (2023)
Black, S., et al.: GPT-NeoX-20B: an open-source autoregressive language model (2022)
Brown, T.B., et al.: Language models are few-shot learners (2020)
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., Zhang, C.: Quantifying memorization across neural language models (2023)
Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-trained BERT model and evaluation data (2023)
Together Computer: RedPajama: an open source recipe to reproduce LLaMA training dataset, April 2023. https://github.com/togethercomputer/RedPajama-Data
DeepMind: Scaling language models: methods, analysis & insights from training Gopher (2022)
Elazar, Y., et al.: What’s in my big data? (2024)
Gao, L., et al.: The Pile: an 800GB dataset of diverse text for language modeling (2020)
Gunasekar, S., et al.: Textbooks are all you need (2023)
Gururangan, S., et al.: Whose language counts as high quality? measuring language ideologies in text data selection. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.) Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2562–2580. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, December 2022. https://doi.org/10.18653/v1/2022.emnlp-main.165, https://aclanthology.org/2022.emnlp-main.165
Heafield, K.: KenLM: Faster and smaller language model queries. In: Callison-Burch, C., Koehn, P., Monz, C., Zaidan, O.F. (eds.) Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197. Association for Computational Linguistics, Edinburgh, Scotland, July 2011. https://aclanthology.org/W11-2123
Hoffmann, J., Borgeaud, S., et al.: Training compute-optimal large language models (2022)
Laurençon, H., et al.: The BigScience ROOTS corpus: a 1.6TB composite multilingual dataset (2023)
Touvron, H., et al.: Llama 2: Open foundation and fine-tuned chat models (2023)
Kaplan, J., et al.: Scaling laws for neural language models (2020)
Kesgin, H.T., Yuce, M.K., Amasyali, M.F.: Developing and evaluating tiny to medium-sized Turkish BERT models (2023)
Le, H., et al.: FlauBERT: Unsupervised language model pre-training for French. In: Calzolari, N., et al. (eds.) Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 2479–2490. European Language Resources Association, Marseille, France, May 2020. https://aclanthology.org/2020.lrec-1.302
Lepikhin, D., et al.: GShard: scaling giant models with conditional computation and automatic sharding (2020)
Lieber, O., Sharir, O., Lenz, B., Shoham, Y.: Jurassic-1: Technical details and evaluation. Tech. rep., AI21 Labs, August 2021
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019)
Penedo, G., et al.: The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only (2023)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019). https://api.semanticscholar.org/CorpusID:160025533
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer (2023)
Rosset, C.: Turing-NLG: a 17-billion-parameter language model by Microsoft (2019)
Schweter, S.: BERTurk - BERT models for Turkish, April 2020. https://doi.org/10.5281/zenodo.3770924
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-LM: training multi-billion parameter language models using model parallelism (2020)
Soldaini, L., et al.: Dolma: an open corpus of three trillion tokens for language model pretraining research (2024)
Souza, F., Nogueira, R., Lotufo, R.: BERT models for Brazilian Portuguese: pretraining, evaluation and tokenization analysis. Appl. Soft Comput. 149, 110901 (2023)
Tas, N.: RoBERTurk: adjusting RoBERTa for Turkish (2024)
PaLM Team: PaLM 2 technical report (2023)
Qwen Team: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA), Istanbul, Turkey, May 2012
Touvron, H., et al.: LLaMA: open and efficient foundation language models (2023)
Virtanen, A., et al.: Multilingual is not enough: BERT for Finnish (2019)
Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: Calzolari, N., et al.: (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, May 2018. https://aclanthology.org/L18-1686
Wang, B., Komatsuzaki, A.: GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, May 2021. https://github.com/kingoflolz/mesh-transformer-jax
Zhang, S., et al.: OPT: open pre-trained transformer language models (2022)
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article. The authors did not receive support from any organization for the submitted work.
A Data Samples
Below are two random, unbiased samples from each of Bella Turca’s collections, drawn from the train split. To fit within the page limit, some samples, such as articles, were cropped. A sketch of this sampling procedure follows the list of collections below.
A.1 AcademiaCrawl
A.2 Books
A.3 CleanOSCAR and CleanMC4
A.4 CraftedCrawl
A.5 CustomerTrends
A.6 ForumChats
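The sketch below illustrates how such per-subset samples can be drawn; the load_train_split helper and the fixed random seed are hypothetical stand-ins for the released loading code.

    import random

    SUBSETS = ["AcademiaCrawl", "Books", "CleanOSCAR", "CleanMC4",
               "CraftedCrawl", "CustomerTrends", "ForumChats"]

    def load_train_split(subset_name):
        # Hypothetical placeholder so the sketch runs end to end; the actual
        # release ships its own loading code for each collection.
        return [f"{subset_name} document {i}" for i in range(10)]

    def draw_samples(per_subset=2, seed=42):
        # Uniform sampling without replacement from each collection's train
        # split, matching the two-samples-per-collection setup above.
        rng = random.Random(seed)
        return {name: rng.sample(load_train_split(name), per_subset)
                for name in SUBSETS}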
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Altinok, D. (2024). Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science(), vol 15048. Springer, Cham. https://doi.org/10.1007/978-3-031-70563-2_16
DOI: https://doi.org/10.1007/978-3-031-70563-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70562-5
Online ISBN: 978-3-031-70563-2
eBook Packages: Computer Science (R0)