Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling

  • Conference paper
  • Appears in: Text, Speech, and Dialogue (TSD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15048)

Abstract

In recent studies, it has been demonstrated that incorporating diverse training datasets enhances the overall knowledge and generalization capabilities of large-scale language models, especially in cross-domain scenarios. In line with this, we introduce Bella Turca: a comprehensive Turkish text corpus, totaling 265GB, specifically curated for training language models. Bella Turca encompasses 25 distinct subsets across four genres, carefully chosen to ensure diversity and high quality. Although Turkish is widely spoken across three continents, it suffers from a dearth of robust data resources for language modeling. Existing transformers and language models have primarily relied on repetitive corpora such as OSCAR and/or Wiki, which lack the desired diversity. Our work aims to break free from this monotony by introducing a fresh perspective to Turkish corpora resources. To the best of our knowledge, this release marks the first instance of such a vast and diverse dataset tailored for the Turkish language. Additionally, we contribute to the community by providing the code used in the dataset’s construction and cleaning, fostering collaboration and knowledge sharing.
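
For readers who want to inspect the released corpus, the following is a minimal sketch of streaming it from the Hugging Face Hub with the datasets library, so the full 265GB need not be downloaded up front. The dataset identifier is taken from footnote 2 and the split name from the paper's mention of a train split; the record fields are an assumption, and if the release exposes its subsets as separate configurations, a configuration name would additionally need to be passed to load_dataset.

```python
# Illustrative sketch, not the authors' code: stream Bella Turca from the
# Hugging Face Hub (dataset id taken from footnote 2).
from datasets import load_dataset

# streaming=True iterates over records without downloading the ~265GB corpus.
# If the dataset defines multiple configurations (one per subset), pass the
# desired configuration name as the second argument.
stream = load_dataset(
    "turkish-nlp-suite/BellaTurca",
    split="train",
    streaming=True,
)

# Inspect a few records; the exact field names (e.g. a "text" column) are an
# assumption and should be checked against the dataset card.
for i, example in enumerate(stream):
    print(example)
    if i >= 2:
        break
```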

Notes

  1.

    The dataset’s name, “Bella Turca,” is a blend of Latin and Italian, signifying “Beautiful Turkish Language.” Originally, we intended to use a purely Latin name, but we adopted “Bella” from Italian, a Romance language, to emphasize the elegance, richness, and aesthetic appeal of the Turkish language. For the second word, we sought a term that would harmonize with “Bella.” We decided against “lingua Turcica” as too long. Instead, we let “Turca” or “Turcus,” which denote a Turkish person in the feminine or masculine form respectively, stand in for “the Turkish language,” as this posed no harm. We chose “Turca” for two reasons: it rhymes with “Bella,” and the author, being female, favored the feminine form. In essence, we believe that only a language as rich as Latin can truly extol the richness and beauty of the Turkish language.

  2.

    https://huggingface.co/datasets/turkish-nlp-suite/BellaTurca.

  3.

    https://github.com/turkish-nlp-suite/bella-turca-cleaners.

  4.

    https://stars.bilkent.edu.tr/turkce/.

  5.

    The Council of Higher Education, https://acikerisim.yok.gov.tr/acik-erisim.

  6.

    Offered by TÜBİTAK, the Scientific and Technological Research Council of Türkiye, https://dergipark.org.tr.

  7.

    https://ekitap.ktb.gov.tr/.

  8.

    https://tr.wikipedia.org/wiki/Altkitap.

  9.

    https://www.anadolu.edu.tr/en/open-education.

  10.

    https://huggingface.co/datasets/oscar-corpus/OSCAR-2109.

  11.

    https://huggingface.co/datasets/oscar-corpus/OSCAR-2201.

  12.

    https://huggingface.co/datasets/oscar-corpus/OSCAR-2301.

  13.

    https://huggingface.co/datasets/turkish-nlp-suite/ForumSohbetleri.

  14.

    https://pypi.org/project/pycld2/.

  15.

    The term “quality filter,” although commonly used in the literature, may not accurately describe the result of filtering a dataset. The term “quality” can be interpreted as a judgment on the informativeness, comprehensiveness, or other subjective characteristics valued by humans. However, it is important to note that the filters employed in Bella Turca and other language model endeavors are based on criteria that inherently carry ideological implications [13].

  16.

    https://pypi.org/project/PyMuPDF/.

  17.

    https://www.regulations.gov/comment/COLC-2023-0006-8762.
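
The released construction and cleaning code (footnote 3) is the authoritative reference for the pipeline. Purely as an illustration of the kind of steps the tools cited in footnotes 14 (pycld2) and 16 (PyMuPDF) support, the sketch below extracts plain text from a PDF and keeps a document only if CLD2 identifies it as Turkish. The file name, the reliability check, and the 90% threshold are hypothetical placeholders, not values taken from the paper.

```python
# Illustrative sketch only; see the repository in footnote 3 for the actual
# cleaning code. Shows typical use of PyMuPDF (footnote 16) and pycld2
# (footnote 14) for PDF text extraction and language filtering.
import fitz            # PyMuPDF
import pycld2 as cld2


def pdf_to_text(path: str) -> str:
    """Extract plain text from a PDF, page by page."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def is_turkish(text: str, min_percent: int = 90) -> bool:
    """Return True if CLD2 reliably detects Turkish covering most of the text."""
    try:
        is_reliable, _, details = cld2.detect(text)
    except cld2.error:
        return False  # e.g. invalid UTF-8 input
    # details is a tuple of (language_name, language_code, percent, score)
    return is_reliable and any(
        code == "tr" and percent >= min_percent
        for _, code, percent, _ in details
    )


if __name__ == "__main__":
    text = pdf_to_text("example_thesis.pdf")  # hypothetical input file
    if is_turkish(text):
        print(text[:500])
```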

References

  1. Abadji, J., Ortiz Suarez, P., Romary, L., Sagot, B.: Towards a Cleaner Document-Oriented Multilingual Crawled Corpus, January 2022. arXiv e-prints arXiv:2201.06642

  2. Barbaresi, A.: Trafilatura: a web scraping library and command-line tool for text discovery and extraction. In: Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 122–131. Association for Computational Linguistics (2021), https://aclanthology.org/2021.acl-demo.15

  3. BigScience Workshop et al.: Bloom: A 176b-parameter open-access multilingual language model (2023)

  4. Black, S., et al.: Gpt-neox-20b: an open-source autoregressive language model (2022)

  5. Brown, T.B., et al.: Language models are few-shot learners (2020)

  6. Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., Zhang, C.: Quantifying memorization across neural language models (2023)

  7. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-trained bert model and evaluation data (2023)

  8. Together Computer: RedPajama: an open source recipe to reproduce LLaMA training dataset, April 2023. https://github.com/togethercomputer/RedPajama-Data

  9. DeepMind: Scaling language models: Methods, analysis & insights from training gopher (2022)

  10. Elazar, Y., et al.: What’s in my big data? (2024)

  11. Gao, L., et al.: The pile: an 800gb dataset of diverse text for language modeling (2020)

  12. Gunasekar, S., et al.: Textbooks are all you need (2023)

  13. Gururangan, S., et al.: Whose language counts as high quality? measuring language ideologies in text data selection. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.) Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2562–2580. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, December 2022. https://doi.org/10.18653/v1/2022.emnlp-main.165, https://aclanthology.org/2022.emnlp-main.165

  14. Heafield, K.: KenLM: Faster and smaller language model queries. In: Callison-Burch, C., Koehn, P., Monz, C., Zaidan, O.F. (eds.) Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197. Association for Computational Linguistics, Edinburgh, Scotland, July 2011. https://aclanthology.org/W11-2123

  15. Hoffmann, J., Borgeaud, S., et al.: Training compute-optimal large language models (2022)

  16. Laurençon, H., et al.: The BigScience ROOTS corpus: a 1.6TB composite multilingual dataset (2023)

  17. Touvron, H., et al.: Llama 2: Open foundation and fine-tuned chat models (2023)

  18. Kaplan, J., et al.: Scaling laws for neural language models (2020)

  19. Kesgin, H.T., Yuce, M.K., Amasyali, M.F.: Developing and evaluating tiny to medium-sized Turkish BERT models (2023)

  20. Le, H., et al.: FlauBERT: Unsupervised language model pre-training for French. In: Calzolari, N., et al. (eds.) Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 2479–2490. European Language Resources Association, Marseille, France, May 2020. https://aclanthology.org/2020.lrec-1.302

  21. Lepikhin, D., et al.: Gshard: Scaling giant models with conditional computation and automatic sharding (2020)

  22. Lieber, O., Sharir, O., Lenz, B., Shoham, Y.: Jurassic-1: Technical details and evaluation. Tech. rep., AI21 Labs, August 2021

  23. Liu, Y., et al.: Roberta: A robustly optimized bert pretraining approach (2019)

  24. Penedo, G., et al.: The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only (2023)

  25. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019). https://api.semanticscholar.org/CorpusID:160025533

  26. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer (2023)

  27. Rosset, C.: Turing-nlg: A 17-billion-parameter language model by microsoft (2019)

  28. Schweter, S.: BERTurk - BERT models for Turkish, April 2020. https://doi.org/10.5281/zenodo.3770924

  29. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-lm: Training multi-billion parameter language models using model parallelism (2020)

  30. Soldaini, L., et al.: Dolma: an open corpus of three trillion tokens for language model pretraining research (2024)

  31. Souza, F., Nogueira, R., Lotufo, R.: BERT models for Brazilian Portuguese: pretraining, evaluation and tokenization analysis. Appl. Soft Comput. 149, 110901 (2023)

  32. Tas, N.: RoBERTurk: adjusting RoBERTa for Turkish (2024)

  33. Team, P.: Palm 2 technical report (2023)

  34. Team, Q.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  35. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA), Istanbul, Turkey, May 2012

  36. Touvron, H., et al.: Llama: open and efficient foundation language models (2023)

  37. Virtanen, A., et al.: Multilingual is not enough: BERT for Finnish (2019)

  38. Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: Calzolari, N., et al.: (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, May 2018. https://aclanthology.org/L18-1686

  39. Wang, B., Komatsuzaki, A.: GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, May 2021. https://github.com/kingoflolz/mesh-transformer-jax

  40. Zhang, S., et al.: Opt: Open pre-trained transformer language models (2022)

Author information

Corresponding author

Correspondence to Duygu Altinok.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article. The authors did not receive support from any organization for the submitted work.

A Data Samples

Below are two randomly selected, non-cherry-picked samples from each of Bella Turca’s collections, drawn from the train split. To fit the page limit, some samples, such as articles, have been cropped.

A.1 AcademiaCrawl

[Figure a: sample texts from AcademiaCrawl]

A.2 Books

[Figure b: sample texts from Books]

A.3 CleanOSCAR and CleanMC4

[Figure c: sample texts from CleanOSCAR and CleanMC4]

A.4 CraftedCrawl

[Figure d: sample texts from CraftedCrawl]

A.5 CustomerTrends

[Figure e: sample texts from CustomerTrends]

A.6 ForumChats

[Figure f: sample texts from ForumChats]

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Altinok, D. (2024). Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15048. Springer, Cham. https://doi.org/10.1007/978-3-031-70563-2_16

  • DOI: https://doi.org/10.1007/978-3-031-70563-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70562-5

  • Online ISBN: 978-3-031-70563-2

  • eBook Packages: Computer Science (R0)
