I am a final-year ELLIS PhD student at CIS at LMU Munich and LTL at the University of Cambridge, supervised by Prof. Hinrich Schütze and Prof. Anna Korhonen. I’ve had the opportunity to complete research internships at Google in Mountain View and Amazon in Madrid.

My research focuses on improving LLM capabilities through effective data utilization and synthetic dataset generation, with a particular emphasis on corpus-mining, counterfactuality, robustness, and multilinguality. Below are key questions and findings from my work:

Data repurposing. How to generate high-quality synthetic datasets with LLMs?

  1. Introduced reverse instructions to repurpose existing human-written texts for instruction tuning, improving long-form output quality.
  2. Developed MURI (Multilingual Reverse Instructions), creating instruction-tuning datasets for 200 languages by repurposing multilingual human-written corpora.
  3. Co-developed CRAFT, a method for generating task-specific synthetic datasets by retrieving and rewriting relevant documents from large-scale corpora, achieving results competitive with human-annotated datasets across various tasks.

Counterfactuality/Robustness. How to effectively create counterfactual datasets and improve model robustness/capabilities?

Multilinguality and Bias. Contributing to multilingual NLP and bias recognition.

  • Designed one of the first multilingual relation extraction datasets covering six languages.
  • Demonstrated significant differences in intrinsic bias toward nationalities among various monolingual models (e.g., Arabic, Turkish, German BERTs).
  • Analyzed gender-occupation bias in LLMs, linking it to pretraining data and examining the effects of instruction tuning and PPO/DPO on bias mitigation.
  • Turkish-specific contributions: As a Turkish researcher, I've contributed to various Turkish NLP resources, including TurkishMMLU, sentiment analysis, and dependency parsing, as well as several GitHub repositories [1, 2, 3], and have delivered educational talks. I also co-organized the first Turkic NLP workshop, SIGTURK, at ACL 2024.

News

October 2024: 4 papers accepted at EMNLP 2024: LongForm, TurkishMMLU, SynthEval, CovEReD.

September 2024: I am visiting the Language Technology Lab at the University of Cambridge.

August 2024: I attended ACL 2024 to co-organize the first Turkic NLP workshop, SIGTURK.

May 2024: I presented LongForm and Hallucination Augmented Recitations at the DPFM workshop at ICLR 2024.

May 2024: I attended LREC-COLING 2024 to present SilverAlign.

December 2023: I attended EMNLP 2023 to present MEAL and Language-Agnostic Bias Detection in Language Models.

June 2023: I will be in Mountain View for 3 months as a research intern at Google, focusing on attribution and counterfactuality in large language models.

Selected Publications

  1. MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
    Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
    Submitted to TACL. 2024. 💻 Code.
  2. CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation
    Ingo Ziegler*, Abdullatif Köksal*, Desmond Elliott, Hinrich Schütze
    Submitted to TACL. 2024. 💻 Code.
  3. LongForm: Effective Instruction Tuning with Reverse Instructions
    Abdullatif Köksal, Timo Schick, Anna Korhonen, Hinrich Schütze
    EMNLP Findings. 2024.
  4. Consistent Document-Level Relation Extraction via Counterfactuals
    Ali Modarressi, Abdullatif Köksal, Hinrich Schütze
    EMNLP Findings. 2024.
  5. TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish
    Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, Hinrich Schütze
    EMNLP Findings. 2024.
  6. SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
    Raoyuan Zhao, Abdullatif Köksal, Yihong Liu, Leonie Weissweiler, Anna Korhonen, Hinrich Schütze
    EMNLP Findings. 2024.
  7. Hallucination Augmented Recitations for Language Models
    Abdullatif Köksal, Renat Aksitov, Chung-Ching Chang
    Submitted to COLING. 2024.
  8. MEAL: Stable and Active Learning for Few-Shot Prompting
    Abdullatif Köksal, Timo Schick, Hinrich Schütze
    EMNLP Findings. 2023.
  9. Language-Agnostic Bias Detection in Language Models with Bias Probing
    Abdullatif Köksal, Omer F. Yalcin, Ahmet Akbiyik, M. Tahir Kilavuz, Anna Korhonen, Hinrich Schütze
    EMNLP Findings. 2023.
  10. The better your Syntax, the better your Semantics?
    Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, Hinrich Schütze
    EMNLP. 2022.
  11. Balancing Methods for Multilabel Text Classification with Long-Tailed Class Distribution
    Yi Huang, Buse Giledereli, Abdullatif Köksal, Arzucan Özgür, Elif Ozkirimli
    EMNLP. 2021.
  12. The RELX Dataset and Matching the Multilingual Blanks for Cross-lingual Relation Classification
    Abdullatif Köksal, Arzucan Özgür
    EMNLP Findings. 2020.