Generalists vs. Specialists: Evaluating Large Language Models for Urdu

Samee Arif
LUMS
23100088@lums.edu.pk \AndAbdul Hameed Azeemi
LUMS
abdul.azeemi@lums.edu.pk \ANDAgha Ali Raza
LUMS
agha.ali.raza@lums.edu.pk \AndAwais Athar
EMBL-EBI
awais@ebi.ac.uk

Abstract

In this paper, we compare general-purpose pretrained models, GPT-4-Turbo and Llama-3-8b-Instruct with special-purpose models fine-tuned on specific tasks, XLM-Roberta-large, mT5-large, and Llama-3-8b-Instruct. We focus on seven classification and six generation tasks to evaluate the performance of these models on Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural Language Processing (NLP). Despite the frequent advancements in Large Language Models (LLMs), their performance in low-resource languages, including Urdu, still needs to be explored. We also conduct a human evaluation for the generation tasks and compare the results with the evaluations performed by GPT-4-Turbo and Llama-3-8b-Instruct. We find that special-purpose models consistently outperform general-purpose models across various tasks. We also find that the evaluation done by GPT-4-Turbo for generation tasks aligns more closely with human evaluation compared to the evaluation by Llama-3-8b-Instruct. This paper contributes to the NLP community by providing insights into the effectiveness of general and specific-purpose LLMs for low-resource languages.

Samee Arif LUMS 23100088@lums.edu.pk Abdul Hameed Azeemi LUMS abdul.azeemi@lums.edu.pk

Agha Ali Raza LUMS agha.ali.raza@lums.edu.pk Awais Athar EMBL-EBI awais@ebi.ac.uk

1 Introduction

In recent years the introduction of LLMs including GPT (Brown et al., 2020; OpenAI, 2024) and Llama (Touvron et al., 2023a, b) has led to a significant advancement in NLP. However, expanding the reach of NLP to low-resource languages is crucial for advancing multilingual AI systems and promoting technological inclusivity. Urdu, with over 70 million native speakers, stands as a significant yet underserved language in the NLP domain (Blasi et al., 2022).

For this study, we classify LLMs into two distinct categories:

1.

Generalists: General-purpose models capable of performing a wide variety of tasks. We will use GPT-4-Turbo (abbreviated as GPT-4) trained on dataset up to Dec 2023 and Llama-3-8b-Instruct (abbreviated as Llama) as the generalist models.
2.

Specialists: Special-purpose models fine-tuned to perform specific tasks. We use XLM-Roberta-large (abbreviated as XLM-R) (Conneau et al., 2020), mT5-large (abbreviated as mT5) (Xue et al., 2021), and a fine-tuned version of Llama-3-8b-Instruct (abbreviated as Llama-FT) as the specialist models.

We present a comprehensive evaluation of generalist and specialist models for classification and generation tasks, exploring their strengths and limitations. Table 1 outlines the sub-tasks associated with both categories, illustrating the scope of our evaluation.

Classification	Generation
Sentiment Analysis	Question Answering
Abuse Detection	Summarization
Sarcasm Detection	Paraphrasing
Fake News Detection	Transliteration
Topic Classification	Translation (en-ur)
PoS Tagging	Translation (ur-en)
NER Tagging

Table 1: Sub-tasks for classification and generation.

In this paper, we aim to answer: (1) how each category of models performs on Urdu language tasks, and (2) which model type is more effective in practical applications for Urdu-speaking users. Specifically, we seek to determine if the added specialization of the specialist models translates into significant performance gains over the generalist models in the defined tasks.

Our contributions can be summarized as follows:

1.

We fine-tune XLM-R, mT5, and Llama on classification and generation tasks to optimize their performance for the defined tasks.
2.

We compare the performance of both generalist and specialist models on a smaller, controlled test set consisting of min(1000, len(dataset["test"])) samples, and provide a detailed comparison using various metrics. The exact size of each test set is given in Appendix D. Given the inherent challenges of working with low-resource languages like Urdu, our test set size reflects the current limitations in available data.
3.

We conduct a human evaluation of the outputs from the generation tasks and compare these results with the automated evaluations performed by GPT-4 and Llama.

The code, model outputs, and the human and LLM evaluations are publicly available on GitHub¹¹1https://github.com/ulrs0/Generalists-vs-Specialists.

2 Related Work

In recent years there has been a growing interest in the performance of LLMs across various languages. The MEGA benchmark (Ahuja et al., 2023) evaluates 16 NLP datasets across 70 languages. They compare the performance of BLOOMZ, GPT models, and State of the Art (SOTA) non-autoregressive models. MEGAVERSE Ahuja et al. (2024) builds on top of the MEGA benchmark and evaluates the non-English capabilities of GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma. IndicGenBench (Singh et al., 2024) evaluates LLMs on user-facing generation tasks across a set of 29 Indic languages. They perform evaluation on both proprietary and open-source LLMs including GP-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and Llama. They cover the following tasks: cross-lingual summarization, machine translation, and cross-lingual question-answering. Zhao et al., 2024 conduct an evaluation of Llama’s response quality based on LLM-Eval (Zhang et al., 2023), a benchmark comprising instruction tasks from 17 categories. Mujadia et al., 2024 presents a comprehensive translation evaluation using Llama for English and Indian languages. They found that LLM-based evaluator achieves a comparable score with human judgement.

Khondaker et al., 2023a presents a in-depth evaluaton of BLOOMZ, ChatGPT and specialist models AraT5 and MARBERTv2. They evaluated these models for Arabic on 44 distinct language understanding and generation tasks. They also present a comparison between the human evaluation and GPT-4 evaluation. Additionally, they observed that the English prompt works better than the Arabic prompt. Abdelali et al., 2024 provides a benchmark of LLMs against SOTA models for Arabic NLP. They evaluated GPT-3.5-Turbo, GPT-4, BLOOMZ, and Jais-13b-Chat across 33 tasks using 61 publicly available datasets, resulting in 330+ experimental setups. They observed that SOTA models generally outperform LLMs in zero-shot learning, larger models with few-shot learning techniques significantly reduce the performance gaps.

Existing multilingual NLP benchmarks often lack extensive language-specific evaluation and comparison against task-specific models. Tahir et al., 2024 address this gap by evaluating GPT-3.5-Turbo, Llama2-7B-Chat, and Bloomz (3B and 7B1) across 14 tasks using 15 Urdu datasets in a zero-shot setting, comparing their performance against task-specific models. Their findings reveal that task-specific models (Support Vector Machine (SVM), Decision Tree, and m-BERT, etc) generally outperform the mentioned LLMs in Urdu NLP tasks with zero-shot learning. However, they do not perform few-shot prompting, chain-of-thought prompting, or LLM fine-tuning.

3 Datasets

3.1 Classification

Sentiment Analysis. We use the Urdu IMDB sentiment analysis dataset²²2https://github.com/urduhack/resources/releases/tag/imdb_urdu_reviews_v1.0.0, which is a translated version of the original IMDB dataset (Maas et al., 2011). It is translated using Google Translator, and comprises 50,000 movie reviews.

Abuse Detection. We use the dataset by Akhter et al., 2020 which has 2,171 entries and the dataset by Amjad et al., 2022 which consists of 3,502 entries.

Sarcasm Detection. For this task we use Urdu Sarcastic Tweets Dataset (Khan and Najeeb, 2023) which consists of 19,955 tagged tweets.

Fake News Detection. We use the fake news dataset by Amjad et al., 2020 which is a mixture of real and translated data. It comprises 1,300 labeled news articles.

Topic Classification. For this task, we use a dataset³³3https://github.com/mwaseemrandhawa/Urdu-News-Headline-Dataset consisting of 137,161 news headlines categorized into the following topics: business, entertainment, health, politics, science, sports, world and other.

Part-of-Speech Tagging. For this task, we use the Universal Dependencies (Nivre et al., 2020) dataset, which consists of 5,130 sentences, annotated with a part-of-speech tag for every word. For GPT-4 and Llama, we wrap the word for which we want to predict the part-of-speech tag in <hl> tags. The structure of the data is given in Figure 1.

Refer to caption — Figure 1: PoS Data Structure for Llama

Named Entity Recognition. We use the dataset⁴⁴4https://huggingface.co/datasets/mirfan899/urdu-ner available on Hugging Face which consists of 33,748 sentences, annotated with a NER tag for every word. For GPT-4 and Llama, we applied the same data structuring method as done in the part-of-speech tagging task.

3.2 Generation

Given our resource constraints, we could not set the maximum sequence length for model training to the longest entries in each dataset. Instead, we filter the datasets to include only those entries that fell within a manageable maximum length. This approach allowed us to optimize the training process and ensure efficient use of computational resources while still maintaining a representative sample of the data.

Question-Answering. We use three datasets for question-answering: (1) UQA (Arif et al., 2024), consisting of 88,829 answerable questions; (2) UQuAD⁵⁵5https://github.com/ahsanfarooqui/UQuAD---Urdu-Question-Answer-Dataset/tree/main, containing 139 questions; and (3) Wiki-UQA⁶⁶6https://huggingface.co/datasets/uqa/Wiki-UQA, a manually generated dataset from Wikipedia articles, comprising 210 questions.

Summarization. For this task, we use the XSUMUrdu (Munaf et al., 2023) dataset, selecting a subset of 76,626 entries based on the maximum length used during model training.

Paraphrasing. For paraphrasing we use the dataset⁷⁷7https://huggingface.co/datasets/mwz/ur_para available on Hugging Face. We select 387,004 entries from the dataset based on the maximum length used while training the models.

Transliteration. We use the Dakshina dataset (Roark et al., 2020) from which we select 11,464 sentences based on the maximum length used while training the models.

Translation. We use OPUS-100 (Zhang et al., 2020) (Tiedemann, 2012) for English-to-Urdu and Urdu-to-English translation. We select 755,526 sentences from this dataset based on the maximum length used while training the models.

4 Methodology

4.1 Experimental Design

We utilize a controlled test set consisting of min(1000, len(dataset["test"])) samples for each task, ensuring that the evaluation is both comprehensive and cost-efficient. We ensure that the test dataset is balanced by having an equal representation of different classes within each task. We use Macro- $F_{1}$ Score to evaluate all the classification tasks, SQuAD (Rajpurkar et al., 2018) Rajpurkar et al. (2016) $F_{1}$ to evaluate question-answering, SacreBLEU (Post, 2018) for paraphrasing, transliteration and translation and ROUGE-L (with word-level tokenization) Lin (2004) for summarization.

We fine-tune Llama and XLM-R for each classification task separately, ensuring that each model is specifically optimized for its respective task. Similarly, for generation tasks, we fine-tune the mT5 and Llama separately for each task. For the mT5 models, we use a learning rate of $5e^{-5}$ . The Llama models are fine-tuned with a learning rate of $2e^{-4}$ for both generation and classification tasks. In the case of XLM-R, we use a learning rate of $5e^{-6}$ . We use LoRA (Hu et al., 2021) to fine-tune int4 quantized Llama. The batch size, number of epochs each model is trained for and LoRA config for Llama is given in Appendix E. After fine-tuning all the models we perform evaluation on the test dataset.

To evaluate the performance of the generalist models, we design a series of experiments. We use GPT-4-Turbo and Llama-3-8b-Instruct as our generalist models. GPT-4-Turbo is chosen due its top performance on the LMSYS chatbot arena (Chiang et al., 2024) on of March 1st, 2024 when we started our research. Our experimental setup is as follows: For GPT-4 We conduct experiments under zero-shot, three-shot, and six-shot settings for both generation and classification tasks. The examples for the three-shot and six-shot scenarios are selected from the training dataset of each task. Specifically, the examples are carefully selected by a human expert to ensure a representative and balanced sample, avoiding any unintended bias in the selection process. We also conduct experiments with chain of thought (CoT) reasoning for classification tasks in the six-shot setting. We create CoT reasoning for the six selected examples from the training dataset. These examples along with their CoT is given as a few-shot prompt to the model. Figure 2 presents an example of generated CoT reasoning. For all the evaluations, the temperature and nucleus sampling for GPT-4 is set to 1.0, which is the default for the GPT API.

For Llama, we only perform zero-shot evaluations since it is an instruct model (not fine-tuned for chat). This helps us establish a baseline performance for the generalist models and aids in comparison with the fine-tuned specialist Llama-FT. For Llama, the temperature is set to 0.6 and nucleus sampling is set to 0.9, i.e., the default values in the original code-base for Llama⁸⁸8https://github.com/meta-llama/llama3/blob/main/example_chat_completion.py.

4.2 Prompt Design

Khondaker et al., 2023b observe in their ChatGPT evaluation for Arabic that an English prompt performs better than the Arabic one. Therefore, following this study, we decide to use English prompts for our evaluations of Urdu tasks as well. In the prompt templates, the following placeholders are used:

1.

ROLE: It specifies the persona of the LLM. For example, it could be sentiment classifier, abuse detector, sarcasm detector, etc.
2.

TASK DESCRIPTION: It provides a brief description of what the task is and what the model is expected to do.
3.

LABEL LIST: It lists the possible labels the model can assign to the input text. For example, it could be [’positive’, ’negative’] for sentiment analysis.

Figure 3 shows the prompt template for the classification task without CoT, and Figure 4 shows an example prompt for it. Figure 5 shows the CoT prompt template for the classification task. Figure 6 presents the prompt template for the generation task, and Figure 7 provides an example of it. Appendix G contains all the prompts for classification and generation tasks.

5 Evaluation and Discussion

5.1 Classification

	GPT-4-Turbo				Llama-3-8b-Instruct	XLM-R-large	Llama-3-8b-Instruct-FT
Task	0	3	6	CoT
Sentiment Analysis	90.98	91.17	94.90	94.60	87.88	92.90	95.30
Abuse Detection	86.27	89.01	88.71	87.62	44.73	90.92	89.15
Sarcasm Detection	58.17	69.18	66.56	65.47	44.94	84.37	81.48
Fake News Detection	78.88	75.95	78.14	76.45	66.36	84.99	72.14
Topic Classification	76.05	73.54	74.57	70.43	54.08	84.49	84.74
PoS Tagging	53.31	51.51	54.17	54.61	25.80	65.41	67.55
NER Tagging	61.96	62.18	62.98	63.95	42.13	70.41	90.90

Table 2: We report Macro-

F_{1}

score for each classification task. GPT-4-Turbo is evaluated in 0-shot, 3-shot, 6-shot and 6-shot with chain-of-thought settings. In Llama-3-8b-FT, FT stands for fine-tuned.

We present the evaluation of the generalists (GPT-4, Llama) and the specialists (Llama-FT, XLM-R) for the classification tasks in Table 2. We observe that Llama-FT (fine-tuned) model achieves the highest scores in four tasks (sentiment analysis, topic classification, PoS tagging, and NER tagging), while XLM-R outperforms other models in three tasks (abuse detection, sarcasm detection, and fake news detection). GPT-4 does not perform better than Llama-FT and XLM-R. For certain tasks like NER and PoS tagging, chain-of-thought (CoT) reasoning leads to a better Macro- $F_{1}$ score. We now discuss the performance on each classification task in detail.

Sentiment Analysis. For sentiment analysis, Llama-FT achieves the highest score with a Macro- $F_{1}$ of 95.30, outperforming XLM-R and GPT-4-Turbo. GPT-4’s performance improves with more shots, reaching a Macro- $F_{1}$ of 94.90 with 6-shot learning. The CoT setting provides a slight increase to 94.60. The small difference in the Macro- $F_{1}$ between Llama-FT and GPT-4 indicates the effectiveness of GPT-4 for Urdu sentiment analysis without requiring task-specific fine-tuning.

NER Tagging. For NER tagging, Llama-FT achieves the highest score with a Macro- $F_{1}$ of 90.90, significantly outperforming XLM-R and GPT-4. The best score for GPT-4 is 63.95 with CoT reasoning. We uncover the differences in accuracy for various entities for GPT-4 and Llama-FT in Figure 8. We observe a drop in the recognition accuracy of Person entities by GPT-4, with 95 correctly recognized compared to 139 by Llama-FT. We notice a similar trend for Organization entities, with 70 classified correctly compared to 134 by Llama-FT. This suggests that NER fine-tuning on a specialized Urdu dataset improves the model’s ability to recognize Person and Organization entities in Urdu text.

Abuse Detection. In the task of abuse detection, XLM-R leads with a Macro- $F_{1}$ of 90.92, followed closely by Llama-3-8b-FT at 89.15. GPT-4’s performance peaks at 89.01 with 3-shot learning and slightly decreases with CoT reasoning.

Sarcasm Detection. XLM-R performs best in sarcasm detection with a Macro- $F_{1}$ score of 84.37. Llama-FT also shows strong performance with 81.48. GPT-4’s best performance is observed at 69.18 with 3-shot learning. The large difference between the Macro- $F_{1}$ scores of generalist and specialist models indicates that sarcasm detection in Urdu can be challenging without task-specific fine-tuning.

Fake News Detection. For fake news detection, XLM-R achieves the highest score with a Macro- $F_{1}$ of 84.99, while Llama-FT follows with 72.14. GPT-4’s performance is relatively consistent across different shot settings, with its highest score being 78.88 in the 0-shot setting. The large margin by XLM-R indicates its effectiveness in discerning fake news.

Topic Classification. Llama-FT excels in topic classification with a Macro- $F_{1}$ score of 84.74, significantly higher than the other models. XLM-R also performs well with 84.49.

PoS Tagging. In PoS tagging, Llama-FT leads with a Macro- $F_{1}$ of 67.55, followed by XLM-R at 65.41. GPT-4’s highest score is 53.31 in the 0-shot setting. To understand the significant difference in Macro- $F_{1}$ between GPT-4 and Llama-FT, we study the individual class performance (Figure 9). We find that GPT-4 struggles with correctly tagging proper nouns, pronouns, and auxiliaries, while Llama-FT is able to identify most of them correctly, suggesting that task-specific fine-tuning improves tagging performance for these parts of speech.

	Metric	GPT-4-Turbo			Llama-3-8b-Instruct	mT5-large	Llama-3-8b-Instruct-FT
Task		0	3	6
Question-Answering	SQuAD- $F_{1}$	66.28	72.55	72.78	69.89	69.66	70.42
Summarization	ROUGE-L	22.54	23.35	23.34	25.76	30.72	31.01
Paraphrasing	SacreBLEU	3.59	3.92	4.01	2.42	11.98	10.17
Transliteration	SacreBLEU	30.93	32.38	32.09	12.37	40.23	37.95
Translation (en-ur)	SacreBLEU	12.07	12.62	11.59	5.14	18.35	14.63
Translation (ur-en)	SacreBLEU	16.29	18.05	19.18	7.83	21.55	29.18

Table 3: We report SQuAD-

F_{1}

, ROUGE-L or SacreBLEU depending on the generation tasks. GPT-4 is evaluated in 0-shot, 3-shot and 6-shot setting.

5.2 Generation

We present the evaluation of the generalists and the specialists for the generation tasks in Table 3. We observe that the fine-tuned Llama-FT model achieves the highest scores in several tasks, notably in summarization and translation from Urdu to English. GPT-4 shows a consistent performance with its best results often appearing in the 6-shot setting. The mT5 model also demonstrates strong performance in tasks such as transliteration and paraphrasing, benefiting from its extensive multilingual training on translation tasks. We now discuss the performance on each generation task in detail.

Question-Answering. For question-answering, GPT-4 achieves the highest score with a SQuAD-F1 of 72.78 in the 6-shot setting. Llama-3-8b-FT closely follows with a score of 70.42, indicating the effectiveness of its fine-tuning. mT5 also performs well with a score of 69.66. The consistent improvement of GPT-4 with increasing shots suggests its capability to use more context effectively.

Summarization. In the summarization task, Llama-FT achieves the highest ROUGE-L score of 31.01, significantly outperforming Llama-3-8b and GPT-4. mT5 also shows strong performance with a score of 30.72. GPT-4’s best performance is in the 3-shot setting with a score of 23.35. The fine-tuning of Llama contributes to its superior performance in capturing and summarizing Urdu content effectively.

Paraphrasing. For paraphrasing, mT5 achieves the highest SacreBLEU score of 11.98. This indicates the advantage mT5 has due to its massive multilingual pre-training. Llama-FT follows with a SacreBLEU score of 10.17. GPT-4’s best score is 4.01 in the 6-shot setting.

Transliteration. In the task of transliteration, mT5 leads with a SacreBLEU score of 40.23, followed by Llama-FT at 37.95. GPT-4’s performance peaks at 32.38 with 3-shot learning. Figure 10 shows the words with the highest mismatches in the transliterated text. To count these mismatches, we first tokenize the transliterated sentences and then count the instances where the predicted word differs from the ground truth. Smaller words such as “ye,” “ke,” “aik,” and “wo” pose challenges for GPT-4, resulting in higher mismatch counts. In contrast, mT5 demonstrates lower mismatches for these words.

Translation (en-ur). For English to Urdu translation, mT5 achieves the highest SacreBLEU score of 18.35, indicating its proficiency in translating English to Urdu. Llama-FT follows with a score of 14.63. GPT-4’s performance is consistent, with its highest score being 12.62 in the 3-shot setting. The superior performance of mT5 is likely due to the inclusion of high-quality Urdu data in the multilingual C4 corpus used for its pre-training Xue et al. (2021).

Translation (ur-en). In Urdu to English translation, Llama-FT excels with a SacreBLEU score of 29.18, outperforming other models. Surprisingly, contrary to the results in en-ur translation, mT5 shows a lower SacreBLEU of 21.55 compared to Llama-FT. GPT-4’s highest score is 19.18 in the 6-shot setting.

5.3 Human Evaluators vs. LLMs

We compare human evaluation and LLM-based evaluation for summarization, paraphrasing, transliteration, English to Urdu translation, and Urdu to English translation. We select a subset of size 50 (as done by Khondaker et al., 2023a for Arabic) from the test dataset of each task. Since GPT-4 produces three outputs (0-shot, 3-shot, and 6-shot), we select the one with the highest SacreBLEU or ROUGE-L score. For human evaluation, two annotators (native Urdu speakers) are presented with anonymized outputs of Llama, Llama-FT, mT5 and the selected GPT-4. They are asked to rank them from one to four based on the criteria in Appendix G, allowing multiple models to have the same rank. We prompt GPT-4 and Llama with the same criteria, to assign a score to the outputs of the mentioned tasks, and then ranking is done based on this score. We compare the human ranking, GPT-4 ranking, and Llama ranking using Krippendorff’s alpha to determine the inter-rater reliability presented in Table 4.

Task	A	B	C
Summarization	0.684	0.504	0.502
Paraphrasing	0.710	0.592	0.471
Transliteration	0.694	0.510	0.253
Translation (en-ur)	0.728	0.592	0.392
Translation (ur-en)	0.730	0.474	0.307

Table 4: Column A shows the Krippendorff’s alpha between annotator 1 and annotator 2, B shows the alpha between annotator 1, annotator 2 and GPT-4, and C shows the alpha between annotator 1, annotator 2 and Llama.

For each task, the Krippendorff’s alpha value for human evaluation exceeds 0.67, which, according to Krippendorff’s interpretation is sufficient for a tentative conclusion to be drawn. Table 4 illustrates that the inclusion of GPT-4’s evaluation or Llama’s evaluation significantly reduces the alpha values meaning that the annotations done by GTP-4 and Llama have a low degree of agreement with the human annotators. However there is a higher agreement between the GPT-4 rankings and human rankings as compared to the agreement between Llama rankings and human rankings. We also observe that according to human evaluation GPT-4 outperforms the other models for all the specified tasks. Figure 11 presents the number of times each rank was assigned to GPT-4 for Urdu to English translation. Appendix F contains presents the rank counts for the other tasks.

6 Conclusion

In this paper, we present a comprehensive evaluation of generalist models and specialist models on 7 classification tasks and 6 generation tasks for Urdu NLP. Our evaluation covers prompting techniques such as few-shot, chain-of-thought reasoning as well as the fine-tuning of LLMs. We found that specialist models quantitatively outperformed generalist models on 12 out of the 13 tasks. The results highlight the importance of fine-tuning models to achieve higher performance in domain-specific applications in a low-resource setting. However, generalist models, such as GPT-4, showcased better performance in the human evaluation of the generation tasks, highlighting the importance of qualitative evaluation in accurately assessing model performance. We also performed a LLM based evaluation of the outputs of the models for the generation tasks. The low agreement between the rankings done by LLMs and human rankings shows that LLMS struggle when it comes to low-resource language understanding.

One avenue for future research is to explore other strong general-purpose models (e.g., GPT-4o, Claude Opus) and expand the scope of the evaluation to more Urdu NLP tasks. Additionally, using RAG to find examples from training dataset for few-shot prompting would be an interesting experiment to enhance the performance of generalist models. In conclusion, while specialist models currently hold an edge in task-specific performance, generalist models’ adaptability remains valuable, and continuous advancements in LLMs promise further improvements for low-resource language processing.

7 Limitations

While our study provides valuable insights into the performance of generalist and specialist models for Urdu NLP tasks, it is important to acknowledge several limitations. The question-answering and sentiment analysis datasets used for training the specialist models are translated from English to Urdu. This translation process can introduce inaccuracies that may affect the models’ performance. The evaluation is conducted on a subset of 1000 data points for each task. While this size is manageable and allows for a cost-efficient evaluation, it may not be fully representative of the model’s performance. Although we use multiple annotators and calculate inter-rater reliability using Krippendorff’s alpha, there is still a degree of subjectivity that may influence the results.

8 Ethical Impact

In this paper we present a comprehensive evaluation of LLMs with the aim to enhance the accessibility of NLP applications for Urdu speakers. This has significant ethical implications, as it addresses the digital divide and promotes linguistic diversity in technology. Our findings indicate that specialist models perform better than the generalist models in most Urdu NLP tasks. Consequently, our work may inspire researchers to develop more resources for the Urdu language, including models and datasets.

The potential risks associated with the usage of LLMs include the amplification of existing biases present in their training data, which may lead to unfair and discriminatory outcomes (Ye et al., 2023). A comprehensive fairness evaluation of these models must be conducted before they are deployed for public use.

References

Abdelali et al. (2024) Ahmed Abdelali, Hamdy Mubarak, Shammur Absar Chowdhury, Maram Hasanain, Basel Mousi, Sabri Boughorbel, Yassine El Kheir, Daniel Izham, Fahim Dalvi, Majd Hawasly, Nizi Nazar, Yousseif Elshahawy, Ahmed Ali, Nadir Durrani, Natasa Milic-Frayling, and Firoj Alam. 2024. Larabench: Benchmarking arabic ai with large language models. Preprint, arXiv:2305.14982.
Ahuja et al. (2023) Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, and Sunayana Sitaram. 2023. Mega: Multilingual evaluation of generative ai. Preprint, arXiv:2303.12528.
Ahuja et al. (2024) Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, and Sunayana Sitaram. 2024. Megaverse: Benchmarking large language models across languages, modalities, models and tasks. Preprint, arXiv:2311.07463.
Akhter et al. (2020) Muhammad Pervez Akhter, Zheng Jiangbin, Irfan Raza Naqvi, Mohammed Abdelmajeed, and Muhammad Tariq Sadiq. 2020. Automatic detection of offensive language for urdu and roman urdu. IEEE Access, 8:91213–91226.
Amjad et al. (2020) Maaz Amjad, Grigori Sidorov, and Alisa Zhila. 2020. Data augmentation using machine translation for fake news detection in the Urdu language. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2537–2542, Marseille, France. European Language Resources Association.
Amjad et al. (2022) Maaz Amjad, Alisa Zhila, Grigori Sidorov, Andrey Labunets, Sabur Butta, Hamza Imam Amjad, Oxana Vitman, and Alexander Gelbukh. 2022. Overview of abusive and threatening language detection in urdu at fire 2021. Preprint, arXiv:2207.06710.
Arif et al. (2024) Samee Arif, Sualeha Farid, Awais Athar, and Agha Ali Raza. 2024. UQA: Corpus for Urdu question answering. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17237–17244, Torino, Italia. ELRA and ICCL.
Blasi et al. (2022) Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland. Association for Computational Linguistics.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Preprint, arXiv:2005.14165.
Chiang et al. (2024) Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. 2024. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132.
Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. Preprint, arXiv:2106.09685.
Khan and Najeeb (2023) Shumaila Khan and Fahad Najeeb. 2023. Urdu sarcastic tweets dataset.
Khondaker et al. (2023a) Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, and Muhammad Abdul-Mageed. 2023a. Gptaraeval: A comprehensive evaluation of chatgpt on arabic nlp. Preprint, arXiv:2305.14976.
Khondaker et al. (2023b) Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, and Muhammad Abdul-Mageed. 2023b. GPTAraEval: A comprehensive evaluation of ChatGPT on Arabic NLP. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 220–247, Singapore. Association for Computational Linguistics.
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
Mujadia et al. (2024) Vandan Mujadia, Pruthwik Mishra, Arafat Ahsan, and Dipti Misra Sharma. 2024. Towards large language model driven reference-less translation evaluation for english and indian languages. Preprint, arXiv:2404.02512.
Munaf et al. (2023) Mubashir Munaf, Hammad Afzal, Naima Iltaf, and Khawir Mahmood. 2023. Low resource summarization using pre-trained language models. Preprint, arXiv:2310.02790.
Nivre et al. (2020) Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.
OpenAI (2024) OpenAI. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
Post (2018) Matt Post. 2018. A call for clarity in reporting bleu scores. Preprint, arXiv:1804.08771.
Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Roark et al. (2020) Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Işin Demirşahin, and Keith Hall. 2020. Processing South Asian languages written in the Latin script: the Dakshina dataset. In Proceedings of The 12th Language Resources and Evaluation Conference (LREC), pages 2413–2423.
Singh et al. (2024) Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, and Partha Talukdar. 2024. Indicgenbench: A multilingual benchmark to evaluate generation capabilities of llms on indic languages. Preprint, arXiv:2404.16816.
Tahir et al. (2024) Munief Hassan Tahir, Sana Shams, Layba Fiaz, Farah Adeeba, and Sarmad Hussain. 2024. Benchmarking pre-trained large language models’ potential across urdu nlp tasks. Preprint, arXiv:2405.15453.
Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. Preprint, arXiv:2302.13971.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Ye et al. (2023) Wentao Ye, Mingfeng Ou, Tianyi Li, Yipeng chen, Xuetao Ma, Yifan Yanggong, Sai Wu, Jie Fu, Gang Chen, Haobo Wang, and Junbo Zhao. 2023. Assessing hidden risks of llms: An empirical study on robustness, consistency, and credibility. Preprint, arXiv:2305.10235.
Zhang et al. (2020) Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.
Zhang et al. (2023) Yue Zhang, Ming Zhang, Haipeng Yuan, Shichun Liu, Yongyao Shi, Tao Gui, Qi Zhang, and Xuanjing Huang. 2023. Llmeval: A preliminary study on how to evaluate large language models. Preprint, arXiv:2312.07398.
Zhao et al. (2024) Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024. Llama beyond english: An empirical study on language capability transfer. Preprint, arXiv:2401.01055.

Appendix A Implementation Details

A.1 Models

XLM-Roberta-large is availabe on Hugging Face⁹⁹9https://huggingface.co/FacebookAI/xlm-roberta-large under MIT license. mT5-large is available on Hugging Face¹⁰¹⁰10https://huggingface.co/google/mt5-large under Apache-2.0 license. Llama-3-8b-Instruct is available on Hugging Face¹¹¹¹11https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct under llama3 license. GPT-4 is available under proprietary licence. All models used in this paper comply with their respective license.

A.2 Datasets

Urdu IMDB sentiment analysis dataset OBDL license. Urdu NER dataset is availabe under MIT license. The abuse detection dataset by Akhter et al., 2020 and by Amjad et al., 2022, Sarcastic Tweets Dataset (Khan and Najeeb, 2023), fake news dataset by Amjad et al., 2020, Topic Classification dataset, and Universal Dependencies (Nivre et al., 2020) are available under CC BY 4.0.

UQuAD dataset is under CC0-1.0 while UQA (Arif et al., 2024) is under CC BY 4.0. XSUMUrdu (Munaf et al., 2023) summarization dataset is also under CC BY 4.0 license. Paraphrasing dataset is under MIT license. Dakshina dataset (Roark et al., 2020) for transliteration is under CC BY-SA 4.0 and OPUS-100 (Zhang et al., 2020) (Tiedemann, 2012) for translation is under GPL-3.0 license.

All datasets used in this paper comply with their respective license.

Appendix B Model Size and Budget

We fine-tuned XLM-Roberta-large for classification tasks which has 550 million parameters. We fine-tuned mT5-large for generation tasks which has 1.2 billion parameters. Llama-3-8b-Instruct has 8 billion parameters and is fine-tuned for both generation and classification tasks.

Nvidia A100 80GB, Nvidia A100 40GB and Nvidia RTX 6000Ada 48GB were used for fine-tuning of the models. Infernece was done on Nvidia RTX 6000Ada 48GB and Nvidia RTX 4090 24GB. Total GPU time was approximately 200 hours.

Appendix C Human Annotators

There are two human annotators in this study: one is the author of this paper (Computer Science graduate), and the other is a research intern (Computer Science senior). Both annotators are native speakers of Urdu from Pakistan. The research intern was informed about how the data would be used for the evaluation of LLMs for Urdu.

Appendix D Dataset Size

This section provides details about the datasets used for the classification and generation tasks evaluated in this study. The test dataset sizes for each task are summarized in the tables below.

D.1 Classification

Table 5 shows the test dataset sizes for classification tasks including Sentiment Analysis, Abuse Detection, Sarcasm Detection, Fake News Detection, Topic Classification, Part-of-Speech (PoS) Tagging, and Named Entity Recognition (NER) Tagging.

Task	Test Size
Sentiment Analysis	1000
Abuse Detection	567
Sarcasm Detection	1000
Fake News Detection	130
Topic Classification	1000
PoS Tagging	1000
NER Tagging	1000

Table 5: Test dataset size for each classification task

D.2 Generation

Table 6 shows the test dataset sizes for generation tasks including Summarization, Paraphrasing, Transliteration, English to Urdu Translation, and Urdu to English Translation.

Task	Test Size
Question Answering	1000
Summarization	568
Paraphrasing	1000
Transliteration	1000
Translation (en-ur)	1000
Translation (ur-en)	1000

Table 6: Test dataset size for each generation task

Appendix E Reproducibility and Hyperparameter

The table 7 presents the Lora configuration that we use for fine-tuning Llama.

Rank	16
Alpha	16
Target Modules	q_proj, k_proj, v_proj, o_proj, down_proj, up_proj, gate_proj

Table 7: Lora config for Llama fine-tuning

The table 8 presents the training parameters, including the number of epochs and batch sizes, we use for fine-tuning XLM-R, mT5, and Llama.

Table 8: Training Parameters for Different Models. E represents number of epochs and B represents the batch size.

	XLM-R		mT5		Llama
Task	E	B	E	B	E	B
Sentiment Analysis	1	16	-	-	4	1
Abuse Detection	4	16	-	-	3	8
Sarcasm Detection	4	16	-	-	3	8
Fake News Detection	6	8	-	-	3	1
Topic Classification	4	32	-	-	3	8
POS Tagging	7	16	-	-	1	4
NER Tagging	10	16	-	-	3	4
Question Answering	-	-	4	8	1	1
Summarization	-	-	1	4	2	1
Paraphrasing	-	-	3	3	4	4
Transliteration	-	-	5	8	2	4
Translation-en-ur	-	-	2	8	1	8
Translation	-	-	3	8	1	8

Appendix F Human Evaluation

For the summarization task, GPT was assigned a rank one by annotator one 41 times and by annotator two 34 times as shown in figure 12. In the paraphrasing task, figure 13 shows that GPT was chosen as the best model 42 times by annotator one and 39 times by annotator two. For transliteration, GPT was selected as the best model 40 times by annotator one and 38 times by annotator two as shown in figure 14. In the English to Urdu translation task, figure 15 shows that GPT was the top model 46 times for both annotators. For Urdu to English translation, GPT was rated as the best model 40 times by annotator one and 45 times by annotator two as presented in figure 11.

For the summarization task, annotator one assigned Llama a rank of one 9 times, while annotator two assigned it 17 times. In the paraphrasing task, annotator one selected Llama as the best model 12 times, and annotator two chose it 15 times. For transliteration, Llama was selected as the best model 5 times by annotator one and 6 times by annotator two. In the English to Urdu translation task, annotator one ranked Llama as the top model 3 times, and annotator two ranked it 1 time. For the Urdu to English translation task, annotator one assigned Llama rank one 12 times, while annotator two assigned it 10 times.

For the summarization task, annotator one assigned Llama-FT a rank of one 7 times, while annotator two assigned it 12 times. In the paraphrasing task, annotator one selected Llama-FT as the best model 4 times, and annotator two chose it 3 times. For transliteration, Llama-FT was selected as the best model 13 times by annotator one and 21 times by annotator two. In the English to Urdu translation task, annotator one ranked Llama-FT as the top model 7 times and annotator two ranked it 10 times. For the Urdu to English translation task, annotator one favored Llama-FT 14 times, while annotator two favored it 15 times.

For the summarization task, mT5 was assigned a rank one by annotator one 4 times and by annotator two 11 times. In the paraphrasing task, mT5 was picked as the best model 4 times by both annotators. For transliteration, mT5 was selected as the best model 19 times by annotator one and 21 times by annotator two. In the English to Urdu translation task, mT5 was chosen as the top model 13 times by annotator one and 10 times by annotator two. For the Urdu to English translation task, mT5 was rated as the best model 18 times by annotator one and 15 times by annotator two. Figure 16 presents a detailed comparison of rank counts for GPT-4, Llama, Llama-FT and mT5 in human evaluation and LLM evaluation of various NLP tasks.

Appendix G Prompts and Evaluation Criteria

Table LABEL:fig:classprompts contains the CoT prompts used for the classification tasks. The prompts for classification without CoT are the same, except that the reasoning is not generated, and only the label is required in the output. Table LABEL:fig:genprompts contains the prompts used for the generation tasks. Table LABEL:fig:eval-criteria presents the criteria given to human evaluators, GPT-4 and Llama for assessing the quality of outputs for summarization, paraphrasing, transliteration, English to Urdu translation, and Urdu to English translation. For each evaluation, the LLMs are asked to reason the answer to improve the scoring of the outputs.

Table 9: The table presents the CoT prompts used for classification tasks.

Classification Task	CoT Prompt
Sentiment Analysis	You are an Urdu sentiment classifier. The input text should be labeled according to its sentiment. The label list is: [’positive’, ’negative’] Use chain of thought (cot) to reason your answer. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"cot": ..., "label": ...}
Abuse Detection	You are an Urdu abuse detector. The input text should be labeled according to whether it is abusive or not. The label list is: [’abusive’, ’not abusive’] Use chain of thought (cot) to reason your answer. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"cot": ..., "label": ...}
Sarcasm Detection	You are an Urdu sarcasm detector. The input text should be labeled according to whether it is sarcastic or not. The label list is: [’sarcastic’, ’not sarcastic’] Use chain of thought (cot) to reason your answer. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"cot": ..., "label": ...}
Fake News Detection	You are an Urdu fake news detector. The input text should be labeled according to whether it is fake news or not. The label list is: [’fake news’, ’not fake news’] Use chain of thought (cot) to reason your answer. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"cot": ..., "label": ...}
Topic Classification	You are an Urdu topic classifier. The input text should be assign a label from the label list: [’business’, ’entertainment’, ’health’, ’politics’, ’science’, ’sports’, ’world’, ’other’] Use chain of thought (cot) to reason your answer. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"cot": ..., "label": ...}
PoS Tagging	You are an Urdu part-of-speech tagger. The word wrapped in <hl> tag should be assigned a PoS tag from the label list: [’noun’, ’punctuation mark’, ’adposition’, ’number’, ’symbol’, ’subordinating conjunction’, ’adjective’, ’particle’, ’determiner’, ’coordinating conjunction’, ’proper noun’, ’pronoun’, ’other’, ’adverb’, ’interjection’, ’verb’, ’auxiliary verb’] Use chain of thought (cot) to reason your answer. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"cot": ..., "label": ...}
NER Tagging	You are an Urdu named entity recognizer. The word wrapped in <hl> tag should be assigned a NER tag from the label list: [’time’, ’person’, ’organization’, ’number’, ’location’, ’designation’, ’date’, ’other’] Use chain of thought (cot) to reason your answer. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"cot": ..., "label": ...}

Table 10: The table presents the prompts used for generation tasks.

Generation Task	Prompt
Summarization	You are a text summarizer. Your task is to summarize the given Pakistani Urdu text in 1 to 2 sentences. The summary should be in Urdu. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"summary": ...}
Paraphrasing	You are a text paraphraser. Your task is to paraphrase the given Pakistani Urdu text. The paraphrased text should be in Urdu. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"paraphrase": ...}
Transliteration	You are a machine transliterator. Your task is to transliterate the given Pakistani Urdu text to Roman Urdu. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"transliteration": ...}
Translation (en-ur)	You are a machine translator. Your task is to translate the given English text to Pakistani Urdu. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"translation": ...}
Translation (ur-en)	You are a machine translator. Your task is to translate the given Pakistani Urdu text to English. ALWAYS RETURN JSON OBJECT IN FOLLOWING FORMAT ONLY: {"translation": ...}

Table 11: The table presents the criteria used to evaluate the outputs of the generation task in the form of an LLM prompt.

Task	Prompt/Criteria
Summarization	You are a Pakistani Urdu language expert tasked with evaluating the quality of the summary produced by the summarization model. Score the given output with respect to the given input on a continuous score from 0 to 100 based on the following criteria: 1. Includes main points and key information from the original text 2. No grammatical errors 3. Conveys information in a brief manner 4. Gives correct information based on the original text Think step by step and use reasoning. ALWAYS RETURN JSON OBJECT IN THE FOLLOWING FORMAT ONLY: {"reasoning": ..., "score": ...}
Paraphrasing	You are a Pakistani Urdu language expert tasked with evaluating the quality of paraphrased text produced by the paraphrasing model. Score the given output with respect to the given input on a continuous score from 0 to 100 based on the following criteria: 1. Retains the original meaning and key ideas 2. No grammatical errors 3. Use of different words and phrases than the original text Think step by step and use reasoning. ALWAYS RETURN JSON OBJECT IN THE FOLLOWING FORMAT ONLY: {"reasoning": ..., "score": ...}
Transliteration	You are a Pakistani Urdu language expert tasked with evaluating the quality of transliterated text produced by the transliteration model. Score the given output with respect to the given input on a continuous score from 0 to 100 based on the following criteria: 1. Correctness of words (keeping in mind that different words can have different spelling) 2. Proper capitalization of words Think step by step and use reasoning. ALWAYS RETURN JSON OBJECT IN THE FOLLOWING FORMAT ONLY: {"reasoning": ..., "score": ...}
Translation (en-ur)	You are a Pakistani Urdu language expert tasked with evaluating the quality of translated text produced by the translation model. Score the given output with respect to the given input on a continuous score from 0 to 100 based on the following criteria: 1. Conveys the meaning of the original text without omissions or additions 2. No grammatical errors 3. Retains the style and tone of the original text Think step by step and use reasoning. ALWAYS RETURN JSON OBJECT IN THE FOLLOWING FORMAT ONLY: {"reasoning": ..., "score": ...}
Translation (ur-en)	You are a Pakistani Urdu language expert tasked with evaluating the quality of translated text produced by the translation model. Score the given output with respect to the given input on a continuous score from 0 to 100 based on the following criteria: 1. Conveys the meaning of the original text without omissions or additions 2. No grammatical errors 3. Retains the style and tone of the original text Think step by step and use reasoning. ALWAYS RETURN JSON OBJECT IN THE FOLLOWING FORMAT ONLY: {"reasoning": ..., "score": ...}