Search | arXiv e-print repository

ALLaM: Large Language Models for Arabic and English

Authors: M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani, Nouf M. Alotaibi, Hisham A. Alyahya, Sultan AlRashed, Faisal A. Mirza, Shaykhah Z. Alsubaie, Hassan A. Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman Alsubaihi, Maryam Al Mansour, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, Haidar Khan

Abstract: We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models.

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2302.06227 [pdf, other]

Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages

Authors: Sudhanshu Srivastava, Ishika Gupta, Anusha Prakash, Jom Kuriakose, Hema A. Murthy

Abstract: Hidden-Markov-model (HMM) based text-to-speech (HTS) offers flexibility in speaking styles along with fast training and synthesis while being computationally less intense. HTS performs well even in low-resource scenarios. The primary drawback is that the voice quality is poor compared to that of E2E systems. A hybrid approach combining HMM-based feature generation and neural-network-based HiFi-GAN vocoder to improve HTS synthesis quality is proposed. HTS is trained on high-resolution mel-spectrograms instead of conventional mel generalized coefficients (MGC), and the output mel-spectrogram corresponding to the input text is used in a HiFi-GAN vocoder trained on Indic languages, to produce naturalness that is equivalent to that of E2E systems, as evidenced from the DMOS and PC tests.

Submitted 13 February, 2023; originally announced February 2023.

Comments: 5 pages, 5 figures

arXiv:2212.11982 [pdf, other]

HMM-based data augmentation for E2E systems for building conversational speech synthesis systems

Authors: Ishika Gupta, Anusha Prakash, Jom Kuriakose, Hema A. Murthy

Abstract: This paper proposes an approach to build a high-quality text-to-speech (TTS) system for technical domains using data augmentation. An end-to-end (E2E) system is trained on hidden Markov model (HMM) based synthesized speech and further fine-tuned with studio-recorded TTS data to improve the timbre of the synthesized voice. The motivation behind the work is that issues of word skips and repetitions are usually absent in HMM systems due to their ability to model the duration distribution of phonemes accurately. Context-dependent pentaphone modeling, along with tree-based clustering and state-tying, takes care of unseen context and out-of-vocabulary words. A language model is also employed to reduce synthesis errors further. Subjective evaluations indicate that speech produced using the proposed system is superior to the baseline E2E synthesis approach in terms of intelligibility when combining complementing attributes from HMM and E2E frameworks. Further analysis highlights the proposed approach's efficacy in low-resource scenarios.

Submitted 22 December, 2022; originally announced December 2022.

Comments: 6 pages, 7 figures, 33 references

arXiv:2211.01338 [pdf, other]

Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

Authors: Anusha Prakash, Arun Kumar, Ashish Seth, Bhagyashree Mukherjee, Ishika Gupta, Jom Kuriakose, Jordan Fernandes, K V Vikram, Mano Ranjith Kumar M, Metilda Sagaya Mary, Mohammad Wajahat, Mohana N, Mudit Batra, Navina K, Nihal John George, Nithya Ravi, Pruthwik Mishra, Sudhanshu Srivastava, Vasista Sai Lodagala, Vandan Mujadia, Kada Sai Venkata Vineeth, Vrunda Sukhadia, Dipti Sharma, Hema Murthy, Pushpak Bhattacharya, et al. (2 additional authors not shown)

Abstract: Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, text-to-speech synthesis followed by isochronous lipsyncing to the original video. This task becomes challenging when the source and target languages belong to different language families, resulting in differences in generated audio duration. This is further compounded by the original speaker's rhythm, especially for extempore speech. This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically. A prototype is developed for dubbing lectures into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two languages, Hindi and Tamil, on two different courses. The output video is compared with the original video in terms of MOS (1-5) and lip synchronisation with scores of 4.09 and 3.74, respectively. The human effort also reduces by 75%.

Submitted 1 November, 2022; originally announced November 2022.

arXiv:2106.01400 [pdf, other]

Dual Script E2E framework for Multilingual and Code-Switching ASR

Authors: Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema Murthy

Abstract: India is home to multiple languages, and training automatic speech recognition (ASR) systems for languages is challenging. Over time, each language has adopted words from other languages, such as English, leading to code-mixing. Most Indian languages also have their own unique scripts, which poses a major limitation in training multilingual and code-switching ASR systems. Inspired by results in text-to-speech synthesis, in this work, we use an in-house rule-based phoneme-level common label set (CLS) representation to train multilingual and code-switching ASR for Indian languages. We propose two end-to-end (E2E) ASR systems. In the first system, the E2E model is trained on the CLS representation, and we use a novel data-driven back-end to recover the native language script. In the second system, we propose a modification to the E2E model, wherein the CLS representation and the native language characters are used simultaneously for training. We show our results on the multilingual and code-switching tasks of the Indic ASR Challenge 2021. Our best results achieve 6% and 5% improvement (approx) in word error rate over the baseline system for the multilingual and code-switching tasks, respectively, on the challenge development data.

Submitted 2 June, 2021; originally announced June 2021.

Comments: Accepted for publication at Interspeech 2021

arXiv:2103.03215 [pdf, other]

Front-end Diarization for Percussion Separation in Taniavartanam of Carnatic Music Concerts

Authors: Nauman Dawalatabad, Jilt Sebastian, Jom Kuriakose, C. Chandra Sekhar, Shrikanth Narayanan, Hema A. Murthy

Abstract: Instrument separation in an ensemble is a challenging task. In this work, we address the problem of separating the percussive voices in the taniavartanam segments of Carnatic music. In taniavartanam, a number of percussive instruments play together or in tandem. Separation of instruments in regions where only one percussion is present leads to interference and artifacts at the output, as source separation algorithms assume the presence of multiple percussive voices throughout the audio segment. We prevent this by first subjecting the taniavartanam to diarization. This process results in homogeneous clusters consisting of segments of either a single voice or multiple voices. A cluster of segments with multiple voices is identified using the Gaussian mixture model (GMM), which is then subjected to source separation. A deep recurrent neural network (DRNN) based approach is used to separate the multiple instrument segments. The effectiveness of the proposed system is evaluated on a standard Carnatic music dataset. The proposed approach provides close-to-oracle performance for non-overlapping segments and a significant improvement over traditional separation schemes.

Submitted 4 March, 2021; originally announced March 2021.

arXiv:1807.05962 [pdf, other]

Theme-weighted Ranking of Keywords from Text Documents using Phrase Embeddings

Authors: Debanjan Mahata, John Kuriakose, Rajiv Ratn Shah, Roger Zimmermann, John R. Talburt

Abstract: Keyword extraction is a fundamental task in natural language processing that facilitates mapping of documents to a concise set of representative single and multi-word phrases. Keywords from text documents are primarily extracted using supervised and unsupervised approaches. In this paper, we present an unsupervised technique that uses a combination of theme-weighted personalized PageRank algorithm and neural phrase embeddings for extracting and ranking keywords. We also introduce an efficient way of processing text documents and training phrase embeddings using existing techniques. We share an evaluation dataset derived from an existing dataset that is used for choosing the underlying embedding model. The evaluations for ranked keyword extraction are performed on two benchmark datasets comprising short abstracts (Inspec) and long scientific papers (SemEval 2010), and the technique is shown to produce results better than the state-of-the-art systems.

Submitted 16 July, 2018; originally announced July 2018.

Comments: preprint for paper accepted in Proceedings of 1st IEEE International Conference on Multimedia Information Processing and Retrieval

arXiv:1412.2857 [pdf]

Analysis of Maximum Likelihood and Mahalanobis Distance for Identifying Cheating Anchor Nodes

Authors: Jeril Kuriakose, Amruth V., Sandesh A. G., Jampu Venkata Naveenbabu, Mohammed Shahid, Ashish Shetty

Abstract: Malicious anchor nodes will constantly hinder genuine and appropriate localization. Discovering the malicious or vulnerable anchor node is an essential problem in wireless sensor networks (WSNs). In wireless sensor networks, anchor nodes are the nodes that know their current location. Neighboring nodes or non-anchor nodes calculate their location (or location reference) with the help of anchor nodes. Ingenuous localization is not possible in the presence of a cheating anchor node or a cheating node. Nowadays, it is a challenging task to identify the cheating anchor node or cheating node in a network. Even after finding out the location of the cheating anchor node, there is no assurance that the identified node is legitimate or not. This paper aims to localize the cheating anchor nodes using the trilateration algorithm and later associate it with the maximum likelihood expectation technique (MLE) and Mahalanobis distance to obtain maximum accuracy in identifying malicious or cheating anchor nodes during localization. We were able to attain a considerable reduction in the error achieved during localization. For implementation purposes, we simulated our scheme using the ns-3 network simulator.

Submitted 9 December, 2014; originally announced December 2014.

Comments: 10 pages, 13 pages, conference

arXiv:1411.5465 [pdf]

Identifying Cheating Anchor Nodes using Maximum Likelihood and Mahalanobis Distance

Authors: Jeril Kuriakose, V. Amruth, Swathy Nandhini, V. Abhilash

Abstract: Malicious anchor nodes will constantly hinder genuine and appropriate localization. Discovering the malicious or vulnerable anchor node is an essential problem in Wireless Sensor Networks (WSNs). In wireless sensor networks, anchor nodes are the nodes that know their current location. Neighbouring nodes or non-anchor nodes calculate their location (or location reference) with the help of anchor nodes. Ingenuous localization is not possible in the presence of a cheating anchor node or a cheating node. Nowadays, it's a challenging task to identify the cheating anchor node or cheating node in a network. Even after finding out the location of the cheating anchor node, there is no assurance that the identified node is legitimate or not. This paper aims to localize the cheating anchor nodes using the trilateration algorithm and later associate it with the maximum likelihood expectation technique (MLE) and Mahalanobis distance to obtain maximum accuracy in identifying malicious or cheating anchor nodes during localization. We were able to attain a considerable reduction in the error achieved during localization. For implementation purposes, we simulated our scheme using the ns-3 network simulator.

Submitted 20 November, 2014; originally announced November 2014.

Comments: 12 pages, 18 figures, IJSP. arXiv admin note: substantial text overlap with arXiv:1411.4437

arXiv:1411.4437 [pdf]

Sequestration of Malevolent Anchor Nodes in Wireless Sensor Networks using Mahalanobis Distance

Authors: Jeril Kuriakose, V. Amruth, Swathy Nandhini, V. Abhilash

Abstract: Discovering the malicious or vulnerable anchor node is an essential problem in wireless sensor networks (WSNs). In wireless sensor networks, anchor nodes are the nodes that know their current location. Neighbouring nodes or non-anchor nodes calculate their location coordinate (or location reference) with the help of anchor nodes. Ingenuous localization is not possible in the presence of a cheating anchor node or a cheating node. Nowadays, it is a challenging task to identify the cheating anchor node or cheating node in a network. Even after finding out the location of the cheating anchor node, there is no assurance that the identified node is legitimate or not. This paper aims to localize the cheating anchor nodes using the trilateration algorithm and later associate it with Mahalanobis distance to obtain maximum accuracy in detecting malicious or cheating anchor nodes during localization. We were able to attain a considerable reduction in the error achieved during localization. For implementation purposes, we simulated our scheme using the ns3 network simulator.

Submitted 17 November, 2014; originally announced November 2014.

Comments: 9 pages, 9 figures, ICC conference proceedings

arXiv:1410.8713 [pdf]

Localization in Wireless Sensor Networks: A Survey

Authors: Jeril Kuriakose, Sandeep Joshi, V. I. George

Abstract: Localization is widely used in Wireless Sensor Networks (WSNs) to identify the current location of the sensor nodes. A WSN consists of thousands of nodes, which makes the installation of GPS on each sensor node expensive; moreover, GPS may not provide exact localization results in an indoor environment. Manually configuring a location reference on each sensor node is also not possible for dense networks. This gives rise to a problem where the sensor nodes must identify their current location without using any special hardware like GPS and without the help of manual configuration. In this paper we review the localization techniques used by wireless sensor nodes to identify their current location.

Submitted 31 October, 2014; originally announced October 2014.

Comments: 3 pages, 3 figures, conference proceedings

Showing 1–11 of 11 results for author: Kuriakose, J