ManWav: The First Manchu ASR Model

Jean Seo, Minha Kang, Sungjoo Byun, Sangah Lee
Seoul National University
{seemdog, alsgk1123, byunsj, sanalee}@snu.ac.kr

Abstract

This study addresses the widening gap in Automatic Speech Recognition (ASR) research between high resource and extremely low resource languages, with a particular focus on Manchu, a critically endangered language. Manchu exemplifies the challenges faced by marginalized linguistic communities in accessing state-of-the-art technologies. In a pioneering effort, we introduce the first-ever Manchu ASR model, ManWav, leveraging Wav2Vec2-XLSR-53. The results of this first Manchu ASR model are promising, especially when it is trained with our augmented data. Wav2Vec2-XLSR-53 fine-tuned with augmented data demonstrates a 0.02 drop in CER and a 0.13 drop in WER compared to the same base model fine-tuned with original data.

1 Introduction

The landscape of Automatic Speech Recognition (ASR) research has centered around high resource languages such as English. This concentrated attention has deepened the divide in research progress between high and low resource languages. While research on English ASR encompasses diverse linguistic variations, including accented and noisy speech, the same cannot be said for many low resource languages, though a few foundational studies, including Safonova et al. (2022) and Zhou et al. (2022), exist. Astonishingly, not a single basic ASR model has been developed for Manchu to date, highlighting a critical void in linguistic inclusivity within the realm of ASR technology.

The development of a Manchu ASR model holds particular importance in the field of linguistics, as there are no more native speakers of Manchu. Consequently, the available data for linguistic study, whether text or audio, is limited and cannot be replenished. Therefore, it is crucial to maximize the utilization of existing data. However, due to the scarcity of individuals capable of transcribing Manchu audio data, unlabeled data remains unused. If transcribed, this data could prove to be an invaluable resource for Manchu research and preservation. Even if the performance of a Manchu ASR system is not perfect, it would be immensely helpful if it could provide draft transcriptions, which researchers could then revise and incorporate into their studies.

This paper sets out to address the significant gap between high and low resource languages by developing the inaugural Manchu ASR model. This endeavor is underscored by the scarcity of linguistic resources, prompting us to collect all existing Manchu audio data from Kim et al. (2008) in one channel. We try to maximize the cross-lingual capabilities of Wav2Vec2-XLSR-53 (Conneau et al., 2020) by fine-tuning the model with Manchu audio data. The performance of the Manchu ASR model is further enhanced through data augmentation.

The contributions of this study are as follows:

• Collecting Manchu audio data in a unified format and correcting the corresponding transcriptions
• Developing the very first Manchu ASR model with augmented data

2 Manchu Language

The Manchu language, a member of the Tungusic linguistic family, has its roots among the Manchu people of Northeast China and boasts a significant historical role as the official language of the Qing dynasty (1644-1912). Presently, the language confronts a dire state of endangerment, officially denoted a dead language with no more native speakers left.

There have been some efforts to employ technological solutions in the preservation and revitalization of Manchu. These endeavors include a Manchu spell checker (You, 2014), Manchu-Korean machine translation (Seo et al., 2023), and Manchu NER/POS tagging models (Lee et al., 2024). However, due to the paucity of data, these studies face challenges, and no ASR model has yet been developed.

3 Data

3.1 Materials

This study leverages Colloquial Manchu data provided by Kim et al. (2008), gathered as part of the ASK REAL project (Altaic Society of Korea, Researches on Endangered Altaic Languages; Choi et al., 2012). This audio data represents the dialect of Sanjiazi village, located in the Youyi Dowoerzu Manzu Ke’er-kezizu township, Fuyu county, Heilongjiang Province.

The recording took place from February 7th to 14th, 2006 in Qiqihar, Heilongjiang Province, with Mr. Meng Xianxiao (73 years old at the time). Although Chinese was his first language, Mr. Meng Xianxiao was well qualified to serve as the speaker, having acquired a comprehensive command of Manchu by the age of 12.

The data we use in this study consists of recordings of basic conversational expressions and of sentences for grammatical analysis. The two recordings are 32 minutes and 58 minutes long, respectively, for a total of 90 minutes. The corresponding transcriptions are provided by Kim et al. (2008) and were revised by a Manchu researcher from Seoul National University for better precision.

3.2 Transcription

The phoneme transcription system in this study is based on Kim et al. (2008). While it shares similarities with the International Phonetic Alphabet (IPA), our system incorporates some distinctions. Specifically, /b, d, g/ represent voiceless unaspirated stops, and /p, t, k/ denote voiceless aspirated stops. Notably, Colloquial Manchu lacks voiced stops, making this transcription system more practical than using the diacritic /ʰ/ to indicate aspiration. Next, /ǰ, č, š/ denote voiceless palatal sounds. In the IPA system, the corresponding sound symbols are [, ç, ]. However, /ǰ/ is not voiced, unlike [], and /č/ is the aspirated sound [čʰ]. Some examples can be found in Table 1.

Transcription	IPA
miŋ nj bitk sw.	miŋ ni pitk sw.
(Translation: My mother is a teacher.)
došn ǰo.	to\textipa zn d\textipa zo.
(Translation: Come on in.)

Table 1: Examples of our transcription, IPA, and corresponding translation.

3.3 Data Augmentation

The scarcity of speech datasets from native Manchu speakers presents a significant challenge, necessitating the adoption of various data augmentation methods. Audio data augmentation methods used to simulate different acoustic environments include:

• Additive noise: adding background noise to the audio samples.
• Clipping: cutting short the audio signals.
• Reverberation: applying reverberation effects.
• Time dropout: randomly removing segments of the audio.

By implementing the above techniques with WavAugment (https://github.com/facebookresearch/WavAugment) provided by Kharitonov et al. (2020), we expand the dataset by 100% with each technique, for a total of 400%, significantly enriching the available training data. Notably, data augmentation is applied after the separation of train and test data, which prevents overlap between the train and test sets and ensures more reliable test results. The size of the data before and after augmentation is given in Table 2.

Split	Before Augmentation	After Augmentation
train	81 min	326.5 min
test	9.5 min	9.5 min

Table 2: The duration of audio files (.wav) in minutes before and after augmentation.
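The four augmentation types listed above can be illustrated with a dependency-free sketch. This is not the WavAugment implementation: the function names, the parameter values, and the simple feedback-echo stand-in for reverberation are all assumptions made for illustration, and "clipping" is read here as amplitude clamping, which is one possible interpretation. The sketch applies each technique once to every training clip, so the augmented set is four times the size of the training set, and the test set is never touched.

```python
import math
import random


def additive_noise(samples, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def clip(samples, factor):
    """Clamp amplitudes to +/- factor * peak (assumed reading of 'clipping')."""
    limit = factor * max(abs(s) for s in samples)
    return [max(-limit, min(limit, s)) for s in samples]


def reverb(samples, delay, decay):
    """Crude reverberation: feed back a delayed, attenuated copy of the output."""
    out = list(samples)
    for i in range(delay, len(out)):
        out[i] += decay * out[i - delay]
    return out


def time_dropout(samples, max_len, rng):
    """Zero out one random contiguous segment of at most max_len samples."""
    length = rng.randint(1, min(max_len, len(samples)))
    start = rng.randint(0, len(samples) - length)
    return samples[:start] + [0.0] * length + samples[start + length:]


def augment_train_set(train_clips, seed=0):
    """Return one augmented copy of every training clip per technique."""
    rng = random.Random(seed)
    augmented = []
    for clip_samples in train_clips:
        augmented.append(additive_noise(clip_samples, snr_db=15, rng=rng))
        augmented.append(clip(clip_samples, factor=0.5))
        augmented.append(reverb(clip_samples, delay=800, decay=0.4))
        augmented.append(time_dropout(clip_samples, max_len=1600, rng=rng))
    return augmented


# Toy data: two one-second 16 kHz sine waves stand in for real recordings.
train = [[math.sin(2 * math.pi * f * t / 16000) for t in range(16000)]
         for f in (220, 440)]
augmented = augment_train_set(train)
print(len(train), "original clips ->", len(augmented), "augmented clips")
```

Because the split happens first and only `train` is passed to `augment_train_set`, no augmented variant of a test utterance can leak into training.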

4 Experiment

4.1 Models

Wav2Vec2-XLSR-53 (Conneau et al., 2020) is used as the base model. Wav2Vec2-XLSR-53 is a multilingual self-supervised learning (SSL) model from Meta AI (https://ai.meta.com/) pre-trained on 53 languages. We fine-tune Wav2Vec2-XLSR-53 on two versions of the data, yielding two separate fine-tuned models: one trained on the original Manchu data and the other on the augmented Manchu data. We name the model trained on augmented data ManWav. Fine-tuning is conducted with HuggingSound (Grosman, 2022).

4.2 Experimental Setup

Our experiments are conducted on an NVIDIA A100 GPU. We fine-tune our models with a learning rate of 3e-4, a batch size of 16, and a dropout rate of 0.1. We train Wav2Vec2-XLSR-53 on the 400% augmented data for 1 epoch. Wav2Vec2-XLSR-53 with the original data, in contrast, is trained for 5 epochs, so that both models see a comparable amount of training data for a fair comparison.
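The bookkeeping behind this comparison can be written down as a short sketch. `FineTuneRun` is a hypothetical helper introduced only for illustration (it is not a HuggingSound class), and the durations are the rounded values from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneRun:
    """Hypothetical container for one fine-tuning run; not a HuggingSound class."""
    name: str
    train_minutes: float  # training audio duration, taken from Table 2
    epochs: int
    learning_rate: float = 3e-4
    batch_size: int = 16
    dropout: float = 0.1

    def audio_minutes_seen(self) -> float:
        """Total minutes of audio the model is exposed to during training."""
        return self.train_minutes * self.epochs


original = FineTuneRun("original", train_minutes=81.0, epochs=5)
augmented = FineTuneRun("augmented (ManWav)", train_minutes=326.5, epochs=1)

for run in (original, augmented):
    print(f"{run.name}: {run.audio_minutes_seen():.1f} min of audio seen")
```

With these rounded durations the totals come out at 405 and 326.5 minutes, so the two runs are of similar rather than exactly equal size.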

5 Result and Discussion

5.1 Result

We use Character Error Rate (CER) and Word Error Rate (WER) as evaluation metrics. CER assesses the accuracy of character transcription, while WER measures the correctness of word recognition. For both metrics, scores closer to 0 indicate better performance. WER and CER are the most common metrics for gauging the overall performance of ASR systems.
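Both metrics are edit distances normalized by the length of the reference. The following is a minimal sketch, assuming the standard Levenshtein formulation with unit costs and counting spaces as characters for CER; the exact scoring script used for the reported numbers is not specified, so the details here are assumptions. The example pair is one occurrence of the phrase in the last row of Table 4.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "odun gjak šawulo"
hypothesis = "odun gjak šaxulo"
print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"WER = {wer(reference, hypothesis):.3f}")
```

A single wrong character costs one character edit but also makes the whole word wrong, which is why WER is typically much higher than CER on the same output.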

The experimental results demonstrate the value of data augmentation in fine-tuning the base model. As shown in Table 3, using augmented data at the training stage clearly improves performance, lowering CER by 0.02 and WER by 0.13, which indicates the effectiveness of the augmentation described in Section 3.3.

Moreover, Table 4 shows the promising capabilities of ManWav in the Manchu speech recognition task. The achieved accuracy is particularly noteworthy given the limited availability of Manchu speech data and considering that Wav2Vec2-XLSR-53 is not initially pre-trained on Manchu.

Data Augmentation	CER	WER
before	0.13	0.44
after	0.11	0.31

Table 3: The performance of Wav2Vec2-XLSR-53 trained on data before and after augmentation.
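The absolute drops in Table 3 can also be expressed relative to the baseline. The short sketch below does that arithmetic on the rounded scores from the table; `relative_reduction` is a helper name introduced here for illustration.

```python
def relative_reduction(before, after):
    """Fraction of the baseline error rate removed by the change."""
    return (before - after) / before


# Rounded scores from Table 3: (before augmentation, after augmentation)
scores = {"CER": (0.13, 0.11), "WER": (0.44, 0.31)}

for metric, (before, after) in scores.items():
    absolute = before - after
    relative = relative_reduction(before, after)
    print(f"{metric}: {absolute:.2f} absolute, {relative:.1%} relative")
```

On these rounded figures the relative reduction is roughly 15% for CER and roughly 30% for WER.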

Model Prediction	Actual Transcription
si jawuči bi gl jaam si jawuči bi gl jaam	si jawuči bi gl jaam si jawuči bi gl jaam
tl am dulk ani mk iči bo alx	tl am dulk ani mk iči bo alx
bi sajw wak bi sajw wak	bi sajw wak bi sajw wak
bi sisk bitk xolal ba d jom mutulko	bi sisk bitk xolal ba d jom mutulko
min do bitk xolal ba joxo	min do bitk xolal ba joxo
odun gjak šaxulo odun gjak šaxulo	odun gjak šawulo odun gjak šawulo

Table 4: Examples of inference results from ManWav. Wrong predictions are marked red and the corresponding answers are marked blue.

5.2 Linguistic Analysis

Taking into account the linguistic characteristics of Manchu, we classify the most common errors in ManWav into the following four categories: (1) confusion involving //, (2) confusion and nasalizing of nasal sounds in word-final positions, (3) assimilation between stops, and (4) confusion between /w/ and /x/.

First, there are some uncaptured or mismatched // sounds in the inference results, particularly in word-final position or between sonorants (e.g., /l/) and stops. This occurs because // can be neutralized with other vowels or even deleted, posing challenges for accurate transcription. As shown in Table 4, the locative marker de and am ‘dad’ are sometimes captured as d and am, indicating apocope of //. The loss of // is also evident in dulke, which originally included // between the sonorant /l/ and the stop /k/.

Second, nasal sounds /n/ and /m/ in word-final positions are frequently overlooked during inference. This could be attributed to the nature of nasal sounds, as they tend to be fused with subsequent vowels, resulting in nasalized vowels, or they may be omitted altogether. The word gunin ‘thought’ is an instance of this phenomenon. It is often transcribed as gunim, where the final /n/ appears as /m/. The occurrence of nasal stops can sometimes be mistaken for the deletion of the nasalized preceding vowel. For example, the /n/ sound in ilan ‘three’ typically nasalizes the following vowels and then is deleted. However, our model erroneously retained the nasal sound in the transcription ilan, preserving the final /n/.

Third, the inference results contain pairs that have undergone assimilation in place of articulation. These pairs were not transcribed as assimilated forms, but this kind of assimilation is a highly productive phenomenon in natural languages. For instance, the /mg/ sequence in damgu ‘tobacco’ became /ŋg/ in our inference results. This is unsurprising since both /ŋ/ and /g/ are velar whereas /m/ is bilabial.

Lastly, confusion between intervocalic /w/ and /x/ is frequently observed. To be specific, šawulo ‘cold’ is recognized as šaxulo in our model. Given that /w/ is the labial approximant and /x/ is the palatal approximant, it can be noted that these two sounds occupy distinct articulatory positions. However, there is no equivalent unvoiced sound for /w/, and discerning the voicing of approximants becomes challenging when they are in intervocalic positions.

The above four types of mismatch and corresponding examples are elaborated in Table 5.

Mismatch Types	Examples
(1) / __#, R__C	d : d, am : am, dulk : dulk
(2) n, m / __#	gunin : gunim, ilan : ila
(3) assimilation	damgu : daŋgu
(4) w : x / V__V	šaxulo : šawulo

Table 5: Observed mismatch examples from the inference results written in phonological notations. R refers to sonorants, C consonants, and V vowels. # means boundary of words; __# means word-final position.
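Mismatches of this kind can be surfaced automatically by aligning each reference with its prediction at the character level. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; this is an illustrative choice, not necessarily how the examples in Table 5 were collected, and `char_mismatches` is a hypothetical helper. The word pairs follow the descriptions in the text above.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_mismatches(reference, prediction):
    """Yield (reference_span, predicted_span) for every non-matching region."""
    matcher = SequenceMatcher(None, reference, prediction, autojunk=False)
    for tag, r0, r1, p0, p1 in matcher.get_opcodes():
        if tag != "equal":
            yield reference[r0:r1], prediction[p0:p1]


# (reference, prediction) pairs as described in Section 5.2
pairs = [("šawulo", "šaxulo"), ("damgu", "daŋgu"), ("gunin", "gunim")]

counts = Counter(m for ref, pred in pairs for m in char_mismatches(ref, pred))
for (ref_chars, pred_chars), n in counts.most_common():
    # An empty span means a pure insertion or deletion.
    print(f"{ref_chars or '(none)'} -> {pred_chars or '(none)'}: {n}")
```

Counting these spans over a whole test set would give a frequency-ranked list of confusions that can then be grouped by phonological context.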

6 Related Work

6.1 ASR research in low-resource languages

There exist some endeavors to apply ASR to low-resource languages. For example, Safonova et al. (2022) collect a speech dataset in the Chukchi language and train an XLSR model. Similarly, Qin et al. (2022) improve low-resource Tibetan ASR, while Jimerson and Prud’hommeaux (2018) introduce a fully functional ASR system tailored for Seneca, an endangered indigenous language of North America. Singh et al. (2023) propose an effective self-training approach capable of generating accurate pseudo-labels for unlabeled low-resource speech, particularly for the Punjabi language. Furthermore, Zhou et al. (2022) explore training strategies for efficient data utilization, and Bartelds et al. (2023) investigate data augmentation methods to enhance ASR systems for low-resource scenarios. Other efforts toward multilingual ASR or adaptation to low-resource scenarios include the Kaldi toolkit (https://kaldi-asr.org/index.html) and the IARPA Babel project (https://www.iarpa.gov/research-programs/babel). However, as an extremely endangered language, Manchu has been isolated from all these efforts.

6.2 Wav2Vec 2.0

The core innovation of Wav2Vec 2.0 (Baevski et al., 2020) lies in its ability to effectively capture the contextual information in speech through its Transformer-based architecture (Vaswani et al., 2023). Wav2Vec 2.0 leverages self-supervised training, allowing the training of an ASR model with a minimal amount of labeled data, provided there is an ample supply of unlabeled data. Wav2Vec 2.0 is effective not only in capturing diverse dialects but also in accommodating various languages. XLSR (Conneau et al., 2020) is built on Wav2Vec 2.0 and learns cross-lingual speech representations from the raw waveform of speech in multiple languages. XLSR-53 in particular is pretrained on 53 languages and is fine-tuned for speech recognition with Connectionist Temporal Classification (CTC). CTC is a technique used in encoder-only Transformer models such as Wav2Vec 2.0, HuBERT (Hsu et al., 2021) and M-CTC-T (Lugosch et al., 2022).
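To make the role of CTC concrete, the sketch below shows greedy CTC decoding on a toy sequence of per-frame labels: consecutive repeats are merged first, and blank symbols are then removed. The blank symbol `_` and the frame sequence are illustrative assumptions; a real model emits one label distribution per audio frame, and the most likely label per frame would be taken before this step.

```python
BLANK = "_"  # stand-in for the CTC blank token (an assumption of this sketch)


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return "".join(output)


# A blank between the two runs of "a" keeps them as separate characters.
print(ctc_greedy_decode("_jj_aa_a_mm_"))
# Without the blank, the repeated "a" frames merge into one character.
print(ctc_greedy_decode("jjaaaamm"))
```

The blank symbol is what lets the model output genuinely doubled characters, such as the long vowel in jaam, while still allowing a single sound to span several frames.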

7 Conclusion and Future Work

As an extremely low resource language, Manchu has often been overlooked in linguistic technology. In an effort to maximize the utilization of available Manchu data, the development of an ASR system is essential. We introduce ManWav, which involves fine-tuning Wav2Vec2-XLSR-53 on augmented Manchu audio data, with the aim of providing a valuable tool for the study and preservation of Manchu. As the addition of a decoder to an ASR model is known to boost the inference performance (Karita et al., 2019; Zeyer et al., 2019), enhancing the inference quality with the help of a language model should be studied in the future.

Limitations

The primary constraint of this research lies in the scarcity of Manchu audio data. As the audio data used in this research consists only of Colloquial Manchu from one speaker, ManWav is unlikely to perform optimally in other domains, given that ASR models are usually heavily domain-dependent.

Ethics Statement

The project paves the way for further innovations in the field and emphasizes the importance of inclusivity in technological advancements, ensuring that the benefits of state-of-the-art technologies are accessible to all linguistic groups, regardless of their resource status. To support further ASR studies on endangered languages, we plan to release ManWav publicly.

References

Baevski et al. (2020) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations.
Bartelds et al. (2023) Martijn Bartelds, Nay San, Bradley McDonnell, Dan Jurafsky, and Martijn Wieling. 2023. Making more of little data: Improving low-resource automatic speech recognition using data augmentation.
Choi et al. (2012) Wonho Choi, Hyunjo You, and Juwon Kim. 2012. The documentation of endangered Altaic languages and the creation of a digital archive to safeguard linguistic diversity. International Journal of Intangible Heritage, 0(7):103–111.
Conneau et al. (2020) Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition.
Grosman (2022) Jonatas Grosman. 2022. HuggingSound: A toolkit for speech-related tasks based on Hugging Face’s tools. https://github.com/jonatasgrosman/huggingsound.
Hsu et al. (2021) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.
Jimerson and Prud’hommeaux (2018) Robbie Jimerson and Emily Prud’hommeaux. 2018. ASR for documenting acutely under-resourced indigenous languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Karita et al. (2019) Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, and Wangyou Zhang. 2019. A comparative study on Transformer vs RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE.
Kharitonov et al. (2020) Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, and Emmanuel Dupoux. 2020. Data augmenting contrastive learning of speech representations in the time domain.
Kim et al. (2008) Juwon Kim, Dongho Ko, Chaoke D. O., and Boldyrev B. V. Han Youfeng, Piao Lianyu. 2008. Materials of Spoken Manchu. Seoul National University Press.
Lee et al. (2024) Sangah Lee, Sungjoo Byun, Jean Seo, and Minha Kang. 2024. ManNER & ManPOS: Pioneering NLP for endangered Manchu language. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11030–11039, Torino, Italia. ELRA and ICCL.
Lugosch et al. (2022) Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. 2022. Pseudo-labeling for massively multilingual speech recognition.
Qin et al. (2022) S. Qin, L. Wang, S. Li, Longbiao Wang, Sheng Li, Jianwu Dang, and Lixin Pan. 2022. Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling. EURASIP Journal on Audio, Speech, and Music Processing, 2022.
Safonova et al. (2022) Anastasia Safonova, Tatiana Yudina, Emil Nadimanov, and Cydnie Davenport. 2022. Automatic speech recognition of low-resource languages based on Chukchi.
Seo et al. (2023) Jean Seo, Sungjoo Byun, Minha Kang, and Sangah Lee. 2023. Mergen: The first Manchu-Korean machine translation model trained on augmented data.
Singh et al. (2023) Satwinder Singh, Feng Hou, and Ruili Wang. 2023. A novel self-training approach for low-resource speech recognition.
Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention is all you need.
You (2014) Hyun-Jo You. 2014. A Manchu speller: With a practical introduction to the natural language processing of minority languages. Altai Hakpo, 24:39–67.
Zeyer et al. (2019) Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schluter, and Hermann Ney. 2019. A comparison of Transformer and LSTM encoder decoder models for ASR. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 8–15.
Zhou et al. (2022) Zhikai Zhou, Wei Wang, Wangyou Zhang, and Yanmin Qian. 2022. Exploring effective data utilization for low-resource speech recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8192–8196.