Language Preservation through ASR
Developing Speech-to-Text tools for Newar and Dzardzongke

Alexander O’Neill, Marieke Meelen, Rolando Coto-Solano, Sonam Phuntsog & Charles Ramble
ao34@soas.ac.uk - mm986@cam.ac.uk

Stage 1 Fieldwork | Stage 2 Transcription | Stage 3 Training | Stage 4 Optimisation

Stage 1 - Fieldwork on Endangered Languages
• Half of the world’s 6500+ spoken languages will die out by the end of this century (Turin, 2007).
• Of Nepal’s 120+ distinct languages identified in the 2011 census, 60 languages are endangered (Moseley, 2010).
  – Risk factors include globalisation, political unrest, lack of governmental and educational support, and environmental challenges.

Preserving Endangered Languages
• Loss of languages means loss of cultural and religious identifiers.
• Methods and tools for preserving linguistic diversity are needed:
  – Documentation of existing oral and literary varieties.
  – Developing accessible NLP tools for automatic speech recognition (ASR) and handwritten text recognition (HTR).
  – Improving access to & creating new resources in the endangered language, e.g. dictionaries, orthographies, etc.

Fig. 1 - Map of Nepal and Tibet with a language tree of Tibeto-Burman languages

Stage 2 - Transcription bottleneck: 1 min. of audio takes 40+ mins. to transcribe!
• Low volume of transcriptions impacts results’ quality (Shi et al., 2021).
• Endangered languages’ irregular orthography complicates transcription.
• ASR tools struggle with no & low-resource languages (Foley et al., 2018).

Nepal Case Studies
Fig. 2 - ELAN files with segmentation & transcription (input and output)
→ While Newar is written using Devanagari (for consistency, we use IAST), it lacks standardised orthography (e.g. /jigu/ may be written <jiyu>).
→ Dzardzongke has no writing system, so we developed an orthography together with the community with a Standard Tibetan conversion.

Stage 3 - Training with Wav2Vec2 + manual Newar & Dzardzongke transcriptions
3h05m Newar / 3h15m Dzardzongke & 80/10/10 splits
Fig. 3 - Word Error Rates per epoch

Stage 4 - Optimisation: audio manipulation, dictionaries & transfer learning
We conducted an Error Analysis for the models with the best Character and Word Error Rates (CER & WER) at training epoch 2000 for Newar (CER 0.167) and 4000 for Dzardzongke (CER 0.06):
Fig. 4 - Results for Newar (2000) & Dzar. (4000)
After a thorough analysis of the results (bad, medium, and good above), we decided to prioritise the following optimisation strategies:
⇒ Audio manipulation through F0 normalisation and adding background noise (to follow).
⇒ Adding customised dictionaries (to follow).
⇒ Transfer learning using 6+ hours Standard Tibetan to supplement the related Dzardzongke after converting the Standard Tibetan orthography:
Fig. 5 - Improvement due to transfer learning
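
The Character and Word Error Rates (CER & WER) used in Stage 4 are conventionally defined as the edit distance between a reference transcription and the model output, divided by the length of the reference. The poster does not say which tool computed them, so the following is only a minimal sketch of that standard definition. The function names (`edit_distance`, `error_rate`, `cer`, `wer`) are hypothetical, and counting spaces as characters in the CER is an assumption.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units: Sequence, hyp_units: Sequence) -> float:
    """Edit distance normalised by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate; spaces count as characters (an assumption)."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    # The Newar spelling variants mentioned under Nepal Case Studies.
    print(cer("jigu", "jiyu"))  # 0.25: one of four characters differs
    print(wer("jigu", "jiyu"))  # 1.0: the single word counts as wrong
```

The `jigu`/`jiyu` pair shows why non-standardised orthography inflates the WER much more than the CER: a single variant letter makes the whole word count as an error.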