Kathleen Siminyu

Papers by Kathleen Siminyu

Responsible Artificial Intelligence in Sub-Saharan Africa: Landscape and State of Play

Road map for research on responsible artificial intelligence for development (AI4D) in African countries: The case study of agriculture

Patterns

Masakhane - Machine Translation For Africa

Africa has over 2000 languages. Despite this, African languages account for a small portion of available resources and publications in Natural Language Processing (NLP). This is due to multiple factors, including: a lack of focus from government and funding, discoverability, a lack of community, sheer language complexity, difficulty in reproducing papers and no benchmarks to compare techniques. To begin to address the identified problems, MASAKHANE, an open-source, continent-wide, distributed, online research effort for machine translation for African languages, was founded. In this paper, we discuss our methodology for building the community and spurring research from the African continent, as well as outline the success of the community in terms of addressing the identified problems affecting African NLP.

1st AfricaNLP Workshop Proceedings, 2020

Proceedings of the 1st AfricaNLP Workshop held on 26th April alongside ICLR 2020, Virtual Confere...

AI4D - African Language Program

ArXiv, 2021

Advances in speech and language technologies enable tools such as voice-search, text-to-speech, speech recognition and machine translation. These are however only available for high resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D African Language Program, a 3-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets throu...

Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments

ArXiv, 2021

There is growing interest in ASR systems that can recognize phones in a language-independent fashion. There is additionally interest in building language technologies for low-resource and endangered languages. However, there is a paucity of realistic data that can be used to test such systems and technologies. This paper presents a publicly available, phonetically transcribed corpus of 2255 utterances (words and short phrases) in the endangered Tangkhulic language East Tusom (no ISO 639-3 code), a Tibeto-Burman language variety spoken mostly in India. Because the dataset is transcribed in terms of phones, rather than phonemes, it is a better match for universal phone recognition systems than many larger (phonemically transcribed) datasets. This paper describes the dataset and the methodology used to produce it. It further presents basic benchmarks of state-of-the-art universal phone recognition systems on the dataset as baselines for future experiments.

Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties

Interspeech 2021