This repository provides an implementation of the paper Conformer: Convolution-augmented Transformer for Speech Recognition. It includes training scripts with support for distributed multi-GPU training via Lightning AI, and a Gradio web app for inference.
- Attention Is All You Need
- Conformer: Convolution-augmented Transformer for Speech Recognition
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- KenLM
- Boosting Sequence Generation Performance with Beam Search Language Model Decoding
git clone https://github.com/LuluW8071/Conformer.git
cd Conformer
Before installing dependencies, ensure the following are installed:
- CUDA Toolkit (For Training)
- PyTorch (CPU or GPU version)
- SoX:
sudo apt update
sudo apt install sox libsox-fmt-all build-essential zlib1g-dev libbz2-dev liblzma-dev
Install the remaining dependencies:
pip install -r requirements.txt
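Before moving on, it can be worth a quick, optional check that the GPU build of PyTorch is actually being picked up (the snippet below only assumes PyTorch is installed):

```python
# Optional sanity check before training: confirm PyTorch sees a CUDA device.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```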
To preprocess the Common Voice dataset:
python3 common_voice.py \
--file_path /path/to/validated.tsv \
--save_json_path converted_clips \
-w 4 \
--percent 10
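The manifest layout is defined by `common_voice.py`; as a quick sanity check you can load the generated file and print a sample entry. The snippet below is a hypothetical helper that handles both a JSON array and JSON-lines, since the exact format depends on the script:

```python
import json

# Hypothetical sanity check; adjust the path to whatever common_voice.py produced.
path = "converted_clips/train.json"

with open(path) as f:
    raw = f.read()

try:
    entries = json.loads(raw)  # a single JSON array
except json.JSONDecodeError:
    entries = [json.loads(line) for line in raw.splitlines() if line.strip()]  # JSON lines

print(f"{len(entries)} training samples")
print(entries[0])  # typically an audio path paired with its transcript
```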
To record your own voice, use Mimic Recording Studio and prepare the recordings for training:
python3 mimic_record.py \
--input_file /path/to/transcript.txt \
--output_dir /path/to/save \
--percent 20 \
--upsample 5 # Duplicate 5 times in train json only
Note: The `--upsample` flag duplicates entries in the train JSON only, to increase the number of training samples.
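Conceptually, upsampling here is plain duplication of the recorded entries in the training manifest; a rough sketch of the idea (not the script's exact code):

```python
# Illustrative only: with --upsample 5, each personal recording appears 5 times
# in the train manifest, while the validation/test split is left untouched.
entries = [
    {"key": "clips/rec_001.wav", "text": "hello world"},       # hypothetical entries
    {"key": "clips/rec_002.wav", "text": "testing one two"},
]
upsample = 5

train_entries = entries * upsample
print(len(train_entries))  # 2 recordings -> 10 training samples
```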
Combine personal recordings and datasets into a single JSON file:
python3 merge_jsons.py personal/train.json converted_clips/train.json \
--output merged_train.json
Perform the same operation for the validation JSON files.
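For example, assuming the preprocessing steps wrote validation manifests named `valid.json` (adjust the file names to whatever the earlier steps actually produced):

```bash
python3 merge_jsons.py personal/valid.json converted_clips/valid.json \
    --output merged_valid.json
```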
Before starting, add your Comet ML API key and project name to the `.env` file.
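Comet's client typically reads these values from environment variables; a minimal `.env` might look like the following (the exact variable names depend on how `train.py` loads them, so treat these as placeholders):

```
COMET_API_KEY=your_api_key_here
COMET_PROJECT_NAME=your_project_name
```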
To train the Conformer model:
python3 train.py \
    -g 4 \
    -w 8 \
    --epochs 100 \
    --batch_size 32 \
    -lr 4e-5 \
    --precision 16-mixed \
    --checkpoint_path /path/to/checkpoint.ckpt

- `-g`: number of GPUs
- `-w`: number of CPU workers
- `--epochs`: number of training epochs
- `--batch_size`: batch size
- `-lr`: learning rate
- `--precision`: mixed-precision mode (e.g. 16-mixed)
- `--checkpoint_path`: optional, resume training from a checkpoint
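For reference, here is a rough sketch of how these flags would typically map onto a Lightning `Trainer`; the actual model and data module classes live in this repository's `train.py`, so the names below are placeholders. Note that under DDP the batch size is usually per device, so 4 GPUs x 32 would give an effective batch size of 128 if that convention applies here.

```python
# Hypothetical sketch of how the CLI flags could map onto Lightning (names are placeholders).
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=4,              # -g: number of GPUs
    accelerator="gpu",
    strategy="ddp",         # distributed data-parallel training across the GPUs
    max_epochs=100,         # --epochs
    precision="16-mixed",   # --precision: mixed-precision training
)

# trainer.fit(conformer_module, datamodule=speech_data, ckpt_path="/path/to/checkpoint.ckpt")
```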
python3 engine.py \
    --checkpoint_path /path/to/checkpoint.ckpt
See the notebook for inference examples.
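The Gradio web app mentioned above wraps the trained model behind a simple audio-in, text-out interface. A minimal sketch of that pattern is shown below; `transcribe` is a placeholder standing in for this repository's actual inference entry point.

```python
# Minimal Gradio sketch: audio in, transcript out. transcribe() is a placeholder;
# plug in the trained Conformer checkpoint and CTC decoding here.
import gradio as gr

def transcribe(audio_path: str) -> str:
    # Load audio from audio_path, run the acoustic model, decode, return text.
    return "transcription goes here"

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath", label="Record or upload audio"),
    outputs=gr.Textbox(label="Transcript"),
    title="Conformer Speech Recognition",
)

if __name__ == "__main__":
    demo.launch()
```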
Dataset | Usage | Duration (Hours) | Description |
---|---|---|---|
Mozilla Common Voice 7.0 + Personal Recordings | Training | ~1855 + 20 | Crowd-sourced and personal audio recordings |
Mozilla Common Voice 7.0 + Personal Recordings | Validation | ~161 + 2 | Validation split (8%) |
LibriSpeech | Training | ~960 | Train-clean-100, Train-clean-360, Train-other-500 |
LibriSpeech | Validation | ~10.5 | Test-clean, Test-other |
**Note:** The model trained on the Mozilla Corpus shows a slightly higher WER than the one trained on LibriSpeech. Keep in mind, however, that the Mozilla validation set used here is about 15 times larger than the LibriSpeech validation set.
Dataset | WER (%) | Model Link |
---|---|---|
LibriSpeech | 22.94 | 🔗 |
Mozilla Corpus | 25.29 | 🔗 |
Expected WER with CTC + KenLM decoding: ~15%.
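As one illustration of the general recipe (not necessarily the exact decoder used here), CTC beam-search decoding with a KenLM language model can be wired up with `pyctcdecode`; the vocabulary and file names below are assumptions.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Illustrative character vocabulary in the same order as the model's output logits;
# "" marks the CTC blank. Replace with the actual label set used during training.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz '")

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.arpa",  # path to a trained KenLM model (.arpa or .bin); must exist
    alpha=0.5,                   # language-model weight
    beta=1.0,                    # word-insertion bonus
)

# log_probs: (time, vocab) log-softmax output of the acoustic model (dummy values here).
log_probs = np.log(np.full((100, len(labels)), 1.0 / len(labels)))
print(decoder.decode(log_probs, beam_width=100))
```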
@misc{gulati2020conformer,
      title={Conformer: Convolution-augmented Transformer for Speech Recognition},
      author={Anmol Gulati and James Qin and Chung-Cheng Chiu and Niki Parmar and Yu Zhang and Jiahui Yu and Wei Han and Shibo Wang and Zhengdong Zhang and Yonghui Wu and Ruoming Pang},
      year={2020},
      url={https://arxiv.org/abs/2005.08100}
}