This repository provides an implementation of the paper Conformer: Convolution-augmented Transformer for Speech Recognition. It includes training scripts with support for distributed multi-GPU training via Lightning AI, and a Gradio web app for inference.
- Attention Is All You Need
- Conformer: Convolution-augmented Transformer for Speech Recognition
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- KenLM
- Boosting Sequence Generation Performance with Beam Search Language Model Decoding
git clone https://github.com/LuluW8071/Conformer.git
cd Conformer
Before installing dependencies, ensure the following are installed:
- CUDA Toolkit (For Training)
- PyTorch (CPU or GPU version)
- SoX:
sudo apt update
sudo apt install sox libsox-fmt-all build-essential zlib1g-dev libbz2-dev liblzma-dev
Install the remaining dependencies:
pip install -r requirements.txt
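Before moving on, it can be worth a quick, optional check that the GPU build of PyTorch is actually being picked up (the snippet below only assumes PyTorch is installed):

```python
# Optional sanity check before training: confirm PyTorch sees a CUDA device.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```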
To preprocess the Common Voice dataset:
python3 common_voice.py \
--file_path /path/to/validated.tsv \
--save_json_path converted_clips \
-w 4 \
--percent 10
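The manifest layout is defined by `common_voice.py`; as a quick sanity check you can load the generated file and print a sample entry. The snippet below is a hypothetical helper that handles both a JSON array and JSON-lines, since the exact format depends on the script:

```python
import json

# Hypothetical sanity check; adjust the path to whatever common_voice.py produced.
path = "converted_clips/train.json"

with open(path) as f:
    raw = f.read()

try:
    entries = json.loads(raw)  # a single JSON array
except json.JSONDecodeError:
    entries = [json.loads(line) for line in raw.splitlines() if line.strip()]  # JSON lines

print(f"{len(entries)} training samples")
print(entries[0])  # typically an audio path paired with its transcript
```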
To record your own voice, use Mimic Recording Studio and prepare the recordings for training:
python3 mimic_record.py \
--input_file /path/to/transcript.txt \
--output_dir /path/to/save \
--percent 20 \
--upsample 5 # Duplicate 5 times in train json only
Note: The `--upsample` flag duplicates entries in the train JSON only, to increase the number of training samples.
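Conceptually, upsampling here is plain duplication of the recorded entries in the training manifest; a rough sketch of the idea (not the script's exact code):

```python
# Illustrative only: with --upsample 5, each personal recording appears 5 times
# in the train manifest, while the validation/test split is left untouched.
entries = [
    {"key": "clips/rec_001.wav", "text": "hello world"},       # hypothetical entries
    {"key": "clips/rec_002.wav", "text": "testing one two"},
]
upsample = 5

train_entries = entries * upsample
print(len(train_entries))  # 2 recordings -> 10 training samples
```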
Combine personal recordings and datasets into a single JSON file:
python3 merge_jsons.py personal/train.json converted_clips/train.json \
--output merged_train.json
Perform the same operation for the validation JSON files.
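For example, assuming the preprocessing steps wrote validation manifests named `valid.json` (adjust the file names to whatever the earlier steps actually produced):

```bash
python3 merge_jsons.py personal/valid.json converted_clips/valid.json \
    --output merged_valid.json
```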
Before starting, add your Comet ML API key and project name to the `.env` file.
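Comet's client typically reads these values from environment variables; a minimal `.env` might look like the following (the exact variable names depend on how `train.py` loads them, so treat these as placeholders):

```
COMET_API_KEY=your_api_key_here
COMET_PROJECT_NAME=your_project_name
```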
To train the Conformer model:
python3 train.py \
    -g 4 \
    -w 8 \
    --epochs 100 \
    --batch_size 32 \
    -lr 4e-5 \
    --precision 16-mixed \
    --checkpoint_path /path/to/checkpoint.ckpt

- `-g`: number of GPUs
- `-w`: number of CPU workers
- `--epochs`: number of training epochs
- `--batch_size`: batch size
- `-lr`: learning rate
- `--precision`: mixed-precision mode (e.g. 16-mixed)
- `--checkpoint_path`: optional, resume training from a checkpoint
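For reference, here is a rough sketch of how these flags would typically map onto a Lightning `Trainer`; the actual model and data module classes live in this repository's `train.py`, so the names below are placeholders. Note that under DDP the batch size is usually per device, so 4 GPUs x 32 would give an effective batch size of 128 if that convention applies here.

```python
# Hypothetical sketch of how the CLI flags could map onto Lightning (names are placeholders).
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=4,              # -g: number of GPUs
    accelerator="gpu",
    strategy="ddp",         # distributed data-parallel training across the GPUs
    max_epochs=100,         # --epochs
    precision="16-mixed",   # --precision: mixed-precision training
)

# trainer.fit(conformer_module, datamodule=speech_data, ckpt_path="/path/to/checkpoint.ckpt")
```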
python3 engine.py \
    --checkpoint_path /path/to/checkpoint.ckpt
See the notebook for inference examples.
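The Gradio web app mentioned above wraps the trained model behind a simple audio-in, text-out interface. A minimal sketch of that pattern is shown below; `transcribe` is a placeholder standing in for this repository's actual inference entry point.

```python
# Minimal Gradio sketch: audio in, transcript out. transcribe() is a placeholder;
# plug in the trained Conformer checkpoint and CTC decoding here.
import gradio as gr

def transcribe(audio_path: str) -> str:
    # Load audio from audio_path, run the acoustic model, decode, return text.
    return "transcription goes here"

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath", label="Record or upload audio"),
    outputs=gr.Textbox(label="Transcript"),
    title="Conformer Speech Recognition",
)

if __name__ == "__main__":
    demo.launch()
```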
Dataset | Usage | Duration (Hours) | Description |
---|---|---|---|
Mozilla Common Voice 7.0 + Personal Recordings | Training | ~1855 + 20 | Crowd-sourced and personal audio recordings |
Mozilla Common Voice 7.0 + Personal Recordings | Validation | ~161 + 2 | Validation split (8%) |
LibriSpeech | Training | ~960 | Train-clean-100, Train-clean-360, Train-other-500 |
LibriSpeech | Validation | ~10.5 | Test-clean, Test-other |
**Note:** The model trained on the Mozilla Corpus shows a slightly higher WER than the one trained on LibriSpeech. Keep in mind, however, that the Mozilla validation set used here is about 15 times larger than the LibriSpeech validation set.
Dataset | WER (%) | Model Link |
---|---|---|
LibriSpeech | 22.94 | 🔗 |
Mozilla Corpus | 25.29 | 🔗 |
Expected WER with CTC + KenLM decoding: ~15%.
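As one illustration of the general recipe (not necessarily the exact decoder used here), CTC beam-search decoding with a KenLM language model can be wired up with `pyctcdecode`; the vocabulary and file names below are assumptions.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Illustrative character vocabulary in the same order as the model's output logits;
# "" marks the CTC blank. Replace with the actual label set used during training.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz '")

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.arpa",  # path to a trained KenLM model (.arpa or .bin); must exist
    alpha=0.5,                   # language-model weight
    beta=1.0,                    # word-insertion bonus
)

# log_probs: (time, vocab) log-softmax output of the acoustic model (dummy values here).
log_probs = np.log(np.full((100, len(labels)), 1.0 / len(labels)))
print(decoder.decode(log_probs, beam_width=100))
```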
@misc{gulati2020conformer,
      title={Conformer: Convolution-augmented Transformer for Speech Recognition},
      author={Anmol Gulati and James Qin and Chung-Cheng Chiu and Niki Parmar and Yu Zhang and Jiahui Yu and Wei Han and Shibo Wang and Zhengdong Zhang and Yonghui Wu and Ruoming Pang},
      year={2020},
      url={https://arxiv.org/abs/2005.08100}
}