📘Introduction | 🛠️Preparation | 📊Benchmark | 🔮Inference | 🐯Overview |
Prithvi Yadav
This repository hosts the End-to-End Visual Speech Recognition model, the successor to End-to-End Audio-Visual Speech Recognition with Conformers. With this repository you can reach 19.1%, 1.0%, and 0.9% WER for automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR) on LRS3.
- Clone the repository and enter it:

```shell
git clone https://github.com/prithviyadav/LipRead.git
cd LipRead
```

- Set up the environment:

```shell
conda create -y -n autoavsr python=3.8
conda activate autoavsr
```

- Install pytorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:

```shell
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
```

- Download and extract a pre-trained model and/or language model from the model zoo to:

  - `./benchmarks/${dataset}/models`
  - `./benchmarks/${dataset}/language_models`
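As a sketch of that layout (the `lrs3` dataset name below is an assumption used only for illustration), creating the target directories looks like:

```shell
# Sketch of the expected model-zoo layout; `lrs3` is an assumed dataset name.
dataset=lrs3
mkdir -p "./benchmarks/${dataset}/models" "./benchmarks/${dataset}/language_models"
# Downloaded checkpoints are then extracted into these two directories.
ls -d "./benchmarks/${dataset}/models" "./benchmarks/${dataset}/language_models"
```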
- [For VSR and AV-ASR] Install RetinaFace or MediaPipe tracker.
To evaluate a trained model on a benchmark, run:

```shell
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
```

- `[config_filename]` is the model configuration path, located in `./configs`.
- `[labels_filename]` is the labels path, located in `${lipreading_root}/benchmarks/${dataset}/labels`.
- `[data_dir]` and `[landmarks_dir]` are the directories for the original dataset and the corresponding landmarks.
- `gpu_idx=-1` can be added to switch from `cuda:0` to `cpu`.
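For illustration, a filled-in evaluation command might look like the sketch below. Every filename and directory is a placeholder (none of them ship with this snippet), so the command is composed and printed rather than executed:

```shell
# Hypothetical example; all file and directory names are placeholders.
cmd="python eval.py config_filename=./configs/[config_filename] \
labels_filename=./benchmarks/[dataset]/labels/[labels_filename] \
data_dir=/path/to/dataset \
landmarks_dir=/path/to/landmarks"
echo "$cmd"
```

Run the printed command once the dataset and landmarks are in place.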
To run inference on a single file:

```shell
python infer.py config_filename=[config_filename] data_filename=[data_filename]
```

- `[data_filename]` is the path to the audio/video file.
- `detector=mediapipe` can be added to switch from the RetinaFace tracker to the MediaPipe tracker.
To crop the mouth region from a video:

```shell
python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
```

- `[dst_filename]` is the path where the cropped mouth video will be saved.
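Putting the two utilities together, a hypothetical crop-then-transcribe run could look like the sketch below; the clip names are placeholders, so the commands are printed here rather than executed:

```shell
# Hypothetical two-step pipeline; clip.mp4 and clip_mouth.mp4 are placeholder names.
crop_cmd="python crop_mouth.py data_filename=clip.mp4 dst_filename=clip_mouth.mp4"
infer_cmd="python infer.py config_filename=[config_filename] data_filename=clip_mouth.mp4"
printf '%s\n' "$crop_cmd" "$infer_cmd"
```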
We support a number of datasets for speech recognition: