Version 1 of this project implements five key objective metrics for evaluating the outputs of TTS systems.
Each metric compares a synthesized audio sample against a reference (real) recording, assessing aspects such as pitch accuracy, spectral similarity, and statistical features, with an accompanying visualization.
- DTW (Dynamic Time Warping)
- MCD (Mel Cepstral Distortion)
- MSD (Mel Spectral Distortion)
- F0 Frame Error
- Stat Moments (Kurtosis, Mean, STD)
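To make the comparison concrete, here is a minimal sketch of Mel Cepstral Distortion, one of the metrics listed above. It assumes MFCC matrices have already been extracted (e.g. with librosa) and aligned to equal length (e.g. via DTW); the function name, the exclusion of the 0th (energy) coefficient, and the scaling constant follow common convention and are assumptions, not necessarily what this project's `main.py` does.

```python
import numpy as np

def mel_cepstral_distortion(mfcc_ref, mfcc_syn):
    """MCD in dB between two MFCC sequences of shape (frames, coeffs).

    Both inputs must have the same number of frames (align with DTW
    first if they do not). The 0th coefficient, which mostly carries
    overall energy, is excluded by convention.
    """
    diff = mfcc_ref[:, 1:] - mfcc_syn[:, 1:]
    # Standard MCD scaling: 10 / ln(10) * sqrt(2)
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    # Per-frame Euclidean distance, averaged over all frames
    return const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
```

A lower MCD means the synthesized spectrum is closer to the reference; identical inputs give 0 dB.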
- Create a new conda environment
- Take reference audio and generate synthesized versions of it with any open-source TTS (I used LJSpeech as the reference set and Kokoro TTS for synthesis)
- Create a folder, and inside it create two subfolders named `synthesis` and `reference`
- Put your audio files in their respective folders
- Zip the folder and place the zip in the project root
- Run `main.py` and provide the path to the zip file and the directory to extract it to
- Uncomment the metric functions one at a time, running each and saving its visualization
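The extraction and file-pairing steps above can be sketched as follows. This is a minimal illustration assuming the folder layout described (top-level folder with `reference` and `synthesis` subfolders whose files share names); the function `extract_and_pair` is a hypothetical helper, not a function from this project's `main.py`.

```python
import zipfile
from pathlib import Path

def extract_and_pair(zip_path, extract_dir):
    """Extract the dataset zip and pair reference/synthesis files by name.

    Assumes the zip contains 'reference' and 'synthesis' folders with
    matching .wav file names; returns (reference, synthesis) path pairs.
    """
    extract_dir = Path(extract_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)
    refs = {p.name: p for p in extract_dir.rglob("reference/*.wav")}
    syns = {p.name: p for p in extract_dir.rglob("synthesis/*.wav")}
    # Only evaluate file names present in both folders
    return [(refs[n], syns[n]) for n in sorted(refs.keys() & syns.keys())]
```

Each returned pair can then be fed to the metric functions one at a time.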
- More metrics
- ASR-based phoneme matching for multiple languages