# Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit

<div>
  <a href=""><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg"></a>
  <a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
  <a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
  <a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
  <a href="egs/vocoder/README.md"><img src="https://img.shields.io/badge/README-Vocoder-purple"></a>
  <a href="egs/metrics/README.md"><img src="https://img.shields.io/badge/README-Evaluation-yellow"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/LICENSE-MIT-red"></a>
</div>
<br>

**Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation.** Its purpose is to support reproducible research and to help junior researchers and engineers get started in audio, music, and speech generation research and development. Amphion offers a unique feature: **visualizations** of classic models and architectures, which we believe help newcomers gain a better understanding of each model.

**The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio.** Amphion is designed to support individual generation tasks, including but not limited to:

- **TTS**: Text to Speech (⛳ supported)
- **SVS**: Singing Voice Synthesis (👨‍💻 developing)
- **VC**: Voice Conversion (👨‍💻 developing)
- **SVC**: Singing Voice Conversion (⛳ supported)
- **TTA**: Text to Audio (⛳ supported)
- **TTM**: Text to Music (👨‍💻 developing)
- more…

In addition to the specific generation tasks, Amphion also includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for fair and reproducible comparisons across generation tasks.

## 🚀 News

- **2023/11/28**: Amphion alpha release

## ⭐ Key Features

### TTS: Text to Speech

- Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
  - [FastSpeech2](https://arxiv.org/abs/2006.04558): A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
  - [VITS](https://arxiv.org/abs/2106.06103): An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
  - [Vall-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
  - [NaturalSpeech2](https://arxiv.org/abs/2304.09116): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

### SVC: Singing Voice Conversion

- Amphion supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). Their specific roles in SVC have been investigated in our NeurIPS 2023 workshop paper. [paper](https://arxiv.org/abs/2310.11160) [recipe](egs/svc/MultipleContentsSVC)
- Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE-, and flow-based models. The diffusion-based architecture uses [Bidirectional dilated CNN](https://openreview.net/pdf?id=a-xFK8Ymz5J) as a backend and supports several sampling algorithms, such as [DDPM](https://arxiv.org/pdf/2006.11239.pdf), [DDIM](https://arxiv.org/pdf/2010.02502.pdf), and [PNDM](https://arxiv.org/pdf/2202.09778.pdf). Additionally, it supports single-step inference based on the [Consistency Model](https://openreview.net/pdf?id=FmqFfMTNnv).
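
The diffusion samplers named above all share the same ancestral-sampling skeleton. The sketch below is a minimal, NumPy-only illustration of DDPM ancestral sampling (not Amphion's implementation): for 1-D standard-normal data the optimal noise predictor has a closed form, which stands in for a trained network so the loop can run end to end.

```python
# Toy DDPM ancestral sampling loop. The analytic eps_hat below is the
# MMSE-optimal noise predictor when the data distribution is N(0, 1);
# it plays the role a trained denoising network would play in practice.
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.05, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_hat(x_t, t):
    # For x_0 ~ N(0, 1): E[eps | x_t] = sqrt(1 - alpha_bar_t) * x_t.
    return np.sqrt(1.0 - alpha_bar[t]) * x_t

x = rng.standard_normal(20000)       # start from the prior x_T ~ N(0, 1)
for t in range(T - 1, -1, -1):
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x - coef * eps_hat(x, t)) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise

print(x.mean(), x.std())             # both should be close to 0 and 1
```

DDIM and PNDM reuse the same `eps_hat` interface but replace the stochastic update with (mostly) deterministic steps, which is what enables fewer sampling iterations.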

### TTA: Text to Audio

- Amphion supports TTA with a latent diffusion model, designed like [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830). It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. [paper](https://arxiv.org/abs/2304.00830) [recipe](egs/tta/RECIPE.md)

### Vocoder

- Amphion supports various widely-used neural vocoders, including:
  - GAN-based vocoders: [MelGAN](https://arxiv.org/abs/1910.06711), [HiFi-GAN](https://arxiv.org/abs/2010.05646), [NSF-HiFiGAN](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts), [BigVGAN](https://arxiv.org/abs/2206.04658), [APNet](https://arxiv.org/abs/2305.07952).
  - Flow-based vocoders: [WaveGlow](https://arxiv.org/abs/1811.00002).
  - Diffusion-based vocoders: [Diffwave](https://arxiv.org/abs/2009.09761).
  - Auto-regressive based vocoders: [WaveNet](https://arxiv.org/abs/1609.03499), [WaveRNN](https://arxiv.org/abs/1802.08435v1).
- Amphion provides the official implementation of the [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957). It can be used to enhance any GAN-based vocoder during training while keeping the inference stage (e.g., memory footprint and speed) unchanged. [paper](https://arxiv.org/abs/2311.14957) [recipe](egs/vocoder/gan/tfr_enhanced_hifigan)
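
For context on the task these neural vocoders solve (spectrogram in, waveform out), here is a NumPy-only sketch of Griffin-Lim, the classic non-neural phase-reconstruction baseline. It is purely illustrative and is not one of Amphion's vocoders; all sizes and parameters are arbitrary choices for the demo.

```python
# Griffin-Lim: alternate between the time domain and the target magnitude
# spectrogram until the phase becomes self-consistent.
import numpy as np

def stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=-1)

def istft(spec, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    n = hop * (spec.shape[0] - 1) + n_fft
    out, norm = np.zeros(n), np.zeros(n)
    for t, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=-1)):
        out[t * hop:t * hop + n_fft] += frame * win
        norm[t * hop:t * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-2)  # floor avoids blow-up at tapered edges

def griffin_lim(mag, n_iter=32, n_fft=512, hop=128):
    np.random.seed(0)                    # random initial phase, seeded for repeatability
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)

sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)    # 1 s, 440 Hz sine
recon = griffin_lim(np.abs(stft(tone)))                # waveform from magnitude only
```

Neural vocoders replace this iterative projection with a learned model, which is what yields the quality gap the list above is about.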

### Evaluation

Amphion provides comprehensive objective evaluation of the generated audio. The evaluation metrics include:

- **F0 Modeling**: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
- **Energy Modeling**: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
- **Intelligibility**: Character/Word Error Rate, which can be calculated with [Whisper](https://github.com/openai/whisper) and more.
- **Spectrogram Distortion**: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), etc.
- **Speaker Similarity**: Cosine similarity, which can be calculated with [RawNet3](https://github.com/Jungjee/RawNet), [WeSpeaker](https://github.com/wenet-e2e/wespeaker), and more.
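
A few of these metrics are simple enough to sketch directly. The NumPy snippet below illustrates F0 Pearson correlation, F0 RMSE (both computed over frames that are voiced in both contours), and cosine speaker-embedding similarity; the function names and toy inputs are illustrative, and Amphion's own implementations live in the evaluation recipes.

```python
import numpy as np

def f0_pearson(f0_ref, f0_gen):
    # Pearson correlation over frames where both contours are voiced (F0 > 0).
    voiced = (f0_ref > 0) & (f0_gen > 0)
    return np.corrcoef(f0_ref[voiced], f0_gen[voiced])[0, 1]

def f0_rmse(f0_ref, f0_gen):
    voiced = (f0_ref > 0) & (f0_gen > 0)
    return np.sqrt(np.mean((f0_ref[voiced] - f0_gen[voiced]) ** 2))

def cosine_similarity(emb_a, emb_b):
    # Speaker similarity between two fixed-size embeddings (e.g. from RawNet3).
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

ref = np.array([0.0, 220.0, 221.0, 0.0, 230.0])   # 0.0 marks unvoiced frames
gen = np.array([0.0, 218.0, 223.0, 225.0, 228.0])
print(f0_pearson(ref, gen), f0_rmse(ref, gen))
print(cosine_similarity(np.ones(4), np.ones(4)))   # identical embeddings -> 1.0
```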

### Datasets

Amphion unifies the data preprocessing of open-source datasets including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).

## 📀 Installation

```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion

# Install Python Packages Dependencies
sh env.sh
```

## 🐍 Usage in Python

We detail the usage of each task in the following recipes:

- [Text to Speech (TTS)](egs/tts/README.md)
- [Singing Voice Conversion (SVC)](egs/svc/README.md)
- [Text to Audio (TTA)](egs/tta/README.md)
- [Vocoder](egs/vocoder/README.md)
- [Evaluation](egs/metrics/README.md)

## 🙏 Acknowledgement

- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) and [jaywalnut310's VITS](https://github.com/jaywalnut310/vits) for model architecture code.
- [lifeiteng's VALL-E](https://github.com/lifeiteng/vall-e) for training pipeline and model architecture design.
- [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), [ContentVec](https://github.com/auspicious3000/contentvec), and [RawNet3](https://github.com/Jungjee/RawNet) for pretrained models and inference code.
- [HiFi-GAN](https://github.com/jik876/hifi-gan) for GAN-based Vocoder's architecture design and training strategy.
- [Encodec](https://github.com/facebookresearch/encodec) for well-organized GAN Discriminator's architecture and basic blocks.
- [Latent Diffusion](https://github.com/CompVis/latent-diffusion) for model architecture design.
- [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) for preparing the MFA tools.

## ©️ License

Amphion is under the [MIT License](LICENSE). It is free for both research and commercial use.

## 📚 Citations

Stay tuned, coming soon!