
Commit 9682d0c

Amphion Alpha Release (#2)
* amphion alpha release
1 parent: 9f12af1

File tree

426 files changed (+378683, -50 lines)


.gitignore

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# Mac OS files
.DS_Store

# IDEs
.idea
.vs
.vscode
.cache

# GitHub files
.github

# Byte-compiled / optimized / DLL / cached files
__pycache__/
*.py[cod]
*$py.class
*.pyc
.temp
*.c
*.so
*.o

# Developing mode
_*.sh
_*.json
*.lst
yard*
*.out
evaluation/evalset_selection
mfa
egs/svc/*wavmark
egs/svc/custom
egs/svc/*/dev*
egs/svc/dev_exp_config.json
bins/svc/demo*
data
ckpts

# Data and ckpt
*.pkl
*.pt
*.npy
*.npz
*.tar.gz
*.ckpt
*.wav
*.flac
pretrained/wenet/*conformer_exp

# Runtime data dirs
processed_data
data
model_ckpt
logs
*.ipynb
*.lst
source_audio
result
conversion_results
get_available_gpu.py

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Amphion

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 97 additions & 50 deletions
@@ -1,69 +1,116 @@
-# Amphion
-
-Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.
+# Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit
+
+<div>
+  <a href=""><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg"></a>
+  <a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
+  <a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
+  <a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
+  <a href="egs/vocoder/README.md"><img src="https://img.shields.io/badge/README-Vocoder-purple"></a>
+  <a href="egs/metrics/README.md"><img src="https://img.shields.io/badge/README-Evaluation-yellow"></a>
+  <a href="LICENSE"><img src="https://img.shields.io/badge/LICENSE-MIT-red"></a>
+</div>
+<br>
+
+**Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation.** Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: **visualizations** of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.
+
+**The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio.** Amphion is designed to support individual generation tasks, including but not limited to,
+
+- **TTS**: Text to Speech (⛳ supported)
+- **SVS**: Singing Voice Synthesis (👨‍💻 developing)
+- **VC**: Voice Conversion (👨‍💻 developing)
+- **SVC**: Singing Voice Conversion (⛳ supported)
+- **TTA**: Text to Audio (⛳ supported)
+- **TTM**: Text to Music (👨‍💻 developing)
+- more…

-The North-Star objective of Amphion is to offer a platform for studying the conversion of various inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to,
+In addition to the specific generation tasks, Amphion also includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks.

-- TTS: Text to Speech Synthesis (supported)
-- SVS: Singing Voice Synthesis (planning)
-- VC: Voice Conversion (planning)
-- SVC: Singing Voice Conversion (supported)
-- TTA: Text to Audio (supported)
-- TTM: Text to Music (planning)
-- more…
+## 🚀 News

-In addition to the specific generation tasks, Amphion also includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks.
+- **2023/11/28**: Amphion alpha release

-## Key Features
+## ⭐ Key Features

-### TTS: Text to speech
+### TTS: Text to Speech

-- Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems.
-- It supports the following models or architectures,
-  - **[FastSpeech2](https://arxiv.org/abs/2006.04558)**: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
-  - **[VITS](https://arxiv.org/abs/2106.06103)**: An end-to-end TTS architecture that utilizes conditional variational autoencoder with adversarial learning
-  - **[Vall-E](https://arxiv.org/abs/2301.02111)**: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
-  - **[NaturalSpeech2](https://arxiv.org/abs/2304.09116)**: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
+- Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
+  - [FastSpeech2](https://arxiv.org/abs/2006.04558): A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
+  - [VITS](https://arxiv.org/abs/2106.06103): An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
+  - [Vall-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
+  - [NaturalSpeech2](https://arxiv.org/abs/2304.09116): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

 ### SVC: Singing Voice Conversion

-- It supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec).
-- It implements several state-of-the-art model architectures, including diffusion-based and Transformer-based models. The diffusion-based architecture uses [Bidirectoinal dilated CNN](https://openreview.net/pdf?id=a-xFK8Ymz5J) and [U-Net](https://link.springer.com/chapter/10.1007/978-3-319-24574-4_28) as a backend and supports [DDPM](https://arxiv.org/pdf/2006.11239.pdf), [DDIM](https://arxiv.org/pdf/2010.02502.pdf), and [PNDM](https://arxiv.org/pdf/2202.09778.pdf). Additionally, it supports single-step inference based on the [Consistency Model](https://openreview.net/pdf?id=FmqFfMTNnv).
+- Amphion supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). Their specific roles in SVC have been investigated in our NeurIPS 2023 workshop paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.11160) [![code](https://img.shields.io/badge/README-Code-red)](egs/svc/MultipleContentsSVC)
+- Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE-, and flow-based models. The diffusion-based architecture uses a [Bidirectional dilated CNN](https://openreview.net/pdf?id=a-xFK8Ymz5J) as a backend and supports several sampling algorithms, such as [DDPM](https://arxiv.org/pdf/2006.11239.pdf), [DDIM](https://arxiv.org/pdf/2010.02502.pdf), and [PNDM](https://arxiv.org/pdf/2202.09778.pdf). Additionally, it supports single-step inference based on the [Consistency Model](https://openreview.net/pdf?id=FmqFfMTNnv), as sketched below.
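
The diffusion bullet above names several samplers (DDPM, DDIM, PNDM) plus single-step consistency inference. For readers new to these, here is a minimal sketch of DDIM's deterministic update rule in plain NumPy; the noise predictor is a dummy stand-in for a trained diffusion acoustic model, and none of the names below are Amphion's API:

```python
import numpy as np

# Standard linear beta schedule; alphas_cumprod[t] plays the role of alpha-bar_t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    """Dummy noise predictor; a trained diffusion model goes here."""
    return np.zeros_like(x_t)

def ddim_sample(shape, num_steps=50):
    """Deterministic DDIM sampling over a coarse subset of the T training steps."""
    x = np.random.randn(*shape)
    ts = np.linspace(T - 1, 0, num_steps).astype(int)
    for i, t in enumerate(ts):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[ts[i + 1]] if i + 1 < len(ts) else 1.0
        eps = eps_theta(x, t)
        # Predict the clean sample, then jump directly to the previous timestep.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

mel = ddim_sample((80, 128))  # e.g., a mel-spectrogram-shaped sample
```

Fewer sampling steps trade quality for speed; the Consistency Model mentioned above pushes this trade-off to a single step.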

 ### TTA: Text to Audio

-- **Supply TTA with latent diffusion model**, including:
-  - **[AudioLDM](https://arxiv.org/abs/2301.12503)**: a two stage model with an autoencoder and a latent diffusion model
+- Amphion supports TTA with a latent diffusion model. It is designed like [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830). It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2304.00830) [![code](https://img.shields.io/badge/README-Code-red)](egs/tta/RECIPE.md)

 ### Vocoder

-- Amphion supports both classic and state-of-the-art neural vocoders, including
-  - GAN-based vocoders: **[MelGAN](https://arxiv.org/abs/1910.06711)**, **[HiFi-GAN](https://arxiv.org/abs/2010.05646)**, **[NSF-HiFiGAN](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts)**, **[BigVGAN](https://arxiv.org/abs/2206.04658)**, **[APNet](https://arxiv.org/abs/2305.07952)**
-  - Flow-based vocoders: **[WaveGlow](https://arxiv.org/abs/1811.00002)**
-  - Diffusion-based vocoders: **[Diffwave](https://arxiv.org/abs/2009.09761)**
-  - Auto-regressive based vocoders: **[WaveNet](https://arxiv.org/abs/1609.03499)**, **[WaveRNN](https://arxiv.org/abs/1802.08435v1)**
+- Amphion supports various widely-used neural vocoders, including:
+  - GAN-based vocoders: [MelGAN](https://arxiv.org/abs/1910.06711), [HiFi-GAN](https://arxiv.org/abs/2010.05646), [NSF-HiFiGAN](https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts), [BigVGAN](https://arxiv.org/abs/2206.04658), [APNet](https://arxiv.org/abs/2305.07952).
+  - Flow-based vocoders: [WaveGlow](https://arxiv.org/abs/1811.00002).
+  - Diffusion-based vocoders: [Diffwave](https://arxiv.org/abs/2009.09761).
+  - Auto-regressive based vocoders: [WaveNet](https://arxiv.org/abs/1609.03499), [WaveRNN](https://arxiv.org/abs/1802.08435v1).
+- Amphion provides the official implementation of the [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957). It can be used to enhance any GAN-based vocoder during training while keeping the inference stage (e.g., memory usage or speed) unchanged. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2311.14957) [![code](https://img.shields.io/badge/README-Code-red)](egs/vocoder/gan/tfr_enhanced_hifigan)

 ### Evaluation

-We supply a comprehensive objective evaluation for the generated audios. The evaluation metrics contain:
-
-- **F0 Modeling**
-  - F0 Pearson Coefficients
-  - F0 Periodicity Root Mean Square Error
-  - F0 Root Mean Square Error
-  - Voiced/Unvoiced F1 Score
-- **Energy Modeling**
-  - Energy Pearson Coefficients
-  - Energy Root Mean Square Error
-- **Intelligibility**
-  - Character/Word Error Rate based [Whisper](https://github.com/openai/whisper)
-- **Spectrogram Distortion**
-  - Frechet Audio Distance (FAD)
-  - Mel Cepstral Distortion (MCD)
-  - Multi-Resolution STFT Distance (MSTFT)
-  - Perceptual Evaluation of Speech Quality (PESQ)
-  - Short Time Objective Intelligibility (STOI)
-  - Signal to Noise Ratio (SNR)
-- **Speaker Similarity**
-  - Cosine similarity based [RawNet3](https://github.com/Jungjee/RawNet)
+Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics contain:
+
+- **F0 Modeling**: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc. (See the sketch after this list.)
+- **Energy Modeling**: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
+- **Intelligibility**: Character/Word Error Rate, which can be calculated based on [Whisper](https://github.com/openai/whisper) and more.
+- **Spectrogram Distortion**: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
+- **Speaker Similarity**: Cosine similarity, which can be calculated based on [RawNet3](https://github.com/Jungjee/RawNet), [WeSpeaker](https://github.com/wenet-e2e/wespeaker), and more.
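
To make the first of these concrete: the F0 Pearson Coefficient (FPC) is the Pearson correlation between aligned reference and generated F0 contours, commonly computed over frames that are voiced in both. A minimal NumPy sketch follows; it is illustrative only (Amphion's actual implementation lives in evaluation/metrics/f0/f0_pearson_coefficients.py, and the contours would come from a pitch tracker):

```python
import numpy as np

def f0_pearson_coefficient(f0_ref, f0_deg):
    """Pearson correlation of two aligned F0 contours (Hz).

    Frames with F0 == 0 are treated as unvoiced and excluded.
    """
    f0_ref, f0_deg = np.asarray(f0_ref), np.asarray(f0_deg)
    voiced = (f0_ref > 0) & (f0_deg > 0)
    return np.corrcoef(f0_ref[voiced], f0_deg[voiced])[0, 1]

# Toy contours (Hz); 0 marks unvoiced frames.
f0_a = np.array([0.0, 220.0, 230.0, 225.0, 0.0])
f0_b = np.array([0.0, 218.0, 233.0, 221.0, 0.0])
print(f0_pearson_coefficient(f0_a, f0_b))  # close to 1.0 for similar contours
```
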
+### Datasets
+
+Amphion unifies the data preprocessing of the open-source datasets including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).
+
+## 📀 Installation
+
+```bash
+git clone https://github.com/open-mmlab/Amphion.git
+cd Amphion
+
+# Install Python Environment
+conda create --name amphion python=3.9.15
+conda activate amphion
+
+# Install Python Package Dependencies
+sh env.sh
+```
+
+## 🐍 Usage in Python
+
+We detail the instructions for different tasks in the following recipes:
+
+- [Text to Speech (TTS)](egs/tts/README.md)
+- [Singing Voice Conversion (SVC)](egs/svc/README.md)
+- [Text to Audio (TTA)](egs/tta/README.md)
+- [Vocoder](egs/vocoder/README.md)
+- [Evaluation](egs/metrics/README.md)
+
+## 🙏 Acknowledgement
+
+- [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2) and [jaywalnut310's VITS](https://github.com/jaywalnut310/vits) for model architecture code.
+- [lifeiteng's VALL-E](https://github.com/lifeiteng/vall-e) for training pipeline and model architecture design.
+- [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), [ContentVec](https://github.com/auspicious3000/contentvec), and [RawNet3](https://github.com/Jungjee/RawNet) for pretrained models and inference code.
+- [HiFi-GAN](https://github.com/jik876/hifi-gan) for GAN-based Vocoder's architecture design and training strategy.
+- [Encodec](https://github.com/facebookresearch/encodec) for well-organized GAN Discriminator's architecture and basic blocks.
+- [Latent Diffusion](https://github.com/CompVis/latent-diffusion) for model architecture design.
+- [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) for preparing the MFA tools.
+
+## ©️ License
+
+Amphion is under the [MIT License](LICENSE). It is free for both research and commercial use cases.
+
+## 📚 Citations

+Stay tuned, coming soon!

bins/calc_metrics.py

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import os
import numpy as np
import json
import argparse

from glob import glob
from tqdm import tqdm
from collections import defaultdict

from evaluation.metrics.energy.energy_rmse import extract_energy_rmse
from evaluation.metrics.energy.energy_pearson_coefficients import (
    extract_energy_pearson_coeffcients,
)
from evaluation.metrics.f0.f0_pearson_coefficients import extract_fpc
from evaluation.metrics.f0.f0_periodicity_rmse import extract_f0_periodicity_rmse
from evaluation.metrics.f0.f0_rmse import extract_f0rmse
from evaluation.metrics.f0.v_uv_f1 import extract_f1_v_uv
from evaluation.metrics.intelligibility.character_error_rate import extract_cer
from evaluation.metrics.intelligibility.word_error_rate import extract_wer
from evaluation.metrics.similarity.speaker_similarity import extract_speaker_similarity
from evaluation.metrics.spectrogram.frechet_distance import extract_fad
from evaluation.metrics.spectrogram.mel_cepstral_distortion import extract_mcd
from evaluation.metrics.spectrogram.multi_resolution_stft_distance import extract_mstft
from evaluation.metrics.spectrogram.pesq import extract_pesq
from evaluation.metrics.spectrogram.scale_invariant_signal_to_distortion_ratio import (
    extract_si_sdr,
)
from evaluation.metrics.spectrogram.scale_invariant_signal_to_noise_ratio import (
    extract_si_snr,
)
from evaluation.metrics.spectrogram.short_time_objective_intelligibility import (
    extract_stoi,
)

# Mapping from metric name (as passed on the command line) to its extractor.
METRIC_FUNC = {
    "energy_rmse": extract_energy_rmse,
    "energy_pc": extract_energy_pearson_coeffcients,
    "fpc": extract_fpc,
    "f0_periodicity_rmse": extract_f0_periodicity_rmse,
    "f0rmse": extract_f0rmse,
    "v_uv_f1": extract_f1_v_uv,
    "cer": extract_cer,
    "wer": extract_wer,
    "speaker_similarity": extract_speaker_similarity,
    "fad": extract_fad,
    "mcd": extract_mcd,
    "mstft": extract_mstft,
    "pesq": extract_pesq,
    "si_sdr": extract_si_sdr,
    "si_snr": extract_si_snr,
    "stoi": extract_stoi,
}


def calc_metric(ref_dir, deg_dir, dump_dir, metrics, fs=None):
    result = defaultdict()

    for metric in tqdm(metrics):
        # FAD and speaker similarity operate on whole folders, not file pairs.
        if metric in ["fad", "speaker_similarity"]:
            result[metric] = str(METRIC_FUNC[metric](ref_dir, deg_dir))
            continue

        # Pair reference and degraded files by their shared uid (the filename stem).
        audios_ref = []
        audios_deg = []

        files = glob(ref_dir + "/*.wav")

        for file in files:
            audios_ref.append(file)
            uid = file.split("/")[-1].split(".wav")[0]
            file_gt = deg_dir + "/{}.wav".format(uid)
            audios_deg.append(file_gt)

        if metric in ["v_uv_f1"]:
            # Accumulate TP/FP/FN over all files, then compute the
            # micro-averaged F1 score: TP / (TP + (FP + FN) / 2).
            tp_total = 0
            fp_total = 0
            fn_total = 0

            for i in tqdm(range(len(audios_ref))):
                audio_ref = audios_ref[i]
                audio_deg = audios_deg[i]
                tp, fp, fn = METRIC_FUNC[metric](audio_ref, audio_deg, fs)
                tp_total += tp
                fp_total += fp
                fn_total += fn

            result[metric] = str(tp_total / (tp_total + (fp_total + fn_total) / 2))
        else:
            # Average the per-file scores, skipping NaNs.
            scores = []

            for i in tqdm(range(len(audios_ref))):
                audio_ref = audios_ref[i]
                audio_deg = audios_deg[i]

                score = METRIC_FUNC[metric](
                    audio_ref=audio_ref, audio_deg=audio_deg, fs=fs
                )
                if not np.isnan(score):
                    scores.append(score)

            scores = np.array(scores)
            result["{}_mean".format(metric)] = str(np.mean(scores))
            result["{}_std".format(metric)] = str(np.std(scores))

    data = json.dumps(result, indent=4)

    with open(os.path.join(dump_dir, "result.json"), "w", newline="\n") as f:
        f.write(data)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ref_dir",
        type=str,
        help="Path to the reference (ground-truth) audio folder.",
    )
    parser.add_argument(
        "--deg_dir",
        type=str,
        help="Path to the degraded (generated) audio folder.",
    )
    parser.add_argument(
        "--dump_dir",
        type=str,
        help="Path to dump the results.",
    )
    parser.add_argument(
        "--metrics",
        nargs="+",
        help="Metrics used to evaluate.",
    )
    args = parser.parse_args()

    calc_metric(args.ref_dir, args.deg_dir, args.dump_dir, args.metrics)
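
For reference, the same entry point can also be driven from Python. The snippet below is a usage sketch only: the directory paths and the 24 kHz sampling rate are placeholders, the metric keys must be keys of METRIC_FUNC above, and the repository root is assumed to be on PYTHONPATH:

```python
from bins.calc_metrics import calc_metric

# Compare generated wavs against references on a few pairwise metrics.
calc_metric(
    ref_dir="data/ref_wavs",    # reference .wav files
    deg_dir="data/synth_wavs",  # generated .wav files with matching filenames
    dump_dir="results",         # result.json is written here
    metrics=["fpc", "f0rmse", "mcd", "pesq"],
    fs=24000,
)
```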
