Voice conversion (VC) aims to transform a source speaker's voice to sound like a target speaker, while preserving the original speech content. Traditional methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these factors remains challenging, often leading to information loss during training. In this paper, we propose a new approach that leverages synthetic data generated by a high-quality, pretrained multi-speaker text-to-speech (TTS) model. Specifically, we create synthetic data pairs that share the same linguistic content but differ in speaker identity, using them as input-output pairs for training the voice conversion model. Additionally, we introduce a flexible training strategy for any-to-any voice conversion. This method generalizes well to unseen speakers, improving both adaptability and performance in zero-shot scenarios.
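
The pairing step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `synthesize` callable stands in for whatever pretrained multi-speaker TTS model is used, and the function name `make_parallel_pairs`, its parameters, and the returned dictionary layout are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical TTS interface (an assumption, not a real library API):
# takes (text, speaker_id) and returns a waveform as a list of samples.
Synthesize = Callable[[str, str], List[float]]


def make_parallel_pairs(
    texts: Sequence[str],
    speakers: Sequence[str],
    synthesize: Synthesize,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Build (source, target) training pairs that share text but not speaker."""
    if len(set(speakers)) < 2:
        raise ValueError("need at least two distinct speakers")
    rng = random.Random(seed)
    unique_speakers = sorted(set(speakers))
    pairs = []
    for text in texts:
        # Draw two different speakers for the same sentence.
        src_spk, tgt_spk = rng.sample(unique_speakers, 2)
        pairs.append(
            {
                "text": text,
                "source_speaker": src_spk,
                "target_speaker": tgt_spk,
                # Same linguistic content, different speaker identity.
                "source_wav": synthesize(text, src_spk),
                "target_wav": synthesize(text, tgt_spk),
            }
        )
    return pairs
```

Under these assumptions, each returned item would serve as one input-output example: the conversion model receives `source_wav` (plus some reference for the target speaker) and is trained to reproduce `target_wav`.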

| Source | Target | OOVC | FreeVC | KNNVC | Diff_HierVC | FACODEC | DDDMVC |
|---|---|---|---|---|---|---|---|
| 1320-122612-0000 | 7729-102255-0045 | | | | | | |
| 8230-279154-0028 | 4970-29095-0008 | | | | | | |
| 5105-28240-0006 | 3570-5694-0012 | | | | | | |
| 4992-41806-0002 | 1580-141083-0001 | | | | | | |
| 1221-135766-0000 | 4992-41797-0014 | | | | | | |
| 1284-1181-0009 | 5639-40744-0013 | | | | | | |
| 121-127105-0018 | 237-126133-0000 | | | | | | |
| 3729-6852-0037 | 5639-40744-0032 | | | | | | |
| 1580-141083-0000 | 8230-279154-0028 | | | | | | |
| 6829-68771-0023 | 5142-36377-0018 | | | | | | |

| Source | Target | OOVC | w/o F0 Encoder | w/o Finetuning | FreeVC |
|---|---|---|---|---|---|
| 3570-5695-0003 | 260-123286-0030 | | | | |
| 4970-29095-0014 | 5142-36377-0006 | | | | |
| 260-123288-0006 | 2961-960-0010 | | | | |
| 260-123440-0006 | 5105-28240-0017 | | | | |
| 1320-122612-0015 | 8555-284447-0017 | | | | |
| 2961-961-0008 | 1188-133604-0014 | | | | |
| 4077-13754-0015 | 4507-16021-0024 | | | | |
| 8455-210777-0019 | 4446-2275-0032 | | | | |
| 8230-279154-0015 | 3729-6852-0036 | | | | |
| 237-134500-0005 | 1089-134691-0025 | | | | |

| Source | Target | ZH | ZH_finetune_stage2 |
|---|---|---|---|
| SSB10240266.wav | SSB18370265.wav | | |
| SSB01490374.wav | SSB07780296.wav | | |
| SSB03820178.wav | SSB03940193.wav | | |
| SSB00430464.wav | SSB04150255.wav | | |
| SSB05350097.wav | SSB16240048.wav | | |

| Source | Target | IT | IT_finetune_stage2 |
|---|---|---|---|
| 2019_1577_000361 | 4649_3829_001237 | | |
| 10446_10415_000189 | 6348_5862_000113 | | |
| 1595_3627_000103 | 4974_4125_000245 | | |
| 1157_529_000032 | 9772_10624_000400 | | |
| 2019_1577_000459 | 8828_8610_000316 | | |

| Source | Target | VI | VI_finetune_stage2 |
|---|---|---|---|
| 12_11389.wav | 33_16873.wav | | |
| 39_29414.wav | 46_33203.wav | | |
| 76_59715.wav | 85_53833.wav | | |
| 81_47896.wav | VIVOSSPK06_T033.wav | | |
| 89_22849.wav | VIVOSSPK36_112.wav | | |