O_O VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion

Voice conversion (VC) aims to transform a source speaker's voice to sound like a target speaker, while preserving the original speech content. Traditional methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these factors remains challenging, often leading to information loss during training. In this paper, we propose a new approach that leverages synthetic data generated by a high-quality, pretrained multi-speaker text-to-speech (TTS) model. Specifically, we create synthetic data pairs that share the same linguistic content but differ in speaker identity, using them as input-output pairs for training the voice conversion model. Additionally, we introduce a flexible training strategy for any-to-any voice conversion. This method generalizes well to unseen speakers, improving both adaptability and performance in zero-shot scenarios.
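The pairing idea described above can be illustrated with a minimal sketch. It assumes a hypothetical TTS interface, `synthesize(text, speaker_id)`, that returns a waveform; the actual TTS model, the speaker set, and the way pairs are selected are not specified on this page. The names `ParallelPair`, `build_parallel_pairs`, and `fake_tts` are illustrative only. The sketch enumerates every ordered speaker pair per sentence, which is one possible choice and not necessarily the one used in the paper.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List, Sequence

# Hypothetical TTS interface (an assumption): (text, speaker_id) -> waveform samples.
Synthesizer = Callable[[str, str], List[float]]


@dataclass
class ParallelPair:
    """One training example: same text, two different speakers."""
    text: str
    source_speaker: str
    target_speaker: str
    source_audio: List[float]
    target_audio: List[float]


def build_parallel_pairs(
    texts: Sequence[str],
    speakers: Sequence[str],
    synthesize: Synthesizer,
) -> List[ParallelPair]:
    """Build (source, target) pairs that share content but differ in speaker."""
    pairs: List[ParallelPair] = []
    for text in texts:
        # Synthesize each speaker once per sentence and reuse the audio.
        audio = {spk: synthesize(text, spk) for spk in speakers}
        # Every ordered pair of distinct speakers yields one input-output pair.
        for src, tgt in permutations(audio, 2):
            pairs.append(ParallelPair(text, src, tgt, audio[src], audio[tgt]))
    return pairs


def fake_tts(text: str, speaker_id: str) -> List[float]:
    # Stand-in for a real multi-speaker TTS model; returns dummy samples.
    return [float(len(text)), float(len(speaker_id))]


pairs = build_parallel_pairs(
    ["hello world", "voice conversion"],
    ["spk_a", "spk_b", "spk_c"],
    fake_tts,
)
print(len(pairs))  # 2 sentences x 6 ordered speaker pairs = 12
```

In a real pipeline, `fake_tts` would be replaced by a call to a pretrained multi-speaker TTS model, and the resulting pairs would serve as input-output examples for the conversion model.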

0. Sample of Synthetic Data Pairs Used for Training

Source Target

1. Any-to-Any Voice Conversion

Audio columns: Source, Target, OOVC, FreeVC, KNNVC, Diff_HierVC, FACODEC, DDDMVC

Source utterance    Target utterance
1320-122612-0000    7729-102255-0045
8230-279154-0028    4970-29095-0008
5105-28240-0006     3570-5694-0012
4992-41806-0002     1580-141083-0001
1221-135766-0000    4992-41797-0014
1284-1181-0009      5639-40744-0013
121-127105-0018     237-126133-0000
3729-6852-0037      5639-40744-0032
1580-141083-0000    8230-279154-0028
6829-68771-0023     5142-36377-0018

2. Ablation Study

Audio columns: Source, Target, OOVC, w/o F0 Encoder, w/o Finetuning, FreeVC

Source utterance    Target utterance
3570-5695-0003      260-123286-0030
4970-29095-0014     5142-36377-0006
260-123288-0006     2961-960-0010
260-123440-0006     5105-28240-0017
1320-122612-0015    8555-284447-0017
2961-961-0008       1188-133604-0014
4077-13754-0015     4507-16021-0024
8455-210777-0019    4446-2275-0032
8230-279154-0015    3729-6852-0036
237-134500-0005     1089-134691-0025

3. Adapt to New Language

ZH

Audio columns: Source, Target, ZH, ZH_finetune_stage2

Source utterance    Target utterance
SSB10240266.wav     SSB18370265.wav
SSB01490374.wav     SSB07780296.wav
SSB03820178.wav     SSB03940193.wav
SSB00430464.wav     SSB04150255.wav
SSB05350097.wav     SSB16240048.wav

IT

Audio columns: Source, Target, IT, IT_finetune_stage2

Source utterance      Target utterance
2019_1577_000361      4649_3829_001237
10446_10415_000189    6348_5862_000113
1595_3627_000103      4974_4125_000245
1157_529_000032       9772_10624_000400
2019_1577_000459      8828_8610_000316

VI

Audio columns: Source, Target, VI, VI_finetune_stage2

Source utterance    Target utterance
12_11389.wav        33_16873.wav
39_29414.wav        46_33203.wav
76_59715.wav        85_53833.wav
81_47896.wav        VIVOSSPK06_T033.wav
89_22849.wav        VIVOSSPK36_112.wav