Abstract

In this paper, we propose a vector quantization (VQ) based one-shot voice conversion (VC) approach that requires no supervision from speaker labels. We model the content embedding as a sequence of discrete codes and take the difference between the vectors before and after quantization as the speaker embedding. We show that this approach disentangles content and speaker information well using only a reconstruction loss, which makes one-shot VC possible.
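
The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`nearest_code`, `split_content_speaker`, `mix`), the use of squared Euclidean distance for the codebook lookup, the averaging of the residual over time, and the additive recombination of content and speaker vectors are all assumptions made for illustration. The encoder, decoder, and training loop are omitted.

```python
from typing import List, Tuple

Vector = List[float]


def nearest_code(v: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry closest to v.

    Assumption: closeness is squared Euclidean distance.
    """
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])),
    )


def split_content_speaker(
    frames: List[Vector], codebook: List[Vector]
) -> Tuple[List[int], List[Vector], Vector]:
    """Split per-frame encoder outputs into content codes and a speaker vector.

    frames: one vector per time step, taken before quantization.
    Returns (code indices, quantized frames, speaker embedding).
    """
    indices = [nearest_code(v, codebook) for v in frames]
    quantized = [codebook[k] for k in indices]  # content: discrete codes
    dim = len(frames[0])
    # Speaker embedding: difference between the vectors before and after
    # quantization. Assumption: averaged over time to get one vector
    # per utterance.
    speaker = [
        sum(v[d] - q[d] for v, q in zip(frames, quantized)) / len(frames)
        for d in range(dim)
    ]
    return indices, quantized, speaker


def mix(content: List[Vector], speaker: Vector) -> List[Vector]:
    """Hypothetical decoder input: source content plus target speaker vector."""
    return [[c + s for c, s in zip(q, speaker)] for q in content]
```

Under these assumptions, one-shot conversion would call `split_content_speaker` on the source utterance for its quantized content and on a single target utterance for its speaker vector, then pass `mix(source_content, target_speaker)` to a decoder.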

Demo 1 F2M

Source	Target	Converted

Demo 2 M2F

Source	Target	Converted

Demo 3 M2M

Source	Target	Converted

Demo 4 F2F

Source	Target	Converted