Contrastive Predictive Coding supported Factorized Variational Autoencoder for Unsupervised Learning of Disentangled Speech Representations (link to follow)


Zero-shot voice conversion is presented here, i.e., conversion in which neither the source nor the target speaker was seen during training. A WaveNet vocoder maps log-mel spectrograms to waveforms; note, however, that it was trained only on clean log-mel spectrograms. The source and target signals in the first and second columns, respectively, are the signals synthesized by the WaveNet from clean log-mel spectrograms and can be understood as a topline for the audio quality. Voice conversions from the following fully unsupervised disentanglement methods are presented:
Source Signal | Target Voice | CPC supported FVAE | AdaIN | AutoVC