Abstract

This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing signal, without reference to the phoneme sequence underlying the speech. This is achieved by treating speech-to-singing conversion as a style transfer problem. Specifically, given a speech input, and optionally the F0 contour of the target singing, the proposed model generates a singing signal using a progressive-growing encoder/decoder architecture and boundary equilibrium GAN (BEGAN) loss functions. Our quantitative and qualitative analyses show that the proposed model generates singing voices with much higher naturalness than an existing baseline that is not adversarially trained. For reproducibility, the code will be made publicly available in a GitHub repository upon publication of the paper.
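
For readers unfamiliar with the boundary equilibrium GAN objective mentioned above, the following is a minimal sketch of one balance update in the standard BEGAN formulation. It is not the authors' implementation: the function name `began_step`, its arguments, and the default values of `gamma` and `lambda_k` are illustrative assumptions, and the reconstruction losses are assumed to be already reduced to scalars.

```python
def began_step(recon_real, recon_fake, k, gamma=0.5, lambda_k=0.001):
    """One BEGAN balance update (hypothetical helper, not the paper's code).

    recon_real: autoencoder-discriminator reconstruction loss on real singing
    recon_fake: the same loss on generated singing
    k:          current balance coefficient, kept in [0, 1]
    gamma:      target ratio recon_fake / recon_real (assumed default)
    lambda_k:   step size for updating k (assumed default)
    """
    # The discriminator reconstructs real data well and generated data poorly.
    loss_d = recon_real - k * recon_fake
    # The generator tries to make its outputs easy to reconstruct.
    loss_g = recon_fake
    # Proportional control keeps the two reconstruction losses in balance.
    k_next = k + lambda_k * (gamma * recon_real - recon_fake)
    k_next = min(max(k_next, 0.0), 1.0)
    # Global convergence measure from the BEGAN formulation.
    convergence = recon_real + abs(gamma * recon_real - recon_fake)
    return loss_d, loss_g, k_next, convergence


if __name__ == "__main__":
    print(began_step(recon_real=1.0, recon_fake=0.25, k=0.0))
```

In a real training loop the two reconstruction losses would come from the discriminator on spectrogram batches, and `k_next` would be carried over to the following step.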




Audio Samples

Sample A

Input speech:            
Jayneel:    
Progressive Learning:    
MEL+MSE:    
MEL+MSE+AD:    

Sample B

Input speech:            
Jayneel:    
Progressive Learning:    

Sample C

Input speech:            
MEL+MSE:    
MEL+AD:    

Sample D

Input speech:            
MEL+MSE:    
MEL+AD:    

Our MELGAN on DAMP

Origin:            
MELGAN:    
Origin:            
MELGAN: