Audio samples for "SELF-SUPERVISED REPRESENTATIONS FOR SINGING VOICE SYNTHESIS"

Abstract

A singing voice conversion model converts a song in the voice of an arbitrary source singer to the voice of a target singer. Recently, methods that leverage self-supervised audio representations such as HuBERT and Wav2Vec 2.0 have helped further the state of the art. Though these methods produce more natural and melodic singing outputs, they often rely on confusion and disentanglement losses to render the self-supervised representations speaker- and pitch-invariant. In this paper, we circumvent disentanglement training and propose a new model that leverages ASR fine-tuned self-supervised representations as inputs to a HiFi-GAN neural vocoder for singing voice conversion. We experiment with different f0 encoding schemes and show that an f0 harmonic generation module that uses a parallel bank of transposed convolutions (PBTC) alongside ASR fine-tuned Wav2Vec 2.0 features results in the best singing voice conversion quality. Additionally, the model is capable of making a spoken voice sing. We also show that a simple f0 shifting scheme during inference helps retain singer identity and bolsters the performance of our singing voice conversion model. Our results are backed up by extensive MOS studies that compare different ablations and baselines.

Section A

Samples from MOS Study 1. Raters were asked to judge the audio quality across different models. Each model uses HuBERT as the self-supervised feature, but varies the training procedure (e.g., using only singing data vs. singing + spoken speech data) or varies the f0 feature encoder (e.g., PBTC vs. Q-LUT).

Comparing different models with HuBERT self-supervised features

hubert-club

hubert-sing-f0-embed

hubert-pbtc

hubert-f0-embed

hubert-f0-embed-f0-shift

hubert-pbtc-f0-shift

ground-truth

Section B

Samples from MOS Study 2. Raters were asked to judge the speaker/singer similarity between the synthesized audio and the reference audio.

Target singer/speaker similarity with HuBERT self-supervised features

hubert-club

hubert-f0-embed

hubert-f0-embed-f0-shift

hubert-pbtc-f0-shift

reference

Section C

Samples from MOS Study 3. Raters were asked to judge the audio quality across different models. Each model varies the self-supervised feature (HuBERT vs. Wav2Vec 2.0 vs. Wav2Vec 2.0-ASR), and also varies the f0 feature encoder (e.g., PBTC vs. Q-LUT).

Comparing different models by varying self-supervised and f0 features

w2v2-asr-pbtc-f0-shift

w2v2-pbtc-f0-shift

w2v2-f0-embed-f0-shift

w2v2-asr-f0-embed-f0-shift

hubert-f0-embed-f0-shift

hubert-pbtc-f0-shift

ground-truth

Section D

Samples from MOS Study 4. Raters were asked to judge the speaker/singer similarity between the synthesized audio and the reference audio with and without f0 shifting during inference.
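
The page does not spell out how the f0 shift is computed, so the following is only a minimal sketch of one common approach, not necessarily the scheme used for these samples. It assumes the source f0 contour is scaled so that its mean log-f0 over voiced frames matches a target singer's mean log-f0, and that unvoiced frames are marked with 0. The function name `shift_f0` and the parameter `target_mean_log_f0` are hypothetical.

```python
import math


def shift_f0(source_f0, target_mean_log_f0):
    """Scale voiced f0 values (Hz) so their mean log-f0 matches the target's.

    Assumption: unvoiced frames are encoded as 0 and are left untouched.
    This is an illustrative scheme, not the authors' confirmed method.
    """
    voiced = [f for f in source_f0 if f > 0]
    if not voiced:
        # Nothing to shift if the contour has no voiced frames.
        return list(source_f0)

    # Mean of log-f0 over voiced frames of the source contour.
    source_mean_log_f0 = sum(math.log(f) for f in voiced) / len(voiced)

    # A constant offset in the log domain is a constant ratio in Hz,
    # so the melodic intervals of the source are preserved.
    ratio = math.exp(target_mean_log_f0 - source_mean_log_f0)
    return [f * ratio if f > 0 else 0.0 for f in source_f0]


# Example: move a low contour toward a target whose mean log-f0 is log(220 Hz).
shifted = shift_f0([0.0, 110.0, 220.0, 0.0], math.log(220.0))
```

Because the shift is a single multiplicative factor, the octave between the two voiced frames in the example is kept while the overall register moves toward the target.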

Comparing different models with HuBERT self-supervised features