Audio samples for "SELF-SUPERVISED REPRESENTATIONS FOR SINGING VOICE CONVERSION"
Abstract
A singing voice conversion model converts a song in the voice of an arbitrary source singer to the voice of a
target singer. Recently, methods that leverage self-supervised audio representations such as HuBERT and Wav2Vec 2.0
have helped further the state-of-the-art. Though these methods produce more natural and melodic singing outputs,
they often rely on confusion and disentanglement losses to render the self-supervised representations speaker- and
pitch-invariant. In this paper, we circumvent disentanglement training and propose a new model that leverages ASR
fine-tuned self-supervised representations as inputs to a HiFi-GAN neural vocoder for singing voice conversion. We
experiment with different f0 encoding schemes and show that an f0 harmonic generation module that uses a parallel
bank of transposed convolutions (PBTC) alongside ASR fine-tuned Wav2Vec 2.0 features results in the best singing voice
conversion quality. Additionally, the model is capable of making a spoken voice sing. We also show that a simple f0
shifting scheme during inference helps retain singer identity and bolsters the performance of our singing voice conversion
model. Our results are backed up by extensive MOS studies that compare different ablations and baselines.
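The abstract does not spell out the f0 shifting scheme, so the following is only a minimal sketch of one plausible reading: the source f0 contour is scaled so that its mean log-f0 matches the target singer's. The function names `mean_log_f0` and `shift_f0`, the use of 0.0 to mark unvoiced frames, and the log-domain mean matching are all assumptions made for illustration, not the authors' implementation.

```python
import math


def mean_log_f0(f0_hz):
    """Mean of log-f0 over voiced frames (assumption: 0.0 marks unvoiced)."""
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        raise ValueError("contour has no voiced frames")
    return sum(math.log(f) for f in voiced) / len(voiced)


def shift_f0(source_f0_hz, target_mean_log_f0):
    """Scale voiced source f0 values so their mean log-f0 matches the target.

    A constant offset in the log domain is a constant ratio in Hz, so the
    musical intervals of the source melody are preserved.
    """
    ratio = math.exp(target_mean_log_f0 - mean_log_f0(source_f0_hz))
    # Unvoiced frames stay at 0.0; voiced frames are multiplied by the ratio.
    return [f * ratio if f > 0.0 else 0.0 for f in source_f0_hz]


# Hypothetical contours in Hz; the target sits one octave above the source.
source = [220.0, 0.0, 440.0]
target_stats = mean_log_f0([440.0, 880.0])
shifted = shift_f0(source, target_stats)
```

Under these assumptions, only a single statistic of the target singer (the mean log-f0) is needed at inference time, and the melody is transposed rather than reshaped.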
Section A
Samples from MOS Study 1. Raters were asked to judge the audio quality
across different models. Each model uses HuBERT as the self-supervised
feature, but varies the training procedure (e.g., using only singing data
vs. singing + spoken speech data) or varies the f0 feature encoder (e.g.,
PBTC vs. Q-LUT).
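As a rough illustration of the lookup-style f0 encoder, the sketch below assumes that Q-LUT denotes a quantized lookup table: each f0 value is mapped to a bin index, and the index selects an embedding vector. The bin count, frequency range, log-spaced binning, embedding size, and random table are hypothetical choices; in a trained model the table would be learned.

```python
import math
import random


def quantize_f0(f0_hz, num_bins=256, f_min=50.0, f_max=1100.0):
    """Map an f0 value in Hz to a bin index; bin 0 is reserved for unvoiced."""
    if f0_hz <= 0.0:
        return 0
    clipped = min(max(f0_hz, f_min), f_max)
    # Position in [0, 1] along a log-frequency axis (an assumed design choice).
    position = (math.log(clipped) - math.log(f_min)) / (
        math.log(f_max) - math.log(f_min)
    )
    # Voiced frames use bins 1 .. num_bins - 1.
    return 1 + min(int(position * (num_bins - 1)), num_bins - 2)


def make_table(num_bins=256, dim=8, seed=0):
    """Random stand-in for a learned embedding table of shape (num_bins, dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bins)]


def embed_f0(f0_contour_hz, table):
    """Look up one embedding vector per frame of the f0 contour."""
    return [table[quantize_f0(f, num_bins=len(table))] for f in f0_contour_hz]


table = make_table()
frames = embed_f0([0.0, 220.0, 440.0], table)
```

The PBTC encoder named above takes a different route (a parallel bank of transposed convolutions) and is not sketched here.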
Comparing different models with HuBERT self-supervised features
hubert-club
hubert-sing-f0-embed
hubert-pbtc
hubert-f0-embed
hubert-f0-embed-f0-shift
hubert-pbtc-f0-shift
ground-truth
Section B
Samples from MOS Study 2. Raters were asked to judge the speaker/singer
similarity between the synthesized audio and the reference audio.
Target singer/speaker similarity with HuBERT self-supervised features
hubert-club
hubert-f0-embed
hubert-f0-embed-f0-shift
hubert-pbtc-f0-shift
reference
Section C
Samples from MOS Study 3. Raters were asked to judge the audio quality
across different models. Each model varies the self-supervised
feature (HuBERT vs. Wav2Vec 2.0 vs. Wav2Vec 2.0-ASR), and also varies the
f0 feature encoder (e.g., PBTC vs. Q-LUT).
Comparing different models by varying self-supervised and f0 features
w2v2-asr-pbtc-f0-shift
w2v2-pbtc-f0-shift
w2v2-f0-embed-f0-shift
w2v2-asr-f0-embed-f0-shift
hubert-f0-embed-f0-shift
hubert-pbtc-f0-shift
ground-truth
Section D
Samples from MOS Study 4. Raters were asked to judge the speaker/singer
similarity between the synthesized audio and the reference audio with
and without f0 shifting during inference.
Comparing different models with HuBERT self-supervised features