Singing voice conversion (SVC) is the task of converting the perceived identity of the source singer to that of the target singer without changing the lyrics and rhythm. Recent approaches to traditional voice conversion employ generative models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). For SVC, however, GANs remain largely unexplored: the only system proposed in the literature uses a traditional GAN trained on parallel data, and collecting parallel data in realistic scenarios (with the same background music) is not feasible. Moreover, in the presence of background music, SVC is among the most challenging tasks, as it requires separating the vocals from the input mixture, which introduces noise. In this paper, we therefore propose a transfer learning- and fine-tuning-based Cycle-consistent GAN (CycleGAN) model for non-parallel SVC, in which music source separation is performed using a Deep Attractor Network (DANet). We design seven candidate systems to identify the best combination of transfer learning and fine-tuning. We use the more challenging MUSDB18 database as our primary dataset and the NUS-48E database to pre-train the CycleGAN. Extensive analysis via objective and subjective measures shows that, with a naturalness MOS score of $4.14$ out of $5$, the CycleGAN model pre-trained on the NUS-48E corpus performs best among the systems described in the paper.
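The core of the proposed non-parallel training objective is the CycleGAN cycle-consistency constraint: mapping source-singer features to the target domain and back should reconstruct the input. The following is a minimal, hedged sketch of that loss; the linear maps `G` and `F`, the feature dimension, and all variable names are illustrative stand-ins (in the actual system these would be neural generators operating on vocal features separated from the mixture by DANet), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 24  # illustrative feature dimension (e.g. mel-cepstral coefficients)

# Toy "generators": linear maps standing in for G: X -> Y and F: Y -> X.
W_G = np.eye(DIM) + 0.1 * rng.standard_normal((DIM, DIM))
W_F = np.linalg.inv(W_G)  # ideal inverse, so the cycle loss is ~0 here


def G(x):
    """Map source-singer features to the target-singer domain."""
    return x @ W_G


def F(y):
    """Map target-singer features back to the source-singer domain."""
    return y @ W_F


def cycle_consistency_loss(x_batch, y_batch):
    """L_cyc = E[|F(G(x)) - x|_1] + E[|G(F(y)) - y|_1]."""
    forward = np.abs(F(G(x_batch)) - x_batch).mean()
    backward = np.abs(G(F(y_batch)) - y_batch).mean()
    return forward + backward


x = rng.standard_normal((8, DIM))  # batch of source-singer feature frames
y = rng.standard_normal((8, DIM))  # batch of target-singer feature frames
loss = cycle_consistency_loss(x, y)
print(loss)  # near zero, since F is constructed as the exact inverse of G
```

In training, this term is minimized jointly with the adversarial losses of both generator/discriminator pairs, which is what removes the need for parallel (time-aligned) source/target recordings.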
| Model | Conversion | Original | Converted |
|---|---|---|---|
| Scenario1 | F ⇒ F | | |
| | M ⇒ F | | |
| | F ⇒ M | | |
| | M ⇒ M | | |
| Scenario2 | F ⇒ F | | |
| | M ⇒ F | | |
| | F ⇒ M | | |
| | M ⇒ M | | |
| Scenario3 | F ⇒ F | | |
| | M ⇒ F | | |
| | F ⇒ M | | |
| | M ⇒ M | | |
| Scenario4 | F ⇒ F | | |
| | M ⇒ F | | |
| | F ⇒ M | | |
| | M ⇒ M | | |
| Scenario5 | F ⇒ F | | |
| | M ⇒ F | | |
| | F ⇒ M | | |
| | M ⇒ M | | |
| Scenario6 | F ⇒ F | | |
| | M ⇒ F | | |
| | F ⇒ M | | |
| | M ⇒ M | | |
| Scenario7 | F ⇒ F | | |
| | M ⇒ F | | |
| | F ⇒ M | | |
| | M ⇒ M | | |

(F: female singer, M: male singer. The Original and Converted columns hold audio samples, which are not reproducible in this text version.)