SSCGAN: speech style conversion based on GAN

Jixing Li; Xiaozhou Guo; Ronxuan Shen; Huaxiang Lu; Xinggang Wang; Zhanzhong Cao; Chi Zhang; Wenyu Mao

doi:10.1117/12.2636492

6 May 2022

Proceedings Volume 12176, International Conference on Algorithms, Microchips and Network Applications; 121761H (2022) https://doi.org/10.1117/12.2636492
Event: International Conference on Algorithms, Microchips, and Network Applications 2022, Zhuhai, China

Abstract

Speech conversion has significant applications in medicine, robotics, and other industries. With the rise of deep learning, CycleGAN has become widely used in speech conversion. However, existing CycleGAN-based methods do not consider both the temporal and the spatial features of the speech signal. In addition, CycleGAN training is difficult to converge because of the vanishing gradient problem in the generator. We propose SSCGAN, whose generator is a U-shaped encoder-decoder network that extracts temporal and spatial features by using a 1D CNN and a 2D CNN in parallel. A feature fusion module based on multi-scale mixed convolution is embedded between the encoder and the decoder to achieve high-level fusion of the spatial and temporal features. To make training more stable and easier to converge, SSCGAN uses the Wasserstein distance instead of the original Jensen–Shannon divergence to measure the distance between probability distributions, which alleviates the vanishing gradient problem for the generator. SSCGAN also uses the PatchGAN structure in the discriminator, which takes the local details of each sample into account by dividing it into patches and thereby improves the discriminative ability of SSCGAN. Experimental results on the nonparallel VCC 2018 corpus show that SSCGAN is superior to existing methods such as CycleGAN-VC and StarGAN-VC. In inter-gender speech conversion, the MSD of SSCGAN is 0.162 lower on average than that of the other methods, and in intra-gender speech conversion it is 0.118 lower on average. In the subjective evaluation, participants also rated SSCGAN the best.
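The abstract does not give the loss formulation, so the following is only a minimal sketch of how a Wasserstein-style objective could be combined with PatchGAN-style patch scores. It assumes the standard WGAN losses and a critic that outputs a 2D grid of scores per sample; the function names, grid shapes, and example values are hypothetical, and the Lipschitz constraint (weight clipping or gradient penalty) and the cycle-consistency terms are omitted.

```python
# Hypothetical sketch: WGAN-style losses over PatchGAN-style patch scores.
# All names, shapes, and values are illustrative assumptions, not from the paper.

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def patch_score(score_grid):
    """Average a 2D grid of per-patch critic scores into one scalar per sample."""
    return mean(s for row in score_grid for s in row)

def critic_loss(real_grids, fake_grids):
    """Standard WGAN critic loss: E[D(fake)] - E[D(real)], to be minimised."""
    fake_mean = mean(patch_score(g) for g in fake_grids)
    real_mean = mean(patch_score(g) for g in real_grids)
    return fake_mean - real_mean

def generator_loss(fake_grids):
    """Standard WGAN adversarial loss for the generator: -E[D(fake)]."""
    return -mean(patch_score(g) for g in fake_grids)

# One real and one converted sample, each scored on a 2x2 grid of patches.
real = [[[1.0, 0.5], [0.5, 1.0]]]
fake = [[[-1.0, 0.0], [0.0, -1.0]]]

print(critic_loss(real, fake))   # -1.25
print(generator_loss(fake))      # 0.5
```

Because these losses are linear in the critic scores rather than passed through a saturating log-sigmoid, the generator still receives a usable gradient when the critic separates real and converted speech confidently, which is the usual argument for the stability benefit of the Wasserstein distance.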

Citation

Jixing Li, Xiaozhou Guo, Ronxuan Shen, Huaxiang Lu, Xinggang Wang, Zhanzhong Cao, Chi Zhang, and Wenyu Mao "SSCGAN: speech style conversion based on GAN", Proc. SPIE 12176, International Conference on Algorithms, Microchips and Network Applications, 121761H (6 May 2022); https://doi.org/10.1117/12.2636492
