Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

Yuchen Hu*, Chen Chen*, Siyin Wang, Eng Siong Chng, Chao Zhang

Abstract

In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed reverse inference, based on the Bayesian principle, which suggests that high-quality generated speech should itself serve as a usable prompt for subsequent generation with the same TTS model. By using reverse inference as the criterion for selecting RLHF exemplars from the speech samples generated by the TTS system itself, RIO steers the subsequent optimization towards enhancing TTS robustness. The RIO framework, comprising sampling, automatic annotating, and learning, obviates the need for a reward model or pairwise preference data, and significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions. Our experimental results verify that RIO effectively improves both subjective and objective metrics, including mean opinion scores, word error rates, and speaker similarity. Remarkably, RIO also reduces the incidence of bad outputs to nearly zero percent, rivalling the robustness obtained when using ground-truth speech as the prompt.

RIO Framework

The overall framework of RIO.
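
The sampling and automatic annotating stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the callables `synthesize` and `score`, the `Exemplar` record, the choice to re-synthesize the prompt's own transcript during reverse inference, and the top/bottom `keep` selection rule are all assumptions made for illustration. The learning stage is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (assumptions, not part of any real TTS library):
#   synthesize(text, prompt_speech) -> generated speech (any object)
#   score(speech, reference_text)   -> float, higher means better quality
Synthesize = Callable[[str, object], object]
Score = Callable[[object, str], float]


@dataclass
class Exemplar:
    text: str
    speech: object
    reverse_score: float
    desirable: bool  # True for selected good samples, False for bad ones


def reverse_inference_score(synthesize: Synthesize, score: Score,
                            generated: object, prompt_text: str) -> float:
    """Use the generated speech as the prompt for another generation with
    the same model, then score that second output (assumed setup)."""
    reverse_speech = synthesize(prompt_text, generated)
    return score(reverse_speech, prompt_text)


def select_exemplars(synthesize: Synthesize, score: Score,
                     prompt_speech: object, prompt_text: str,
                     target_text: str, num_samples: int = 8,
                     keep: int = 2) -> List[Exemplar]:
    """Sample several outputs, annotate each by reverse inference, and
    keep the best and worst `keep` samples as exemplars."""
    if num_samples < 2 * keep:
        raise ValueError("num_samples must be at least 2 * keep")
    scored = []
    for _ in range(num_samples):
        # Sampling: the TTS model is assumed to be stochastic.
        generated = synthesize(target_text, prompt_speech)
        s = reverse_inference_score(synthesize, score, generated, prompt_text)
        scored.append((s, generated))
    # Sort by score only, so speech objects never need to be comparable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = [Exemplar(target_text, g, s, True) for s, g in scored[:keep]]
    worst = [Exemplar(target_text, g, s, False) for s, g in scored[-keep:]]
    return best + worst
```

Because the exemplars carry only a desirable/undesirable label, a sketch like this needs neither a separate reward model nor paired preferences, which matches the property the abstract claims for RIO.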

Comparison of Zero-shot TTS Results

[Audio columns not reproduced: Speech Prompt, VoiceCraft, RIO, Ground-Truth]

Sample | Issue Type | Text
1 | Truncation | Her eyes seemed to regard him with mild pity her holiness a strange light glowing faintly upon her frail flesh did not humiliate the sinner who approached her
2 | Truncation | Why was the sacrament of the eucharist instituted under the two species of bread and wine if jesus christ be present body and blood soul and divinity in the bread alone and in the wine alone
3 | Truncation | It was idle for him to move himself to be generous towards them to tell himself that if he ever came to their gates stripped of his pride beaten and in beggar's weeds that they would be generous towards him loving him as themselves
4 | Truncation | Number ten fresh nelly is waiting on you good night husband
5 | Repetition | Stephen leaning back and drawing idly on his scribbler listened to the talk about him which heron checked from time to time by saying
6 | Prosody and tone | How strange it seemed to the sad woman as she watched the growth and the beauty that became every day more brilliant and the intelligence that threw its quivering sunshine over the tiny features of this child
7 | Prosody and tone | As to any other kind of discipline whether addressed to her mind or heart little pearl might or might not be within its reach in accordance with the caprice that ruled the moment
8 | Speech quality | It was a look so intelligent yet inexplicable perverse sometimes so malicious but generally accompanied by a wild flow of spirits that hester could not help questioning at such moments whether pearl was a human child
9 | Speech quality | But pearl who was a dauntless child after frowning stamping her foot and shaking her little hand with a variety of threatening gestures suddenly made a rush at the knot of her enemies and put them all to flight
10 | Speech quality | She screamed and shouted too with a terrific volume of sound which doubtless caused the hearts of the fugitives to quake within them