Speech examples of "Environment Aware Text-to-Speech Synthesis"

Paper: arXiv


1. Seen combination of speaker and environment:
In each sample, a reference utterance from a human speaker specifies both the speaker and the environment. The speaker embedding and the environment embedding are both extracted from this reference utterance, and new speech with different text content is then synthesized from the two embeddings. Two synthesized utterances are provided, one from the baseline system and one from the proposed system. This condition examines the generation ability of the TTS system.

speaker and environment reference speech | synthesized speech from baseline system | synthesized speech from proposed system

2. Unseen combination of seen speaker and seen environment:
In each sample, two reference utterances from human speakers are provided: a speaker reference utterance that specifies the speaker and an environment reference utterance that specifies the environment. The speaker embedding is extracted from the speaker reference and the environment embedding from the environment reference, and new speech is then synthesized from the two embeddings. Two synthesized utterances are provided, one from the baseline system and one from the proposed system. Note that both the speaker identity and the environment type appear in the training dataset, but this particular combination of speaker and environment does not. This condition examines the disentanglement ability of the TTS system.

speaker reference speech | environment reference speech | synthesized speech from baseline system | synthesized speech from proposed system
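The difference between the first two conditions lies only in which reference utterance feeds each embedding extractor. The sketch below illustrates that data flow with toy stand-ins. Every name in it (`Utterance`, `extract_speaker_embedding`, `extract_environment_embedding`, `synthesize`) is hypothetical and is not the paper's code; the "embeddings" are plain labels rather than learned vectors, and no audio is generated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Utterance:
    """Toy reference utterance (hypothetical structure, for illustration only)."""
    speaker_id: str
    environment_id: str
    samples: List[float] = field(default_factory=list)  # waveform placeholder


def extract_speaker_embedding(utt: Utterance) -> Tuple[str, str]:
    # Stand-in: a real system would run a trained speaker encoder on utt.samples.
    return ("speaker", utt.speaker_id)


def extract_environment_embedding(utt: Utterance) -> Tuple[str, str]:
    # Stand-in: a real system would run a trained environment encoder on utt.samples.
    return ("environment", utt.environment_id)


def synthesize(text: str, speaker_ref: Utterance, environment_ref: Utterance) -> Dict[str, object]:
    """Return the conditioning that a TTS model would receive (no audio is produced)."""
    return {
        "text": text,
        "speaker": extract_speaker_embedding(speaker_ref),
        "environment": extract_environment_embedding(environment_ref),
    }


# Condition 1: a single reference supplies both embeddings.
ref = Utterance("speaker_a", "room_1")
seen = synthesize("new text", speaker_ref=ref, environment_ref=ref)

# Conditions 2 and 3: separate references supply the speaker and the environment.
env_ref = Utterance("speaker_b", "room_2")
mixed = synthesize("new text", speaker_ref=ref, environment_ref=env_ref)
```

In the mixed case the output should carry the voice of `speaker_a` and the acoustics of `room_2`, which is only possible if the two embeddings encode speaker and environment separately.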

3. Unseen combination of unseen speaker and unseen environment:
In each sample, two reference utterances from human speakers are provided: a speaker reference utterance that specifies the speaker and an environment reference utterance that specifies the environment. The speaker embedding is extracted from the speaker reference and the environment embedding from the environment reference, and new speech is then synthesized from the two embeddings. Two synthesized utterances are provided, one from the baseline system and one from the proposed system. Note that neither the speaker identity nor the environment type appears in the training dataset. This condition examines the generalization ability of the TTS system.

speaker reference speech | environment reference speech | synthesized speech from baseline system | synthesized speech from proposed system
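The three conditions on this page can be told apart purely from which (speaker, environment) pairs occur in the training data. The following is a minimal sketch of that bookkeeping, assuming the training set can be summarized as a set of such pairs. The function name `classify_condition` and the example identifiers are hypothetical and do not come from the paper.

```python
from typing import Set, Tuple


def classify_condition(
    speaker: str,
    environment: str,
    train_pairs: Set[Tuple[str, str]],
) -> str:
    """Map a test (speaker, environment) pair to one of the page's three conditions."""
    seen_speakers = {s for s, _ in train_pairs}
    seen_environments = {e for _, e in train_pairs}

    if (speaker, environment) in train_pairs:
        return "1. seen combination"
    if speaker in seen_speakers and environment in seen_environments:
        return "2. unseen combination of seen speaker and seen environment"
    if speaker not in seen_speakers and environment not in seen_environments:
        return "3. unseen combination of unseen speaker and unseen environment"
    # Exactly one of the two is unseen; this page has no samples for that case.
    return "not covered on this page"


# Hypothetical training pairs, for illustration only.
train_pairs = {("speaker_a", "room_1"), ("speaker_b", "room_2")}

cond_seen = classify_condition("speaker_a", "room_1", train_pairs)
cond_cross = classify_condition("speaker_a", "room_2", train_pairs)
cond_new = classify_condition("speaker_c", "room_3", train_pairs)
```

Here `cond_cross` falls under the second condition because `speaker_a` and `room_2` each occur in training, but never together.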