Speech examples of "Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement"

Paper: arXiv

Authors: Daxin Tan, Tan Lee

In this part, 10 samples are provided. Each sample comprises two groundtruth utterances and four synthesized utterances. Each synthesized utterance is generated from a phoneme embedding sequence and a style embedding sequence, each of which corresponds to one of the groundtruth utterances. We name each synthesized utterance by the sources of its phoneme embedding sequence and its style embedding sequence. For example, 'content 1 style 2 synthesized utterance' refers to the synthesized utterance whose phoneme embedding sequence corresponds to the text of groundtruth utterance 1 and whose style embedding sequence is derived from groundtruth utterance 2. Ideally, a synthesized utterance should have content similar to its content source and style similar to its style source.
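The naming scheme can be illustrated with a minimal Python sketch that lists the four content/style pairings for one sample. The function name `synthesized_names` and the `is_transfer` flag are hypothetical, introduced only for illustration. The sketch produces labels and does not reflect how the utterances themselves are synthesized.

```python
from itertools import product


def synthesized_names(sources=(1, 2)):
    """Label every content/style pairing of the groundtruth utterances.

    Returns a list of (label, is_transfer) tuples. is_transfer is True
    when the style source differs from the content source.
    """
    pairs = []
    for content, style in product(sources, repeat=2):
        label = f"content {content} style {style} synthesized utterance"
        pairs.append((label, content != style))
    return pairs


names = synthesized_names()
for label, is_transfer in names:
    kind = "cross-utterance style" if is_transfer else "same-utterance style"
    print(f"{label} ({kind})")
```

In each sample below, the two same-source pairings (content 1 style 1, content 2 style 2) are listed first, followed by the two cross-source pairings (content 1 style 2, content 2 style 1).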

sample 1:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 2:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 3:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 4:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 5:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 6:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 7:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 8:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 9:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:

sample 10:

groundtruth utterance 1:
groundtruth utterance 2:

content 1 style 1 synthesized utterance:
content 2 style 2 synthesized utterance:

content 1 style 2 synthesized utterance:
content 2 style 1 synthesized utterance:
