Examples of "CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction"

paper:


1. Comparison of speech recognition and alignment (S2T) modules
In this part, we compare three S2T modules: Oracle, F-Align and CTC-Align.

Oracle: In this setting, the ground-truth perturbed location and perturbed content are used for correction.
F-Align: This setting uses a byte pair encoding (BPE) based automatic speech recognition (ASR) model to recognize the word sequence from speech. A grapheme-to-phoneme (G2P) module then converts the word sequence to a phone sequence. Finally, a forced aligner aligns the word sequence and the phone sequence with the speech to obtain word-level and phone-level time stamps, which are used for correction.
CTC-Align: This setting uses a phone-based connectionist temporal classification (CTC) ASR model to recognize the phone sequence from speech, with greedy decoding to obtain the recognition result. In phone-based CTC decoding, the output of each frame is either a phone label or a "blank" label. By counting frame indices, we obtain the time stamp of each phone, i.e. phone-level time stamps from the speech, which are used for correction.
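
The frame-counting idea behind CTC-Align can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ctc_phone_timestamps`, the `<blank>` symbol, the 10 ms frame shift and the choice to end a phone at its last labeled frame are all assumptions made for the example.

```python
from typing import List, Tuple

BLANK = "<blank>"  # hypothetical name for the CTC blank label


def ctc_phone_timestamps(
    frame_labels: List[str], frame_shift_ms: int = 10
) -> List[Tuple[str, int, int]]:
    """Turn per-frame greedy CTC labels into (phone, start_ms, end_ms) segments.

    Consecutive frames with the same phone are merged into one segment.
    A blank between two identical phones separates them, as in CTC collapsing.
    The 10 ms frame shift is an assumed value, not taken from the paper.
    """
    segments: List[List] = []
    prev = BLANK
    for idx, label in enumerate(frame_labels):
        if label != BLANK:
            if label == prev:
                # Same phone continues: extend the current segment by one frame.
                segments[-1][2] = idx + 1
            else:
                # A new phone starts at this frame index.
                segments.append([label, idx, idx + 1])
        prev = label
    return [(p, s * frame_shift_ms, e * frame_shift_ms) for p, s, e in segments]


# Toy per-frame output of greedy decoding (argmax label of each frame).
frames = ["<blank>", "HH", "HH", "<blank>", "AH", "AH", "AH", "<blank>", "AH"]
for phone, start_ms, end_ms in ctc_phone_timestamps(frames):
    print(phone, start_ms, end_ms)
```

In this sketch, blank frames are not assigned to any phone, so neighbouring segments may leave gaps between them; a real system would need some rule for distributing those frames.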

We use the Word-Phone correction method in this part.

Each sample provides an original audio, a perturbed audio and three corrected audios, one from each system. The original audio is the ground-truth recording from the VCTK dataset. The perturbed audio is obtained by applying word-level perturbations (insertion, replacement and deletion) to the original audio. The corrected audio is obtained by feeding the perturbed audio and the unperturbed text to the system as input.
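
To make the three perturbation types concrete at the text level, the sketch below compares a perturbed transcript with the unperturbed text and lists the word-level edits needed to get back to the original. It only illustrates the idea and uses Python's standard `difflib`; the helper name `word_edits` is hypothetical, and the page does not state how the system itself compares the two sequences.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_edits(perturbed: str, original: str) -> List[Tuple[str, List[str], List[str]]]:
    """List the word-level edits that turn the perturbed text into the original.

    Each item is (operation, perturbed_words, original_words), where operation
    is "insert", "delete" or "replace" from the point of view of the correction.
    """
    src, tgt = perturbed.split(), original.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, src[i1:i2], tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Texts of Sample 2 and Sample 3 from this page.
print(word_edits("salary changes are long number overdue",
                 "salary changes are long overdue"))
print(word_edits("require months later he was dead",
                 "eight months later he was dead"))
```

Note the direction: a word inserted during perturbation (as in Sample 2) appears here as a deletion, because the edits describe the correction from the perturbed text back to the original.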


Sample:1
original audio perturbed audio
original text:initial reports said the aircraft had experienced a loss of power
perturbed text:initial said the aircraft had experienced reported loss of power
Oracle system corrected audio F-Align system corrected audio CTC-Align system corrected audio




Sample:2
original audio perturbed audio
original text:salary changes are long overdue
perturbed text:salary changes are long number overdue
Oracle system corrected audio F-Align system corrected audio CTC-Align system corrected audio




Sample:3
original audio perturbed audio
original text:eight months later he was dead
perturbed text:require months later he was dead
Oracle system corrected audio F-Align system corrected audio CTC-Align system corrected audio




Sample:4
original audio perturbed audio
original text:the hibs manager is no fool
perturbed text:hands the hibs manager is no fool
Oracle system corrected audio F-Align system corrected audio CTC-Align system corrected audio




Sample:5
original audio perturbed audio
original text:the difference in the rainbow depends considerably upon the size of the drops and the width of the colored band increases as the size of the drops increases
perturbed text:the difference in the depends considerably upon the size of drops and the width tragedy of the colored band office french the size of the drops increases
Oracle system corrected audio F-Align system corrected audio CTC-Align system corrected audio




2. Comparison of speech correction methods
In this part, we compare three correction methods: Word-Word, Word-Phone and Phone-Phone.

Phone-Phone: Phone-level correction by phone-level alignment
Word-Word: Word-level correction by word-level alignment
Word-Phone: Word-level correction by phone-level alignment

We use the F-Align S2T module in this part.

Each sample provides an original audio, a perturbed audio and three corrected audios, one from each correction method. The original audio is the ground-truth recording from the VCTK dataset. The perturbed audio is obtained by applying word-level perturbations (insertion, replacement and deletion) to the original audio. The corrected audio is obtained by feeding the perturbed audio and the unperturbed text to the system as input.


Sample:1
original audio perturbed audio
original text:initial reports said the aircraft had experienced a loss of power
perturbed text:initial said the aircraft had experienced reported loss of power
Word-Word method corrected audio Word-Phone method corrected audio Phone-Phone method corrected audio




Sample:2
original audio perturbed audio
original text:salary changes are long overdue
perturbed text:salary changes are long number overdue
Word-Word method corrected audio Word-Phone method corrected audio Phone-Phone method corrected audio




Sample:3
original audio perturbed audio
original text:eight months later he was dead
perturbed text:require months later he was dead
Word-Word method corrected audio Word-Phone method corrected audio Phone-Phone method corrected audio




Sample:4
original audio perturbed audio
original text:the hibs manager is no fool
perturbed text:hands the hibs manager is no fool
Word-Word method corrected audio Word-Phone method corrected audio Phone-Phone method corrected audio




Sample:5
original audio perturbed audio
original text:the difference in the rainbow depends considerably upon the size of the drops and the width of the colored band increases as the size of the drops increases
perturbed text:the difference in the depends considerably upon the size of drops and the width tragedy of the colored band office french the size of the drops increases
Word-Word method corrected audio Word-Phone method corrected audio Phone-Phone method corrected audio




3. Application to accent reduction
In this part, we apply CorrectSpeech to the L2-ARCTIC dataset for accent reduction.

F-Align+Word-Phone: In this setting, the F-Align S2T module is combined with the Word-Phone correction method.
F-Align+Phone-Phone: In this setting, the F-Align S2T module is combined with the Phone-Phone correction method.

Each sample provides an accented audio and two modified audios, one from each system. The accented audio is the ground-truth recording from the L2-ARCTIC dataset. The modified audio is obtained by feeding the accented audio and the annotated text to the system as input.


Sample:1
accented audio
F-Align+Word-Phone system corrected audio F-Align+Phone-Phone system corrected audio



Sample:2
accented audio
F-Align+Word-Phone system corrected audio F-Align+Phone-Phone system corrected audio



Sample:3
accented audio
F-Align+Word-Phone system corrected audio F-Align+Phone-Phone system corrected audio



Sample:4
accented audio
F-Align+Word-Phone system corrected audio F-Align+Phone-Phone system corrected audio



Sample:5
accented audio
F-Align+Word-Phone system corrected audio F-Align+Phone-Phone system corrected audio