Author: Sandhya Reddy
Neurological conditions that result in the loss of communication are devastating. Many patients rely on alternative communication devices that measure residual nonverbal movements of the head or eyes, or on brain–computer interfaces (BCIs) that control a cursor to select letters one by one to spell out words, the kind of system Stephen Hawking used. Although these systems can enhance a patient’s quality of life, most users struggle to transmit more than 10 words per minute, far slower than natural speech, which averages about 150 words per minute.
![](https://static.wixstatic.com/media/nsplsh_7850756755354a35655063~mv2_d_7360_4912_s_4_2.jpg/v1/fill/w_980,h_654,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/nsplsh_7850756755354a35655063~mv2_d_7360_4912_s_4_2.jpg)
On 24 April 2019, Gopala K. Anumanchipalli, Josh Chartier and Edward F. Chang published a paper titled “Speech synthesis from neural decoding of spoken sentences”. This blog summarizes their work, which demonstrates the feasibility of a neural speech prosthesis: translating brain signals into intelligible synthesized speech at the rate of a fluent speaker, thereby overcoming the constraints of current spelling-based approaches and enabling far higher, even natural, communication rates.
SPEECH DECODER DESIGN:
![](https://static.wixstatic.com/media/6e3b57_b5fbc769a41342cf8de217c193db94f8~mv2.jpg/v1/fill/w_980,h_357,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/6e3b57_b5fbc769a41342cf8de217c193db94f8~mv2.jpg)
Figure-1
High-density electrocorticography (ECoG) signals were recorded from five participants who were undergoing intracranial monitoring for epilepsy treatment while they spoke several hundred sentences aloud. A recurrent neural network was designed to decode the cortical signals into audible speech through an explicit intermediate representation of articulatory dynamics. This two-stage decoder is built from bidirectional long short-term memory (bLSTM) recurrent neural networks.
STAGE-1:
It decodes articulatory kinematic features from continuous neural activity recorded from the ventral sensorimotor cortex (vSMC), superior temporal gyrus (STG) and inferior frontal gyrus (IFG).
STAGE-2:
A separate bLSTM decodes acoustic features (pitch, MFCCs, excitation strengths and voicing) from the kinematic features output by stage-1. The audio signal is then synthesized from the decoded acoustic features. To integrate the two stages of the decoder, stage-2 (articulation-to-acoustics) was trained directly on the output of stage-1 (brain-to-articulation), so that it learns not only the transformation from kinematics to sound but also how to correct articulatory estimation errors made in stage-1. A speech encoder was used to infer the intermediate articulatory representation on which the neural decoder was trained. With this decoding strategy, it was possible to accurately reconstruct the speech spectrogram.
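To make the two-stage design concrete, here is a minimal PyTorch sketch of the idea. The layer sizes, feature dimensions and wiring are illustrative assumptions, not the authors’ actual configuration: one bidirectional LSTM maps ECoG feature sequences to articulatory kinematics, and a second bidirectional LSTM maps the decoded kinematics to acoustic features, so training stage-2 on stage-1 output lets it compensate for stage-1 errors.

```python
import torch
import torch.nn as nn

class TwoStageDecoder(nn.Module):
    """Sketch of a brain-to-articulation-to-acoustics decoder.

    Dimensions are illustrative, not the values used in the paper.
    """
    def __init__(self, n_ecog=256, n_kinematic=33, n_acoustic=32, hidden=100):
        super().__init__()
        # Stage 1: neural activity -> articulatory kinematic features
        self.stage1 = nn.LSTM(n_ecog, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.to_kinematics = nn.Linear(2 * hidden, n_kinematic)
        # Stage 2: decoded kinematics -> acoustic features (MFCCs, pitch, ...)
        self.stage2 = nn.LSTM(n_kinematic, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.to_acoustics = nn.Linear(2 * hidden, n_acoustic)

    def forward(self, ecog):
        # ecog: (batch, time, n_ecog)
        h1, _ = self.stage1(ecog)
        kinematics = self.to_kinematics(h1)      # (batch, time, n_kinematic)
        # Stage 2 sees stage-1 output, so it can learn to correct its errors
        h2, _ = self.stage2(kinematics)
        acoustics = self.to_acoustics(h2)        # (batch, time, n_acoustic)
        return kinematics, acoustics

# Training would supervise `kinematics` with the inferred articulatory targets
# and `acoustics` with features extracted from the recorded audio.
```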
![](https://static.wixstatic.com/media/6e3b57_720600e69a3e41e2ac78abc58c379df2~mv2.jpg/v1/fill/w_637,h_296,al_c,q_80,enc_auto/6e3b57_720600e69a3e41e2ac78abc58c379df2~mv2.jpg)
Figure-2
Overall, detailed reconstructions of speech synthesized from neural activity alone were observed. Figure-2 shows the audio spectrograms of two original spoken sentences plotted above those decoded from brain activity. The decoded spectrogram retained the salient energy patterns present in the original spectrogram and correctly reconstructed the silence between the sentences when the participant was not speaking. To understand to what degree the synthesized speech was perceptually intelligible to naive listeners, two listening tasks were conducted:
Single-word identification
Sentence-level transcription
The tasks were run on Amazon Mechanical Turk, using all 101 sentences from the test set of participant 1.
For the single-word identification task, 325 words spliced from the synthesized sentences were evaluated. The effects of word length (number of syllables) and of the number of choices (10, 25 and 50 words) on speech intelligibility were quantified, since these factors inform the optimal design of speech interfaces. Listeners were more successful at word identification as syllable length increased and as the number of word choices decreased, consistent with natural speech perception.
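As a rough illustration of how such listener responses could be summarized, the snippet below (with entirely hypothetical response data and column names) computes identification accuracy as a function of syllable count and choice-pool size:

```python
import pandas as pd

# Hypothetical listener responses from the word-identification task.
# Columns: word, n_syllables, pool_size (10/25/50 choices), correct (0/1).
df = pd.DataFrame({
    "word":        ["mum", "thin", "anything", "variety", "mum", "variety"],
    "n_syllables": [1, 1, 3, 4, 1, 4],
    "pool_size":   [10, 25, 25, 50, 50, 10],
    "correct":     [1, 0, 1, 1, 0, 1],
})

# Identification accuracy as a function of word length and choice-set size
accuracy = (df.groupby(["n_syllables", "pool_size"])["correct"]
              .mean()
              .unstack("pool_size"))
print(accuracy)
```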
![](https://static.wixstatic.com/media/6e3b57_ede6b3e7eee947359e3074088a70df4f~mv2.jpg/v1/fill/w_549,h_206,al_c,q_80,enc_auto/6e3b57_ede6b3e7eee947359e3074088a70df4f~mv2.jpg)
Figure-3
For sentence-level intelligibility, a closed-vocabulary, free transcription task was designed. Listeners heard an entire synthesized sentence and transcribed what they heard by selecting words from a defined pool (of either 25 or 50 words) that included the target words and random words from the test set.
Decoding performance was then quantified at the feature level for all participants. In speech synthesis, the spectral distortion of synthesized speech from ground truth is commonly reported as the mean mel-cepstral distortion (MCD). Mel-frequency bands emphasize the distortion of perceptually relevant frequency bands of the audio spectrogram. For the five participants (participants 1–5), the median MCD scores of decoded speech ranged from 5.14 dB to 6.58 dB.
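For reference, mel-cepstral distortion can be computed along the following lines. This is a generic sketch assuming time-aligned mel-cepstral coefficient matrices, not the authors’ exact evaluation code:

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_dec):
    """Mean mel-cepstral distortion (MCD) in dB.

    mc_ref, mc_dec: arrays of shape (time, n_coeffs) holding the mel-cepstral
    coefficients of the reference and decoded audio, already time-aligned.
    The 0th (energy) coefficient is conventionally excluded.
    """
    diff = mc_ref[:, 1:] - mc_dec[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()
```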
![](https://static.wixstatic.com/media/6e3b57_2689db38761843b8a3a6dc4e0e6ee088~mv2.jpg/v1/fill/w_563,h_170,al_c,q_80,enc_auto/6e3b57_2689db38761843b8a3a6dc4e0e6ee088~mv2.jpg)
Figure-4
For each sentence and feature, Pearson’s correlation coefficient was computed using every sample (at 200 Hz) of that feature. The sentence correlations between the mean decoded acoustic features (consisting of intensity, MFCCs, excitation strengths and voicing) and the inferred kinematics were plotted across participants. Prosodic features such as pitch (F0), the speech envelope and voicing were decoded well above the level expected by chance (r > 0.6, except F0 for participant 2, where r = 0.49, and all features for participant 5). Correlation decoding performance for all other features was also plotted.
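A per-sentence feature correlation of this kind takes only a few lines of NumPy. The helper below is a simple sketch that assumes the decoded and reference feature matrices are already sampled at the same 200 Hz rate:

```python
import numpy as np

def sentence_feature_correlations(decoded, reference):
    """Pearson's r per feature for one sentence.

    decoded, reference: (time, n_features) arrays sampled at 200 Hz.
    Returns one correlation coefficient per feature column.
    """
    return np.array([np.corrcoef(decoded[:, f], reference[:, f])[0, 1]
                     for f in range(decoded.shape[1])])
```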
DECODER CHARACTERISTICS:
The following analyses were performed on data from participant 1. When designing a neural decoder for clinical applications, there are several key considerations that determine model performance.
First, in patients with severe paralysis or limited speech ability, training data may be very difficult to obtain. Therefore, the amount of data necessary to achieve a high level of performance was analysed. A clear advantage was found in explicitly modelling articulatory kinematics as an intermediate step, rather than decoding acoustics directly from the ECoG signals. The ‘direct’ decoder was a bLSTM recurrent neural network optimized to decode acoustics (MFCCs) directly from the same ECoG signals used by the articulatory decoder. Performance continued to improve with the addition of data.
![](https://static.wixstatic.com/media/6e3b57_0e3f5b59cd004ceebe53d094dcad8511~mv2.jpg/v1/fill/w_916,h_806,al_c,q_85,enc_auto/6e3b57_0e3f5b59cd004ceebe53d094dcad8511~mv2.jpg)
Table-1
Second, to understand the phonetic properties preserved in synthesized speech, the Kullback–Leibler divergence was used to compare the distribution of spectral features of each decoded phoneme to that of each ground-truth phoneme, to determine how similar they were.
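One simple way to approximate such a comparison is to fit a diagonal-covariance Gaussian to the pooled spectral frames of each phoneme and use the closed-form Gaussian KL divergence. The paper’s exact estimator may differ, so the sketch below is only illustrative:

```python
import numpy as np

def gaussian_kl(x_decoded, x_truth, eps=1e-6):
    """KL divergence between two sets of spectral feature frames,
    each approximated by a diagonal-covariance Gaussian.

    x_decoded, x_truth: (n_frames, n_features) arrays of spectral features
    pooled over all occurrences of one phoneme.
    """
    mu_p, var_p = x_decoded.mean(0), x_decoded.var(0) + eps
    mu_q, var_q = x_truth.mean(0),   x_truth.var(0) + eps
    # Closed-form KL(P || Q) for diagonal Gaussians, summed over dimensions
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)
```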
Third, since the success of the decoder depends on the initial electrode placement, the contributions of several anatomical regions involved in continuous speech production (vSMC, STG and IFG) were examined. Decoders were trained in a leave-one-region-out fashion, in which all electrodes from a particular region were held out. Removing any region led to some decrease in decoder performance; however, excluding the vSMC resulted in the largest decrease.
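A leave-one-region-out analysis boils down to dropping the electrodes assigned to one region before retraining. The snippet below sketches that masking step with a hypothetical electrode-to-region mapping; the electrode counts are made up for illustration:

```python
import numpy as np

# Hypothetical mapping from electrode index to anatomical region.
electrode_region = np.array(["vSMC"] * 100 + ["STG"] * 80 + ["IFG"] * 40)

def leave_region_out(ecog, region):
    """Return a copy of the ECoG feature matrix (time, n_electrodes)
    with all electrodes belonging to `region` removed."""
    keep = electrode_region != region
    return ecog[:, keep]

# A decoder would then be retrained on leave_region_out(ecog, "vSMC"), etc.,
# and its test performance compared against the all-regions model.
```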
Fourth, they investigated whether the decoder generalized to novel sentences that were never seen in the training data. Because participant 1 produced some sentences multiple times, they compared two decoders:
One that was trained on all sentences.
The other trained excluding every instance of the sentences in the testing set.
SYNTHESIZING MIMED SPEECH:
To rule out the possibility that the decoder relies on the auditory feedback of the participants’ vocalization, and to simulate a setting in which subjects do not overtly vocalize, they tested their decoder on silently mimed speech.
![](https://static.wixstatic.com/media/6e3b57_da1bcbe653c943d782e3a61b9d15b81f~mv2.jpg/v1/fill/w_499,h_202,al_c,q_80,enc_auto/6e3b57_da1bcbe653c943d782e3a61b9d15b81f~mv2.jpg)
Figure-5
They tested a held-out set of 58 sentences in which participant 1 audibly produced each sentence and then mimed the same sentence, making the same articulatory movements but without making sound.
![](https://static.wixstatic.com/media/6e3b57_9b38953808db4ddfac3c50e8a42e80ab~mv2.jpg/v1/fill/w_520,h_211,al_c,q_80,enc_auto/6e3b57_9b38953808db4ddfac3c50e8a42e80ab~mv2.jpg)
Figure-6
Even though the decoder was not trained on mimed sentences, the spectrograms of synthesized silent speech showed spectral patterns similar to those of synthesized audible speech of the same sentence. With no original audio to compare against, they quantified the performance of the synthesized mimed sentences using the audio from the trials with spoken sentences.
![](https://static.wixstatic.com/media/6e3b57_ed434fe9ec604a1fb4155caa26445551~mv2.jpg/v1/fill/w_512,h_190,al_c,q_80,enc_auto/6e3b57_ed434fe9ec604a1fb4155caa26445551~mv2.jpg)
Figure-7
They calculated the spectral distortion and correlation of the spectral features by first dynamically time-warping the spectrogram of the synthesized mimed speech to match the temporal profile of the audible sentence and then comparing performance.
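Dynamic time warping itself is a standard algorithm; the sketch below (plain NumPy, illustrative rather than the authors’ implementation) aligns a mimed-speech feature sequence to the corresponding audible one, after which MCD and correlations can be computed on the aligned frames:

```python
import numpy as np

def dtw_align(mimed, audible):
    """Dynamic time warping of two feature sequences of shape (time, n_features).

    Returns index arrays (mimed_idx, audible_idx) so that mimed[mimed_idx] and
    audible[audible_idx] are frame-aligned for MCD or correlation analysis.
    """
    n, m = len(mimed), len(audible)
    # Pairwise Euclidean frame distances
    cost = np.linalg.norm(mimed[:, None, :] - audible[None, :, :], axis=-1)
    # Accumulated cost with the standard step pattern
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    # Backtrack from the end to recover the optimal warping path
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    mimed_idx, audible_idx = map(np.array, zip(*path))
    return mimed_idx, audible_idx
```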
![](https://static.wixstatic.com/media/6e3b57_36e229771504402db527b2353fe19bc7~mv2.jpg/v1/fill/w_538,h_209,al_c,q_80,enc_auto/6e3b57_36e229771504402db527b2353fe19bc7~mv2.jpg)
Figure-8
STATE–SPACE OF DECODED SPEECH ARTICULATION:
These findings suggest that modelling the underlying kinematics enhances decoding performance. To better understand the nature of the kinematics decoded from population neural activity, they examined low-dimensional kinematic state–space trajectories, computed by projecting the articulatory kinematic features onto their principal components.
The first ten principal components (of 33 in total) captured 85% of the variance, and the first two captured 35%. They projected the kinematic trajectory of an example sentence onto the first two principal components. These trajectories were well decoded, as shown in the example and summarized across all test sentences and participants (median r > 0.72 for all participants except participant 5, where r represents the mean r of the first two principal components). Furthermore, the state–space trajectories of mimed speech were also well decoded.
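In code, this projection is essentially a PCA fit on the inferred kinematics followed by projection of both trajectories. The helper below is a sketch under that assumption; the paper’s exact fitting procedure (for example, pooling over all sentences) may differ:

```python
import numpy as np
from sklearn.decomposition import PCA

def kinematic_state_space(inferred_kin, decoded_kin, n_components=2):
    """Project inferred and decoded articulatory kinematics (time, 33)
    onto the leading principal components of the inferred kinematics,
    then correlate the two low-dimensional trajectories."""
    pca = PCA(n_components=n_components).fit(inferred_kin)
    ref_traj = pca.transform(inferred_kin)
    dec_traj = pca.transform(decoded_kin)
    r = [np.corrcoef(ref_traj[:, k], dec_traj[:, k])[0, 1]
         for k in range(n_components)]
    return ref_traj, dec_traj, np.mean(r)
```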
The state–space trajectories appeared to manifest the dynamics of syllabic patterns in continuous speech. The time courses of consonants and vowels were plotted on the state–space trajectories and tended to correspond with the troughs and peaks of the trajectories, respectively.
![](https://static.wixstatic.com/media/6e3b57_2d2bd1d653054373a38d4d236bc54bfe~mv2.jpg/v1/fill/w_980,h_740,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_2d2bd1d653054373a38d4d236bc54bfe~mv2.jpg)
Figure-9
Next, they sampled every vowel-to-consonant transition (n = 22,453) and consonant-to-vowel transition (n = 22,453), and plotted 500-ms traces of the average trajectories for principal components 1 and 2, centred at the time of transition. Both types of trajectory were biphasic in nature, transitioning from a ‘high’ state during the vowel to a ‘low’ state during the consonant, and vice versa. When examining transitions of specific phonemes, they found that principal components 1 and 2 retained their biphasic trajectories of vowel or consonant states, but showed specificity towards particular phonemes.
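Extracting and averaging the transition-centred traces is straightforward once the transition frames are known. A sketch, assuming the 200 Hz feature rate mentioned earlier, is:

```python
import numpy as np

def average_transition_traces(pc_traj, transition_frames, fs=200, window_ms=500):
    """Average state-space traces centred on vowel/consonant transitions.

    pc_traj: (time, 2) trajectory of principal components 1 and 2.
    transition_frames: frame indices of, for example, consonant-to-vowel
    transitions. Returns the mean (window, 2) trace across transitions.
    """
    half = int(fs * window_ms / 1000 / 2)   # +/- 250 ms at 200 Hz
    traces = [pc_traj[t - half:t + half]
              for t in transition_frames
              if t - half >= 0 and t + half <= len(pc_traj)]
    return np.mean(traces, axis=0)
```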
In conclusion, this blog covered the following topics:
Speech decoder design.
The two stages of the speech decoder design.
Speech decoder characteristics.
Results of synthesizing mimed speech.
State–space of decoded speech articulation.