NYU researchers build a groundbreaking AI speech synthesis system

A team of researchers from New York University has made progress in neural speech decoding, bringing us closer to a future in which individuals who have lost the ability to speak can regain their voice. 

The study, published in Nature Machine Intelligence, presents a novel deep learning framework that accurately translates brain signals into intelligible speech. 

People who have lost the ability to speak due to stroke, degenerative disease, or physical trauma might be able to use such devices to speak through a voice synthesizer driven by their thoughts alone. 

The framework's first stage involves a deep learning model that maps electrocorticography (ECoG) signals to a set of interpretable speech features, such as pitch, loudness, and the spectral content of speech sounds.

The ECoG data captures the essential elements of speech production, enabling the system to generate a compact representation of the intended speech.

The second stage involves a neural speech synthesizer that converts the extracted speech features into a spectrogram, which can then be transformed into an audible speech waveform. 
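
To make the two-stage design concrete, here is a minimal sketch of the idea in PyTorch. The module names, layer sizes, and feature counts are assumptions made for illustration; this is not the authors' architecture, just an ECoG-to-parameters decoder feeding a parameters-to-spectrogram synthesizer.

```python
# Minimal sketch of the two-stage idea; not the authors' architecture.
# Electrode counts, feature counts, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class EcogDecoder(nn.Module):
    """Stage 1: map ECoG frames to interpretable speech parameters."""
    def __init__(self, n_electrodes=64, n_params=24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_electrodes, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, n_params, kernel_size=5, padding=2),
        )

    def forward(self, ecog):            # ecog: (batch, electrodes, time)
        return self.net(ecog)           # (batch, n_params, time): pitch, loudness, spectral shape

class SpeechSynthesizer(nn.Module):
    """Stage 2: turn the speech parameters into a mel spectrogram."""
    def __init__(self, n_params=24, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_params, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, params):          # params: (batch, n_params, time)
        return self.net(params)         # (batch, n_mels, time) spectrogram

decoder, synthesizer = EcogDecoder(), SpeechSynthesizer()
spectrogram = synthesizer(decoder(torch.randn(1, 64, 200)))   # dummy ECoG clip
```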

How the study works

Here’s how it works in greater detail:

Gathering brain data

The first step involves collecting the raw data needed to train the speech-decoding model. The researchers worked with 48 participants who were undergoing neurosurgery for epilepsy.

During the study, these participants were asked to read hundreds of sentences aloud while their brain activity was recorded using ECoG grids. These grids are placed directly on the brain’s surface and capture electrical signals from the brain regions involved in speech production.
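
For a rough picture of what one training example might contain, the sketch below pairs an ECoG clip with the corresponding audio and sentence text. The field names, channel count, and durations are assumptions for the example, not the study's exact recording setup.

```python
# Illustrative structure of one training trial; field names, channel count,
# and sampling rates are assumptions, not the study's recording parameters.
from dataclasses import dataclass
import numpy as np

@dataclass
class SpeechTrial:
    ecog: np.ndarray     # (n_electrodes, n_samples) signals from the cortical grid
    audio: np.ndarray    # (n_audio_samples,) microphone recording of the spoken sentence
    sentence: str        # the sentence the participant read aloud

trial = SpeechTrial(
    ecog=np.zeros((64, 2000)),     # e.g. ~2 s of 64-channel ECoG
    audio=np.zeros(32000),         # e.g. ~2 s of audio at 16 kHz
    sentence="The quick brown fox jumps over the lazy dog.",
)
```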

Mapping brain signals to speech

Using the paired brain and speech recordings, the researchers developed an AI model that maps the recorded brain signals to specific speech features, such as pitch, loudness, and the distinct frequencies that make up different speech sounds. 
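
One way to picture this step: the target speech features can be computed from the audio the participants produced and used to supervise the decoder. The sketch below extracts pitch, loudness, and a mel spectrogram with librosa; treating exactly these as the training targets is an assumption for illustration, not the paper's precise feature set.

```python
# Hypothetical extraction of reference speech features from the recorded audio,
# which a decoder could then be trained to predict from ECoG.
import numpy as np
import librosa

def reference_features(audio, sr=16000, hop=160):
    # Pitch track (pyin returns NaN for unvoiced frames; replaced with 0 here).
    f0, _, _ = librosa.pyin(audio, fmin=65.0, fmax=300.0, sr=sr, hop_length=hop)
    # Frame-level loudness.
    loudness = librosa.feature.rms(y=audio, hop_length=hop)[0]
    # Spectral content as a mel spectrogram.
    spectral = librosa.feature.melspectrogram(y=audio, sr=sr, hop_length=hop, n_mels=80)
    return np.nan_to_num(f0), loudness, spectral

f0, loudness, spectral = reference_features(np.random.randn(16000))   # one second of dummy audio
```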

Synthesizing speech from features

The third step focuses on converting the speech features extracted from brain signals back into audible speech. The researchers used a special speech synthesizer that takes the extracted features and generates a spectrogram—a visual representation of the speech sounds. 
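
As a simple stand-in for that final conversion, the sketch below inverts a mel spectrogram to a waveform using Griffin-Lim phase reconstruction. The study's own synthesis stage works differently, so treat this only as an illustration of the spectrogram-to-audio step.

```python
# Illustrative spectrogram-to-waveform conversion via Griffin-Lim;
# a stand-in for the study's synthesis stage, not a reproduction of it.
import numpy as np
import librosa

def spectrogram_to_audio(mel_spec, sr=16000, n_fft=1024, hop=160):
    # Approximate the linear-frequency magnitudes from the mel spectrogram,
    # then estimate phase with Griffin-Lim to obtain a waveform.
    linear = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop)

waveform = spectrogram_to_audio(np.random.rand(80, 200))   # dummy 80-band mel spectrogram
```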

Evaluating the results

The researchers compared the speech generated by their model to the original speech spoken by the participants. They used objective metrics to measure the similarity between the two and found that the generated speech closely matched the original’s content and rhythm. 
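
One simple objective measure of that similarity is the Pearson correlation between the decoded and reference spectrograms, as sketched below. The study reports its own set of evaluation metrics, so this is only an illustrative comparison.

```python
# Illustrative similarity score between two spectrograms of the same shape;
# not necessarily the exact metric used in the study.
import numpy as np

def spectrogram_correlation(decoded, reference):
    """Pearson correlation across all time-frequency bins."""
    return np.corrcoef(decoded.ravel(), reference.ravel())[0, 1]

score = spectrogram_correlation(np.random.rand(80, 200), np.random.rand(80, 200))
```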

Testing on new words

To test whether the model can handle new words it hasn't seen before, the researchers intentionally left certain words out of the training data and then evaluated the model's performance on those unseen words.

The model’s ability to accurately decode even new words demonstrates its potential to generalize and handle diverse speech patterns.
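
A held-out-word split of this kind could look like the sketch below, where any sentence containing a reserved word is routed to the test set so the model never encounters those words during training. The reserved words here are invented for the example.

```python
# Hypothetical held-out-word split; the reserved words are made up for illustration.
HELD_OUT_WORDS = {"mountain", "whisper"}

def split_by_held_out_words(sentences):
    train, test = [], []
    for sentence in sentences:
        words = {w.strip(".,!?").lower() for w in sentence.split()}
        (test if words & HELD_OUT_WORDS else train).append(sentence)
    return train, test

train_set, test_set = split_by_held_out_words([
    "The mountain road was closed.",     # contains a held-out word -> test set
    "She answered the phone quickly.",   # -> training set
])
```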

NYU's voice synthesis system. Source: Nature Machine Intelligence (open access)

The top section of the diagram shows the process for converting brain signals to speech. First, a decoder turns the signals into speech parameters over time. Then, a synthesizer generates spectrograms (visual representations of the sound) from these parameters, and a final component converts those spectrograms back into sound waves.

The bottom section shows a speech-to-speech path used to help train the decoder: it takes a spectrogram, converts it into speech parameters, and then uses those parameters to reconstruct a new spectrogram. This part of the system learns directly from actual speech recordings.

After training, only the top process is needed to turn brain signals into speech.
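
A rough sketch of that training scheme is below, with made-up layer sizes and dummy tensors standing in for real data: the speech-to-speech path (bottom of the diagram) is first trained as an autoencoder on real spectrograms, and the ECoG decoder (top path) is then trained through the frozen synthesizer.

```python
# Sketch of the two training phases described above; sizes, optimizers, and
# dummy data are assumptions for illustration, not the study's training recipe.
import torch
import torch.nn as nn

n_electrodes, n_params, n_mels = 64, 24, 80   # illustrative sizes

speech_encoder = nn.Sequential(               # spectrogram -> speech parameters
    nn.Conv1d(n_mels, 128, 5, padding=2), nn.ReLU(), nn.Conv1d(128, n_params, 5, padding=2))
synthesizer = nn.Sequential(                  # speech parameters -> spectrogram
    nn.Conv1d(n_params, 128, 5, padding=2), nn.ReLU(), nn.Conv1d(128, n_mels, 5, padding=2))
ecog_decoder = nn.Sequential(                 # ECoG -> speech parameters
    nn.Conv1d(n_electrodes, 128, 5, padding=2), nn.ReLU(), nn.Conv1d(128, n_params, 5, padding=2))
loss_fn = nn.MSELoss()

# Dummy tensors standing in for real speech spectrograms and paired ECoG/speech clips.
speech_specs = [torch.randn(1, n_mels, 200) for _ in range(4)]
pairs = [(torch.randn(1, n_electrodes, 200), torch.randn(1, n_mels, 200)) for _ in range(4)]

# Phase 1: speech-to-speech pre-training (bottom path of the diagram).
opt = torch.optim.Adam([*speech_encoder.parameters(), *synthesizer.parameters()])
for spec in speech_specs:
    loss = loss_fn(synthesizer(speech_encoder(spec)), spec)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: train the ECoG decoder through the now-frozen synthesizer (top path).
for p in synthesizer.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(ecog_decoder.parameters())
for ecog, spec in pairs:
    loss = loss_fn(synthesizer(ecog_decoder(ecog)), spec)
    opt.zero_grad(); loss.backward(); opt.step()
```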

One key advantage of the NYU approach is its ability to achieve high-quality speech decoding without the need for ultra-high-density electrode arrays, which are impractical for long-term implantation. This makes it a lighter-weight, more portable solution than the approaches taken in much other research in this field. 

Another notable finding is the successful decoding of speech from both the left and right hemispheres of the brain, which is important for potential use in patients with speech loss due to unilateral brain damage.

Converting thoughts to speech using AI

The NYU study builds upon previous research in neural speech decoding and brain-computer interfaces (BCIs). 

In 2023, a team at the University of California, San Francisco, enabled a paralyzed stroke survivor to generate sentences at a speed of 78 words per minute using a BCI that synthesized both vocalizations and facial expressions from brain signals. 

Other recent studies have explored the use of AI to interpret various aspects of human thought from brain activity. Researchers have demonstrated the ability to generate images, text, and even music from fMRI and EEG data.

For example, a study from the University of Helsinki used EEG signals to guide a generative adversarial network (GAN) in producing facial images that matched participants’ thoughts.

Meta AI also developed a technique for decoding what someone was listening to using brainwaves collected non-invasively.

However, it stopped short of predicting speech from thought alone. 

Opportunities and challenges

NYU’s method uses more widely available and clinically viable electrodes than past methods, making it more accessible.

The holy grail would be decoding speech from brainwaves collected with a removable, non-invasive device rather than electrodes physically inserted into the brain. 

While these advancements are exciting, major obstacles must be overcome before mind-reading AI can be widely applied. 

For one, collecting high-quality brain data requires extensive training for machine learning models, and individual differences in brain activity can make generalization difficult. 

Nevertheless, the NYU study represents a stride in this direction by demonstrating high-accuracy speech decoding using lighter-weight ECoG arrays. 

Looking ahead, the NYU team aims to refine their models for real-time speech decoding, bringing us closer to the ultimate goal of enabling natural, fluent conversations for individuals with speech impairments. 

They also intend to adapt the system to work with fully implantable wireless devices that can be used in everyday life.