Facebook Engineers Can Create Uncanny Voice Clone Of Bill Gates

Aadhya Khatri - Jun 12, 2019

We are living in a world where a computer can fake almost anything, and the voice clone of Bill Gates has demonstrated that.

MelNet, the system in question, generated audio of what sounds like Bill Gates saying sentences such as "Write a fond note to the friend you cherish" and "A cramp is no small danger on a swim." The system was built by Facebook engineers.

MelNet can mimic many people, and Bill Gates is only the most famous of them. Others include Stephen Hawking, Jane Goodall, and George Takei.

MelNet cloned the voices of Bill Gates and several other public figures

Some of you may wonder why these particular people were chosen. The reason is that the engineers trained MelNet on 452 hours of TED talks. The other training source was audiobooks, which are challenging material because the narrators read in a highly animated manner.

What you hear might be impressive, but MelNet is not exactly a breakthrough in this respect. Voice cloning technology has improved significantly over the last few years, and ever since the release of SampleRNN and WaveNet, we have had a clear idea of how capable these systems can be. WaveNet was created by DeepMind, Google's AI lab in London, and it now powers the voice of Google Assistant.

Systems like WaveNet and SampleRNN are fed large amounts of data so that they can learn the features of human voices. While those systems were trained on raw audio waveforms, MelNet was fed spectrograms, a far more compact and information-dense format.

Spectrogram and waveform data
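To make the difference between the two formats concrete, here is a minimal Python sketch (using NumPy) that turns a waveform into a magnitude spectrogram with a short-time Fourier transform. The sample rate, frame size, hop length and the 440 Hz test tone are illustrative assumptions, not MelNet's actual settings. MelNet works on mel spectrograms, which add a mel-scale filterbank step that this sketch leaves out.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_size=1024, hop=256):
    """Short-time Fourier transform magnitudes: one row per time frame."""
    window = np.hanning(frame_size)
    # Number of full frames that fit when sliding by `hop` samples
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_size] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: frame_size // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 22050                               # assumed sample rate (Hz)
t = np.arange(sample_rate) / sample_rate          # one second of audio
waveform = np.sin(2 * np.pi * 440.0 * t)          # a 440 Hz test tone
spec = magnitude_spectrogram(waveform)
print(waveform.shape, spec.shape)                 # (22050,) (83, 513)
```

One second of audio, 22,050 individual samples, shrinks to 83 time frames of 513 frequency magnitudes each. That much shorter time axis illustrates why a model working on spectrograms can take in longer stretches of speech at once.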

MelNet was accompanied by a paper in which the Facebook researchers explained that while WaveNet produces high-fidelity audio, MelNet is better at capturing high-level structure: the subtle consistencies in a human voice that other people can recognize but find hard to put into words.

The density of spectrograms lets the researchers generate more consistent voices. The limitation is that the system cannot replicate how a voice changes over longer stretches of speech. AI-generated text has a similar problem.

MelNet can generate not only voices but also music, although the musical output sounds a little off.

On the bright side, this technology could lead to better voice assistants and to aids for people with speech impairments. On the other hand, people with shady intentions could use it to fake evidence or run scams.