Google's Translatotron Can Translate Speech Directly

Aadhya Khatri - May 16, 2019

Translatotron is a huge milestone in speech-to-speech translation

Google has made another breakthrough in translation: one of its projects can now take spoken sentences and translate them into speech in another language. What makes this more impressive is that it works with audio only, with no intermediate text transcription of either the input or the output. This makes the process faster and lets it retain the original tone of the speaker.

The project is called Translatotron, and today’s achievement is the result of several years of work. It proves that speech can be translated directly, although the project is still an experiment. Google’s scientists and researchers have spent years trying to find a way to convert speech directly into speech, but only recently have their efforts borne fruit.

Before this breakthrough, speech translation worked in three steps. First, the spoken sentences are transcribed into text. Next, the text is translated from the source language into the target language. Finally, the translated text is turned into speech. This approach produces good results, but it is far from perfect. Because it chains three separate processes, a mistake can occur at any stage, and an error in one stage carries over into the next.
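The three-step process described above can be sketched in a few lines of Python. This is only an illustration of how the stages chain together and how an early mistake carries through to the output. The function names (`toy_asr`, `noisy_asr`, `toy_mt`, `toy_tts`, `cascade_translate`) and the tiny Spanish-to-English dictionary are hypothetical stand-ins, not Google's system, and audio is represented here as a plain string tagged with `audio:`.

```python
# Toy sketch of a three-step (cascade) speech translation pipeline.
# Every function below is a hypothetical stand-in, not a real system.

# Tiny illustrative word-for-word dictionary (an assumption for this example).
SPANISH_TO_ENGLISH = {"hola": "hello", "mundo": "world"}


def toy_asr(audio: str) -> str:
    """Step 1: 'transcribe' speech to text by stripping the audio tag."""
    return audio.split(":", 1)[1]


def noisy_asr(audio: str) -> str:
    """A recognizer that makes a mistake: it drops the first letter."""
    return toy_asr(audio)[1:]


def toy_mt(text: str) -> str:
    """Step 2: translate word by word; unknown words pass through unchanged."""
    return " ".join(SPANISH_TO_ENGLISH.get(word, word) for word in text.split())


def toy_tts(text: str) -> str:
    """Step 3: 'synthesize' speech by re-attaching the audio tag."""
    return "audio:" + text


def cascade_translate(audio, asr, mt, tts):
    """Run the three stages in order; each stage only sees the previous output."""
    transcript = asr(audio)
    translation = mt(transcript)
    return tts(translation)


clean = cascade_translate("audio:hola mundo", toy_asr, toy_mt, toy_tts)
broken = cascade_translate("audio:hola mundo", noisy_asr, toy_mt, toy_tts)
print(clean)   # audio:hello world
print(broken)  # audio:ola world
```

In the second call, the recognizer's mistake in step one ("ola" instead of "hola") cannot be repaired by the later stages, so it ends up in the final output. The intermediate text also carries no information about how the sentence was spoken.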

Translatotron-Spectro — The accuracy is a little bit off but the tones are kept intact

As far as we know, the best translation process is the one that happens in the brains of multilingual people. We do not know exactly how they translate speech just by hearing it, but it is definitely not the three-step approach. Since scientists often look to human cognition for guidance on how to design their machines, this disparity makes them uneasy and calls for further improvement.

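So, to improve their technology, scientists began researching how to convert spectrograms, which are visual representations of the frequencies in audio, directly from one language into another. This approach is nothing like the three-step one. It has its own drawbacks, of course, but it also has benefits.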

The first benefit is speed. The new approach is more complicated than the old one, but in essence it involves only one step, so given enough processing power, it can be expected to produce output faster than the three-step approach.

More importantly, speakers convey part of their meaning through tone, and Translatotron can keep that intact. A common criticism of the current translation process is that it strips away all of the natural rhythms of speech; with this new technology, that problem goes away.

In practice, the way speakers raise or lower their voices can have a significant impact on the meaning of a sentence. With Translatotron, people can convey not only what they say but also how they say it.

However, one of the biggest drawbacks of this method is its accuracy. Unfortunately, in this respect the new system is not as good as the three-step one. One reason is that the old method has been around for quite some time, which has given it time to be refined.

With Translatotron’s ability to carry tone across, scientists have more reason to keep working on improving the speech-to-speech system.

The team behind the project said that their system is in its infancy, but it possesses great potential. Despite the inaccuracy, the system is still an important milestone in finding a way for a machine to translate human speech.

You can hear for yourself how Translatotron works, but do not mind the accuracy of the sample sentences. They are there to demonstrate the system’s ability to retain the tone and rhythm of speech.