The Tech Tree – Speech Synthesis

Posted by Communique at 2:26 PM on Aug 19, 2019

With all the exciting progress in today’s world, it only seems appropriate to have a long-running series commenting on the possibilities of all these new ideas and technologies. The Tech Tree articles did well, so why not extend the series indefinitely?

First off, speech synthesis is the machine-driven production of spoken language from written input. Text-to-speech programs are the perfect example: you input some text, and a computer vocalizes the words – simple and not exactly revolutionary. In fact, for a long time speech synthesis saw little use beyond corporate phone menus and passive-aggressive viral videos. Yet all of a sudden there is widespread interest in the field, with Microsoft, Google and other tech giants offering APIs that let websites and apps add speech synthesis. What changed?

Much like the rise of mobile tech supplied the missing piece of the puzzle for early-2010s graphic design, machine learning is quickly finding applications across sectors of industry once considered dead ends. Elements of this were already visible in ‘chat bots’ and other applications that remembered user input to better understand later interactions – famous examples being Rollo Carpenter’s Cleverbot and Microsoft’s controversial Tay project. Whatever their flaws in practice, they made it apparent that artificial intelligence didn’t need true intelligence, so to speak; it needed to learn enough patterns to form a coherent response to something.

Fast forward a decade, and we now have applications that can learn human voice patterns and synthesize replicas of actual people. You might be thinking: so what, it’s just another text-to-speech gimmick. You might be right if speech synthesis existed on its own, just as ‘chat bots’ once did, but we already know from software like Siri, Alexa and Cortana that these elements are being combined. Developers can create applications that process human queries (e.g., questions or requests typed or spoken aloud) and return answers in a format we understand: intelligible written language and basic vocal responses. Things will only improve as speech synthesis benefits from advances in machine-learning techniques.
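The core idea – respond by recalling patterns from past exchanges rather than through any deeper understanding – can be sketched in a few lines. This is a toy illustration, not Cleverbot’s or Tay’s actual algorithm; the `PatternBot` name and the crude word-overlap matching are assumptions made purely for the example.

```python
# Toy sketch of a pattern-learning responder: it "learns" by storing
# (prompt, reply) pairs and answers new prompts by recalling the reply
# whose stored prompt shares the most words with the new one.
def tokens(text):
    return set(text.lower().split())

class PatternBot:
    def __init__(self):
        self.memory = []  # list of (prompt_tokens, reply) pairs

    def learn(self, prompt, reply):
        self.memory.append((tokens(prompt), reply))

    def respond(self, prompt):
        query = tokens(prompt)
        best, best_score = "I don't know yet.", 0
        for seen, reply in self.memory:
            score = len(query & seen)  # crude word-overlap similarity
            if score > best_score:
                best, best_score = reply, score
        return best

bot = PatternBot()
bot.learn("hello there", "Hi! How are you?")
bot.learn("what is the weather like", "Sunny, I hope.")
print(bot.respond("hello bot"))  # -> Hi! How are you?
```

No understanding is involved anywhere – the bot only accumulates surface patterns – yet with enough stored exchanges the replies start to look coherent, which is exactly the point being made above.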

The possibilities for human interaction really are exciting – not just for the future, but in the present. As stated earlier, companies like Google already offer analysis APIs such as Cloud Natural Language that you can start developing with right away. Pairing speech synthesis with this kind of analysis offers a bridge between the old world of desktop-oriented queries and the new desire for ‘seamless’ interaction with online information. No doubt we will see the benefits of this tech within the coming years, but if you have the knowledge and experience to work with this kind of software, the opportunities are already here.
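To make that bridge concrete, here is a minimal sketch of preparing an answer for a speech synthesis API. The `build_ssml` helper is hypothetical, but SSML itself is a real W3C markup that cloud TTS services accept; the request-body shape below mirrors Google’s Cloud Text-to-Speech REST API and should be treated as an assumption, since actually sending it requires credentials and a billing-enabled project.

```python
# Sketch: wrap an answer from an analysis step in SSML so a text-to-speech
# engine can control prosody, then shape it into a TTS request payload.
from xml.sax.saxutils import escape

def build_ssml(answer, rate="medium"):
    """Hypothetical helper: escape the text and wrap it in SSML prosody tags."""
    return f'<speak><prosody rate="{rate}">{escape(answer)}</prosody></speak>'

payload = {
    "input": {"ssml": build_ssml("It is 72 degrees and sunny.")},
    # Voice/audio settings mirror the shape of Google's Cloud
    # Text-to-Speech request body (assumption; not sent here).
    "voice": {"languageCode": "en-US"},
    "audioConfig": {"audioEncoding": "MP3"},
}
print(payload["input"]["ssml"])
```

The interesting part is how little glue is needed: an analysis API produces text, a few lines of markup shape how it should sound, and a synthesis API turns it into audio – which is the ‘seamless’ interaction described above.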