The Evolution of Voice
Ever since we first asked Siri to tell us the weather outside in February 2010 (and she did), the consumer and commercial applications for voice technology have been discussed, deliberated upon, and debated everywhere from bootstrapped offices at small start-ups to corporate boardrooms at Fortune 500 companies.
As CEO of Novel Effect, I have been in both of those rooms. Founded in 2015 by me, my wife Melissa, and our sister-in-law Melody Furze, Novel Effect began with a very simple mission: to make reading books aloud with your kids fun and engaging at a time when screens seemed to be taking time and attention away from that bonding experience.
Over the course of the next year, we built a screen-free, voice-driven storytelling platform that enhances story time by syncing background music and sound effects to a reader’s voice as a book is read aloud, creating a new interactive experience around a traditional print product. A little less than three years later, we closed our Series A funding round with Amazon and a select number of well-known VC firms and are now a leader in voice-driven media and entertainment.
Today, I regularly meet with executives across media, from cable distribution and streaming services to content studios and film production. Over the past year, I have seen a significant shift in the interest and commitment these companies are devoting to the development of voice-driven experiences. Because of this, I firmly believe that voice is at a turning point, driven by a consumer base that is ready to embrace all the possibilities voice interactivity presents.
Siri, Alexa, and Google Home have all been game changers in this space with the use cases for call-and-response voice interaction constantly being adapted for the convenience of the consumer. It still amazes me that you can turn off the lights in your home from 2,000 miles away with a simple voice command.
However, while the experiences are admittedly getting more sophisticated, the user experience is still very similar to the first time you asked Siri about the temperature.
The experience goes something like this:
What has changed over the past ten years is the user. Users have come to accept, and perhaps even expect, that voice-enabled technology should be a part of their everyday lives. In large part, this tracks our growing dependence on our devices.
According to a recent Pew study, 77% of Americans have smartphones, with nearly one-third of American households having at least three. A Nielsen study further notes that 25% of households now have one smart speaker, with 40% of those households owning more than one. With this proliferation of device ownership, it isn’t surprising to hear (no pun intended) that by the year 2020, 55% of all searches will be voice-based.
Beyond search, there are real impacts for advertising, commerce, and content. At the most basic levels, we’ve seen higher engagement with existing content as users play podcasts, music channels, and audiobooks through their in-home assistants. The more forward-thinking developers have started to adapt existing audio content into more interactive activities for Alexa Skills and Google Actions.
Ultimately, thinking creatively about content is what will open the door for publishers, game developers, animation studios, and others to drive innovation in the voice space and grow beyond the home assistant call-and-response experience. When asked about the future of voice, I often challenge creators and developers to think about voice interactivity as a marketing and brand development tool, and to consider how they can create engaging voice experiences that can be shared with friends and family to drive brand awareness and loyalty.
To think creatively about the voice space, it is helpful to understand a bit about how voice recognition works.
The Science of Voice
Excuse me while I get a bit technical for a minute on the complex subject area of speech recognition, the concepts of phones and phonemes, and the difference between our brains and computers. I promise to keep it light!
When I say a word out loud, my voice generates sounds that correspond to the letters (or groups of letters) in the word. These sounds are called phones. If I were to say the word “boy,” the phones produced would correspond to the sounds for “b,” “o,” and “y.”
Related to this is the concept of phonemes, which are the basic sound building blocks that all words are built from. Right about now, you are probably thinking that there isn’t a huge difference between phones and phonemes. But that is because our brains don’t really differentiate between the two.
When we listen to speech, our brains skip right to the phones and turn them back into words, sentences, thoughts, and ideas, sometimes even anticipating what people are going to say before they finish getting their thoughts out.
Even with the rise of machine learning, deep learning, and natural language processing capabilities, computers are still learning how to skip and anticipate the intent of the words being spoken. The computers learn by absorbing the phonemes as well as the phones. They have gotten pretty good at this and are working toward gaining that intent knowledge.
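To make the phone and phoneme idea a little more concrete, here is a minimal sketch in Python. It is a toy, not how any real speech recognizer works: the phone labels, the two lookup tables, and the function names are all hypothetical and hand-written for illustration. The phoneme symbols loosely follow the ARPAbet style used by pronunciation dictionaries. The point is simply that several slightly different spoken sounds (phones) can collapse into the same building block (phoneme), and that a sequence of building blocks can then be matched to a word.

```python
# Toy illustration only. Every table and label here is a hypothetical,
# hand-written assumption, not data from a real speech recognition system.

# Several slightly different phones (the sounds actually spoken) can map
# to the same phoneme (the sound building block the language cares about).
PHONE_TO_PHONEME = {
    "k": "K",
    "k_aspirated": "K",   # a "k" said with a puff of air
    "b": "B",
    "ae": "AE",
    "ae_nasal": "AE",     # an "a" said slightly through the nose
    "t": "T",
    "t_unreleased": "T",  # a "t" that is cut short at the end of a word
}

# A tiny pronunciation lexicon: phoneme sequence -> word.
LEXICON = {
    ("K", "AE", "T"): "cat",
    ("B", "AE", "T"): "bat",
    ("T", "AE", "B"): "tab",
}


def phones_to_phonemes(phones):
    """Collapse each spoken phone into its phoneme category."""
    return tuple(PHONE_TO_PHONEME[phone] for phone in phones)


def recognize(phones):
    """Return the word for a phone sequence, or None if it is not in the lexicon."""
    return LEXICON.get(phones_to_phonemes(phones))


if __name__ == "__main__":
    # Two different ways of saying the same word land on the same phonemes.
    print(recognize(["k_aspirated", "ae", "t_unreleased"]))
    print(recognize(["k", "ae_nasal", "t"]))
    # A sequence with no matching word in the toy lexicon.
    print(recognize(["b", "ae", "b"]))
```

In this sketch the first two calls both come back as "cat," even though the speaker produced different phones each time, and the last call comes back empty because the toy lexicon has no word for that sequence. Real systems learn these relationships statistically from large amounts of audio rather than from lookup tables, which is where the machine learning comes in.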
Copyright © 2020 Amnet. All rights reserved.