This article may contain the personal views and opinions of the author.
I love voice control. Let's just get that out there right from the start. I may be a writer, meaning I am at my best conveying my thoughts through the written word rather than speaking on the spot; but I am also a lazy man, and I like to get things done with a minimum of interaction with my computing devices. As such, I can't help but wonder: why can't Google Now, Siri, and Cortana offer full voice control?
As already mentioned, I love voice control. It is one of the main reasons why I traded in my Nexus 5 for a Moto X: I wanted the Touchless Controls. And, as much as I love Touchless Controls on my Moto X, I can't help but want more, because there is a limit to what I can accomplish with voice commands alone. There is a huge assortment of options for voice commands: I can send emails and texts, navigate to websites, ask questions, get directions, set alarms and reminders, play music, and plenty more. The trouble is that once that first command is done, voice commands have nothing left to offer me.
One of the best innovations of recent years is Google's conversational speech recognition in Search. From a technical standpoint, it means that Google can understand pronouns and connect them to previous requests. So, if you ask about Kawhi Leonard in one voice action, then ask a follow-up question using the pronoun "him", Google will understand and give you the information that you want. That is an amazing piece of tech that most don't fully appreciate: it creates a back-and-forth with your device that feels natural. Unfortunately, that back-and-forth doesn't extend into more useful scenarios.
It's nice to be able to run follow-up commands, but the current implementation is fairly limited. I simply don't have many occasions to ask a follow-up question about a person or place; I would much rather be able to continue a device command in that same conversational way. The problem is that other voice commands don't offer similar follow-up options. For example, let's say that I ask my Moto X to play a song by Me'Shell Ndegéocello, because I haven't yet had a chance to listen to her new album. That first request should go through without a hitch (assuming I can pronounce her name correctly; otherwise, I'll just opt for a name that's safer for voice recognition, like Gregory Porter). The trouble is that once the music starts, my voice command options run dry. All I can do from there is submit a voice command to play another artist or song. What I really want is to be able to tell my device to do one of a multitude of things, like "pause", "next track", "lower/raise volume", or "repeat track". Unfortunately, I can't.
I don't really understand why I can't do this. From a technical standpoint, there are almost no barriers to allowing me this sort of full voice control over my device. Starting with voice recognition, we're golden: all voice command systems can understand simple words like play, pause, next, previous, and repeat. As for a touchless trigger, that's possible too. Google has recently expanded its hotwords to allow the "OK, Google" command to be initiated from anywhere. There are rumors that the next iPhone will offer similar functionality for Siri, and there's no reason why Cortana couldn't do the same for Windows Phone users. Always-on listening is becoming the norm, so that shouldn't be an issue.
I can understand that more voice interaction would likely mean more drain on the battery, which is always a point of concern for manufacturers; but it seems like a problem with a relatively easy solution. An "always-listening" device is already possible, especially when it has a companion core or optimized processor (anything from a Snapdragon 800 and newer) dedicated specifically to listening for voice commands. That takes care of the battery issue. The other side of the equation should be a simple API, at least to get things started.
That is what Ubuntu Touch is planning to implement. Once you're inside an app, there is a fairly limited selection of commands that one might want to use via voice. News apps and other reading apps might not have much use for voice commands, but even implementing simple ones, like "back", "scroll down/up", "search", and "share to...", would add a wealth of functionality to the vast majority of apps. Once you jump into apps that have more options for standard voice commands, like media consumption apps, the possibilities become much clearer. Imagine having full media controls by voice, like "play/pause", "next/previous", "rewind/fast forward", "volume up/down", or even "skip to (time)". Even dynamic commands shouldn't pose a problem, because in-app commands will mostly be one or two words, many of which would overlap between apps, allowing for easier implementation of a standard API; and, as mentioned before, recognizing those commands shouldn't be a problem.
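To make the idea concrete, here is a minimal, purely hypothetical sketch (not any real platform's API, and every name in it is invented for illustration) of what such a standard command registry could look like: apps hook handlers into a shared vocabulary of short phrases, and recognized speech is simply routed to whichever handler matches.

```python
# Hypothetical sketch of a shared in-app voice command registry.
# The class and method names here are invented for illustration only;
# no platform currently exposes an API like this.

class VoiceCommandRegistry:
    def __init__(self):
        self._handlers = {}

    def register(self, phrase, action):
        """An app hooks a handler into a standard one- or two-word phrase."""
        self._handlers[phrase.lower()] = action

    def dispatch(self, recognized_text):
        """Route recognized speech to a matching handler, if one exists."""
        action = self._handlers.get(recognized_text.strip().lower())
        if action is None:
            return False  # unrecognized: could fall back to global search
        action()
        return True

# A media app registers the standard playback commands it supports:
player_state = {"playing": True}
registry = VoiceCommandRegistry()
registry.register("pause", lambda: player_state.update(playing=False))
registry.register("play", lambda: player_state.update(playing=True))

registry.dispatch("Pause")   # returns True; player_state["playing"] is now False
registry.dispatch("next")    # returns False; this app never registered "next"
```

Because the vocabulary is small and shared, a news app could register "scroll down" while a music app registers "pause", and the recognizer itself never needs to know anything app-specific; it only has to match a handful of short, common phrases.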
Who does it first?
It's not like this sort of functionality is completely new. Windows 7 and 8 offer much broader voice command functionality, allowing for full navigation of the screen with voice commands alone. Many would say that's the desktop, and that mobile is a different world with more limited options, but that sort of thinking doesn't hold as true anymore. Mobile platforms are becoming more and more advanced, and are bridging the functionality gap with desktops in many ways. One of the big plans for Ubuntu Touch has been to allow for wider voice commands within apps. One of the first demos that Canonical showed had the standard items in a dropdown menu being actionable via voice, meaning in-app search, and commands like "open", "save", "crop", etc.
Canonical has not yet gotten that functionality working in Ubuntu Touch, but frankly, there is still a lot in Ubuntu Touch that doesn't yet work to its full potential. My question is about the established platforms. Sure, Google and Apple continue to expand the functionality of Google Now and Siri, respectively, and Microsoft looks to be coming out of the gate with an impressive feature set for Cortana; but none appear to have any plans to offer full voice control, which is pretty disappointing. The best we can hope for right now is a back-and-forth conversation to make sure that your voice command is handled properly and that all of the relevant information is included, as with calendar events or reminders.
In the end, we're definitely going to get full voice control; it's more a matter of who implements it first. As mentioned, Microsoft has it working in Windows, but not Windows Phone. Microsoft has stated intentions to bring "Kinect-like" control to its platforms, but there is no way to tell the timeline for those features; it seems most likely to land in Windows Phone 9, which is expected next year. Canonical is building it for Ubuntu, but it isn't ready yet. Apple hasn't given any outward appearance that it even has this functionality on its radar, but it seems likely that it is at least in R&D. Samsung is a different case: it already offers some features like this, and S Voice is powered by Nuance, which is also behind Siri's voice recognition. Obviously the capabilities are there, but Samsung (not surprisingly) has limited the features to its own apps, rather than making them global on its devices. That just leaves Google.
In various Android Wear videos, Google has teased that an expansion of voice commands is on the way. One video showed someone on a bike using a command like "OK Google, open the garage door". Unfortunately, it's hard to tell what this means. It could be that Google will be opening up voice commands to developers, allowing for deeper integration into apps and letting developers create custom voice actions. It seems more likely that it will be a new set of standard actions that apps can hook into, much like how the standard "note to self" command can be used with email, Keep, Evernote, and other apps. Google has also shown an option to say "OK Google, call me a car" and then choose an app to handle that request. The first option could lead to a lot more functionality, although it would be something of a mess; the latter would keep functionality more limited, but more consistent. Either way, it does look like Google will be the first to add more full-featured voice control.
The "What?" and "Why?" are easy: full voice control, because we all want to live in Star Trek. The "How?" also seems to be answered: always-listening and APIs. The answer to "Who?" is really everyone, but it does look like Google will be the first out of the gate to offer full voice control. So, that just leaves the last question: "When?"
Given what Google has teased, it's hard to see full voice control starting to roll out before the end of this year. The functionality would probably need to be part of Android L, and Google made no real mention of it during the I/O keynote. This kind of deeper integration into apps would need to happen at the system level, not just through Android's app handler calls, though it does seem like Google may at least be laying the foundations for full voice control. Unfortunately, regardless of your platform of choice, it is likely that full voice control isn't in the cards until 2015 at the earliest. I'm a patient person, but that seems like a long time to wait for a feature that should already be in the works at all of the big platforms.