Speech recognition is a hard problem that has been researched for decades, and advances in machine learning have driven much of the recent progress. Evidence of this can be found in the technology we use daily: Apple's Siri and Amazon's Echo, to name a few. While APIs such as those provided by Google and IBM's Watson exist to serve professional needs, a relatively new addition to the evolving browser platform is the Web Speech API.
In this article we will build a custom voice assistant to control music using web technologies. We will be using the SpeechRecognition API, Google Chrome (the only browser that supports this API at the time of writing), annyang (a lightweight voice command parsing library), Node.js to provide an API for commands, and Clementine (a cross-platform audio player).
The SpeechRecognition API
Before we jump into the API description there are a couple of important points to note:
- The API is only supported by Google's Chrome browser at the time of writing.
- An active internet connection is required to use it. The service URI can be altered, but by default it uses the user agent's default speech service. Chrome most likely uses Google's speech service (though there is no visible outgoing network request to confirm this).
- The API is marked as experimental and may change in the future.
- On activation, the browser will ask for permission to use the microphone. Check the settings to ensure that this permission is not blocked (a quick programmatic check is sketched after this list).
- All code samples use ES6 features since a modern browser is required here in any case.
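As a quick sanity check before wiring anything up, you can feature-detect the API and peek at the microphone permission state. A minimal sketch; note that querying the 'microphone' permission is Chrome-specific and may not work in other browsers:
if (!('SpeechRecognition' in window) && !('webkitSpeechRecognition' in window)) {
  console.warn('SpeechRecognition is not supported in this browser');
}

// Chrome also lets us inspect the microphone permission up front
navigator.permissions.query({ name: 'microphone' })
  .then((status) => console.log('microphone permission:', status.state))
  .catch(() => console.log('permission query not supported here'));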
The API is rather well documented here, and using it is as straightforward as creating a new SpeechRecognition object, adding an onresult event handler, and calling the start() method whenever you are ready to listen:
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

recognition.onresult = (event) => {
  // event is a SpeechRecognitionEvent and holds all the results captured so far
  // get the index of the latest result
  const current = event.resultIndex;
  // get the recognized text
  const transcript = event.results[current][0].transcript;
  console.log(transcript);
};

recognition.start();
Note that when you start the voice recognition, it listens for as long as there is some speech and then processes the audio after a few seconds of silence. There is also a maximum length of time for which it will listen - around 60 seconds by rough estimates.
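If waiting for the silence detection feels too slow, the API can also deliver partial transcripts while you are still speaking. A small sketch tweaking the handler from above to use the interimResults flag:
recognition.interimResults = true;

recognition.onresult = (event) => {
  const result = event.results[event.resultIndex];
  // isFinal tells us whether this is still a guess or the settled transcript
  const label = result.isFinal ? 'final' : 'interim';
  console.log(label, result[0].transcript);
};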
By default, the SpeechRecognition object will only listen to one chunk of audio, but it can be configured to listen continuously via its continuous property. In practice, however, it was noticed that the speech recognition would stop after a maximum of about 5 minutes even with this flag set to true. Therefore, to enable continuous listening, we can leave the flag at false and simply restart the speech recognition in the onend event handler:
recognition.onstart = () => {
  console.log('voice recognition activated');
};

recognition.onend = () => {
  console.log('voice recognition ended');
  // restart immediately so we keep listening indefinitely
  recognition.start();
};

recognition.onerror = (event) => {
  console.error(event);
};
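One caveat with this restart pattern: if recognition keeps failing (for example, because microphone access was denied), onend fires again immediately and we would spin in a restart loop. A small guard, refining the handlers above; the error codes are the ones the API defines:
let stopped = false;

recognition.onerror = (event) => {
  console.error(event.error);
  // these errors will not recover by simply retrying
  if (event.error === 'not-allowed' || event.error === 'service-not-allowed') {
    stopped = true;
  }
};

recognition.onend = () => {
  if (!stopped) {
    recognition.start();
  }
};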
Command parsing
Now that we have the text of what was said, we can look into building a command parser. Before we do, a quick note on the accuracy of today's speech recognition: it is extremely good. At least on Chrome, basic phrases were accurately recognized even with music playing on the laptop and background voices in an office setting!
However, you may need to repeat a command a few times if it sounds very similar to other words. It is therefore essential to use phonetically distinct words for commands. Moreover, try to avoid proper names such as Berlin, San Francisco, Donald, etc., as the error rate on these is high.
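If you cannot avoid a confusable word, a small alias map that folds common misrecognitions back onto the intended command can help. A sketch; the alias entries below are made-up examples, so tune them to whatever your recognizer actually mishears:
// hypothetical mishearings observed during testing; adjust to your own
const aliases = {
  'paws': 'pause',
  'plays': 'play',
  'text': 'next'
};

const normalize = (word) => aliases[word.toLowerCase()] || word.toLowerCase();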
Writing a basic command parser can be as simple as an if (transcript.toLowerCase().includes('music play')) { ... }, or something more structured like the annyang library:
if (annyang) {
  // Define our first command: first the phrase we expect, then the function it should call.
  // :action is a named variable and its spoken value gets passed to the handler.
  const commands = {
    'music :action': (action) => {
      console.log(action);
    }
  };

  // Add our commands to annyang
  annyang.addCommands(commands);

  // Start listening. You can call this here, or attach this call to an event, button, etc.
  annyang.start();
}
Note that annyang takes over the SpeechRecognition API setup for you as well, so you can get started really quickly! Check out the annyang docs for more ideas.
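annyang also exposes debugging helpers and lifecycle callbacks that make it easier to see why a phrase did not match. A short sketch based on annyang's documented callback API:
// log annyang's internal activity to the console
annyang.debug();

// called when speech was recognized but matched none of our commands
annyang.addCallback('resultNoMatch', (phrases) => {
  console.log('no command matched, heard:', phrases);
});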
Controlling music
While a voice interface is great for improving accessibility, if you are sitting at a computer you will likely be faster with a few keyboard shortcuts. Personally, however, I find it useful to be able to control music while I am cooking or elsewhere in the room, without needing to walk to the computer. Fortunately, my music player of choice is Clementine, which is cross-platform and also provides player control via command-line parameters.
All we need now is to convert the recognized voice command into an invocation of the clementine binary with the appropriate command-line parameter. Enter Node.js: let's whip up a really quick HTTP endpoint using express:
const express = require('express');

const api = express();

api.get('/music/:action', (req, res) => {
  // do something here
  ...
  res.sendStatus(200);
});

api.listen(3001);
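One wrinkle before we wire the two together: the recognition page and this API run on different ports, which browsers treat as different origins. A minimal sketch allowing cross-origin requests, assuming the cors middleware package is installed (any other CORS setup works just as well):
const cors = require('cors');

// allow the speech recognition page (served from a different origin) to call us
api.use(cors());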
Our speech recognition webpage is running on localhost and, with cross-origin requests allowed as above, can make HTTP calls to the service:
// annyang commands
const commands = {
  'music :action': (action) => {
    const url = 'http://localhost:3001/music/' + action;
    // a plain GET is all we need; no request body or headers required
    return fetch(url)
      .then(() => {
        console.log('executed action');
      })
      .catch((err) => {
        console.error('failed to execute action', err);
      });
  }
};
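Note that the :action variable only captures a single spoken word, so "music play" matches but "music set volume 50" would not. annyang's splat syntax can swallow the rest of the phrase, which we can then fold into the dash-separated action format our API expects. A sketch extending the commands above, assuming Chrome transcribes the number as digits:
const volumeCommands = {
  // *rest captures everything spoken after 'set volume', e.g. '50'
  'music set volume *rest': (rest) => {
    const action = 'set-volume-' + rest.trim().split(/\s+/).join('-');
    return fetch('http://localhost:3001/music/' + action);
  }
};

annyang.addCommands(volumeCommands);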
Within our API endpoint handler, we will use Node's child_process module to fire up the clementine binary with the appropriate command-line parameters:
const { spawn } = require('child_process');

api.get('/music/:action', (req, res) => {
  const options = parseOptions(req.params.action.toLowerCase());
  // spawn clementine detached so it is not tied to the lifetime of our API process
  const command = spawn('clementine', options, {
    detached: true,
    stdio: 'ignore'
  });
  command.on('close', (code, signal) => {
    const optionsStr = options && options.length ? options.join(' ') : '';
    console.log(`finished executing clementine with ${optionsStr}`);
  });
  command.unref();
  res.sendStatus(200);
});
where parseOptions takes an action string and returns the appropriate command-line parameters:
const parseOptions = (action) => {
let options = [];
let flag = '';
switch (action) {
case 'play':
flag = '-p';
break;
case 'stop':
flag = '-s';
break;
case 'pause':
flag = '-u';
break;
case 'volume-up':
flag = '--volume-up';
break;
case 'volume-down':
flag = '--volume-down';
break;
case 'next':
flag = '-f';
break;
case 'previous':
flag = '-r';
break;
}
if (action.startsWith('set-volume')) {
const split = action.split('-');
// each command-line token must be a separate array element for spawn
options = ['-v', split[2]];
}
if (action.startsWith('seek-by')) {
const split = action.split('-');
options = ['--seek-by', split[2]];
}
if (options.length) {
return options;
}
return flag ? [flag] : [];
}
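A quick way to convince yourself the mapping is right is to log a few sample actions:
console.log(parseOptions('play'));          // ['-p']
console.log(parseOptions('volume-up'));     // ['--volume-up']
console.log(parseOptions('set-volume-50')); // ['-v', '50']
console.log(parseOptions('seek-by-10'));    // ['--seek-by', '10']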
And voilà, we can now control our music player just by calling out "music play" or "music next"! As mentioned earlier, even from a bit of a distance and with the music playing at a decent volume, the speech recognition is uncannily accurate. The only price to pay is Google's AI judging you based on your taste in music :)
More features
The Web Speech API also provides a text-to-speech service which was not covered here; it is similarly straightforward to use. There were some issues with the text being cut off after 200 characters, but that can also be fixed with a few hacks :)
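For the curious, here is how short the happy path is, using the browser's built-in SpeechSynthesis interface:
// make the browser read a short confirmation out loud
const utterance = new SpeechSynthesisUtterance('Now playing music');
window.speechSynthesis.speak(utterance);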
You can find the source code for the frontend on GitHub. It is built using React and uses a very simple custom command parser because I did not know about annyang at the time of coding it. The source code for the backend is split into a few different microservices and will be published soon - most of it is already provided in the samples above.
This article focused on controlling music, but the possibilities are endless: ask your computer how much time you have to catch the train before you jump into the shower, control IoT-connected devices, etc.
If you do come up with some great ideas, or just want some more help on the above, please do get in touch!