Building a voice assistant to control music

Speech recognition is a hard problem that has been well researched over the years, and advances in machine learning have driven a lot of recent progress. Evidence of this can be found in the technology we use daily: Apple's Siri and Amazon's Echo, to name a few. While APIs such as those provided by Google and IBM's Watson exist to serve professional needs, a relatively new addition to evolving browser technology is the Web Speech API.

In this article we will build a custom voice assistant to control music using web technologies. We will be using the SpeechRecognition API, Google Chrome (the only browser that supports this API at the time of writing), annyang (a lightweight voice command parsing library), Node.js to provide an API for commands, and Clementine (a cross-platform audio player).

The SpeechRecognition API

Before we jump into the API description there are a couple of important points to note:

  1. The API is only supported by Google's Chrome browser at the time of writing.
  2. An active internet connection is required to use it. The service URI can be altered, but by default the user agent's default speech service is used. Chrome most likely uses Google's speech API (although there is no visible outgoing network request to confirm this).
  3. The API is marked as experimental and may change in the future.
  4. On activation, the browser will ask for permission to use the microphone. Check the settings to ensure that this permission is not blocked.
  5. All code samples use ES6 features, since a modern browser is required here in any case.

The API is rather well documented, and using it is as straightforward as creating a new SpeechRecognition object, adding an onresult event handler and calling the start() method whenever you are ready to listen:


const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

recognition.onresult = (event) => {
  // event is a SpeechRecognitionEvent and it holds all the lines captured so far

  // get the index of the current result
  const current = event.resultIndex;

  // get the recognized text
  const transcript = event.results[current][0].transcript;

  console.log(transcript);
};

recognition.start();

Note that once you start voice recognition, it listens for as long as there is speech and then processes the audio after a few seconds of silence. There is also a maximum length of time for which it will listen - around 60 seconds by rough estimate.
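
The recognition object also exposes a few properties that can be set before calling start() to tweak what comes back. A quick sketch (the values shown here are purely illustrative):

recognition.lang = 'en-US';         // the language to recognize
recognition.interimResults = false; // set to true to also receive partial results while speaking
recognition.maxAlternatives = 1;    // number of alternative transcripts per result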

By default, the SpeechRecognition object listens for a single chunk of audio, but it can be configured to listen continuously via its continuous property. In practice, however, the speech recognition would stop after a maximum of 5 minutes even with this flag set to true. Therefore, to enable continuous listening we can leave the flag at false and simply restart the speech recognition in the onend event handler:


recognition.onstart = () => {
  console.log('voice recognition activated');
}

recognition.onend = () => {
  console.log('voice recognition ended');
  // restart right away so that we effectively keep listening indefinitely
  recognition.start();
}

recognition.onerror = (event) => {
  console.error(event);
}

Command parsing

Now that we have the text of what was said, we can look into building a command parser. Before we do, a quick note on the accuracy of today's speech recognition: it is extremely good. At least on Chrome, basic phrases were accurately recognized even with music playing on the laptop and background voices in an office setting!

However, you may need to repeat a command a few times if it sounds very similar to other words, so it is essential to use phonetically distinct words for commands. Moreover, try to avoid proper names such as Berlin, San Francisco, Donald, etc., as the error rate on these is high.

Writing a basic command parser can be as simple as an if (transcript.toLowerCase().includes('music play')) { ... } check on the recognized text, for example:
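
// a minimal sketch of the naive approach, reusing the transcript from the
// onresult handler above (playMusic is a hypothetical placeholder)
if (transcript.toLowerCase().includes('music play')) {
  playMusic();
}

Or we can use something more structured like the annyang library: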


if (annyang) {
  // Let's define our first command. First the text we expect, and then the function it should call.
  // :action is a named variable and its value will get passed to the handler
  const commands = {
    'music :action': (action) => {
      console.log(action);
    }
  };

  // Add our commands to annyang
  annyang.addCommands(commands);

  // Start listening. You can call this here, or attach this call to an event, button, etc.
  annyang.start();
}

Note that annyang takes over the SpeechRecognition API setup as well, so you can get started really quickly! Check out the annyang docs for more ideas.
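
For instance, a few other annyang calls that come in handy during development (all from the annyang docs):

annyang.debug();              // log recognition events to the console
annyang.setLanguage('en-US'); // set the recognition language
annyang.pause();              // stop listening without tearing everything down
annyang.resume();             // pick up listening again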

Controlling music

While a voice interface is great for improving accessibility, if you are sitting at the computer then a few keyboard shortcuts will likely get the job done faster.

Personally, however, I find it useful to be able to control music while cooking or moving around the room, without needing to go over to the computer. Fortunately, my music player of choice is Clementine, which is cross-platform and also provides player control via command-line parameters.

Thus, all we need now is to somehow translate the recognized voice command into an execution of the clementine binary with the appropriate command-line parameters. Enter Node.js - let's whip up a really quick HTTP endpoint using express:


const express = require('express');

const api = express();

api.get('/music/:action', (req, res) => {

  // execute the appropriate player command here (filled in below)

  res.sendStatus(200);
});

api.listen(3001);
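
One caveat: even on localhost, a page served from a different port counts as a different origin, so the browser will block our fetch calls unless the API responds with CORS headers. A minimal sketch using the cors middleware (assuming it has been installed via npm install cors):

const cors = require('cors');

api.use(cors()); // allow cross-origin requests from the speech recognition page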

With that in place, our speech recognition webpage running on localhost can make HTTP calls to the API service, also running on localhost:


// annyang commands
const commands = {
  'music :action': (action) => {
    const url = 'http://localhost:3001/music/' + action;
    // a plain GET request with no body needs no extra headers
    return fetch(url)
      .then(() => {
        console.log('executed action');
      })
      .catch((err) => {
        console.error('failed to execute action', err);
      });
  }
};

Within our API endpoint handler, we will use Node's child_process module to fire up the clementine binary with the appropriate command-line options:


const { spawn } = require('child_process');

api.get('/music/:action', (req, res) => {

  const options = parseOptions(req.params.action.toLowerCase());

  // run clementine detached so it is not tied to the API process
  const command = spawn('clementine', options, {
    detached: true,
    stdio: 'ignore'
  });

  command.on('close', (code, signal) => {
    const optionsStr = options && options.length ? options.join(' ') : '';
    console.log(`finished executing clementine with ${optionsStr}`);
  });

  command.unref();

  res.sendStatus(200);
});

where parseOptions takes an action string and returns the appropriate command-line parameters:


const parseOptions = (action) => {
  let options = [];
  let flag = '';

  switch (action) {
    case 'play':
      flag = '-p';
      break;
    case 'stop':
      flag = '-s';
      break;
    case 'pause':
      flag = '-u';
      break;
    case 'volume-up':
      flag = '--volume-up';
      break;
    case 'volume-down':
      flag = '--volume-down';
      break;
    case 'next':
      flag = '-f';
      break;
    case 'previous':
      flag = '-r';
      break;
  }

  if (action.startsWith('set-volume')) {
    // e.g. 'set-volume-50' -> ['-v', '50']; the flag and its value must be
    // separate argv entries for spawn
    const split = action.split('-');
    options = ['-v', split[2]];
  }

  if (action.startsWith('seek-by')) {
    // e.g. 'seek-by-30' -> ['--seek-by', '30']
    const split = action.split('-');
    options = ['--seek-by', split[2]];
  }

  if (options.length) {
    return options;
  }

  // an unrecognized action yields no options at all
  return flag ? [flag] : [];
}
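
A few sample mappings to sanity-check the parsing (the strings on the left are what annyang passes through as the action):

parseOptions('play');          // ['-p']
parseOptions('volume-up');     // ['--volume-up']
parseOptions('set-volume-50'); // ['-v', '50']
parseOptions('seek-by-30');    // ['--seek-by', '30']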

And voilà, we can now control our music player just by shouting "music play" or "music next" at it! As mentioned earlier, even from a bit of a distance and with the music playing at a decent volume, the speech recognition is uncannily accurate. The only price to pay is Google's A.I. judging you based on your taste in music :)

More features

The Web Speech API also provides a text-to-speech service which was not covered here, but it is also relatively straightforward to use. There were some issues with the text being cut off after 200 characters, but those can be fixed with a few hacks :)
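
A minimal sketch of speaking a phrase (the phrase itself is just an example):

const utterance = new SpeechSynthesisUtterance('Now playing music');
utterance.lang = 'en-US';
window.speechSynthesis.speak(utterance);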

You can find the source code for the frontend on GitHub. It is built using React and uses a very simple custom command parser, because I did not know about annyang at the time of coding it. The source code for the backend is split into a few different microservices and will be published soon - most of it is already provided in the samples above.

This article focused on controlling music, but the possibilities are endless. Ask your computer how much time you have to get to the train before you jump into the shower, control IoT-connected devices, etc.

If you do come up with some great ideas, or just want some more help with the above, please do get in touch!