Introducing the Web Speech API

Aurelio De Rosa

After receiving my bachelor’s degree, I began working in a group called NLP. As the name implies, we focused on Natural Language Processing (NLP) technologies. At the time, two of the most popular technologies to work with were the VoiceXML standard and Java applets. Both had issues. The first was supported only by Opera. The second, which we used to send data to the server and execute an action based on the command spoken by the user, required a lot of code and time. Today things are different. Thanks to the introduction of a dedicated JavaScript API, working with speech recognition has never been easier. This article will introduce you to this API, known as the Web Speech API.

Speech recognition has several real-world applications. Many more people have become familiar with the concept thanks to software like Siri and S-Voice. These applications can drastically improve the way users, especially those with disabilities, perform tasks. On a website, users could navigate pages or populate form fields using their voice. Users could also interact with a page while driving, without taking their eyes off the road. These are not trivial use cases.

What is the Web Speech API?

The Web Speech API, introduced at the end of 2012, allows web developers to provide speech input and text-to-speech output features in a web browser. Typically, these features aren’t available when using standard speech recognition or screen reader software. This API also takes care of users’ privacy: before a website can access the voice via the microphone, the user must explicitly grant permission. Interestingly, the permission request is the same as that of the getUserMedia API, although no webcam is needed. If the page that runs this API uses the HTTPS protocol, the browser asks for permission only once; otherwise it asks every time a new recognition process starts.

The Web Speech API defines a complex interface, called SpeechRecognition, whose structure is defined in the specification. This article won’t cover all the properties and methods described in the specification for two main reasons. The first is that the interface is too complex to cover in a single article. The second is that, as we’ll see in the next sections, only one browser currently supports this API, and its implementation is very limited. Therefore, we’ll cover only the implemented methods and properties.

The specification asserts that the API itself is agnostic of the underlying speech recognition and synthesis implementation, and can support both server-based and client-based/embedded recognition and synthesis. It allows two types of recognition: one-shot and continuous. In the first type, the recognition ends as soon as the user stops talking, while in the second it ends only when the stop() method is called. In the latter case, we can still allow our users to end the recognition by attaching a handler that calls the stop() method (via a button, for example). The results of the recognition are provided to our code as a list of hypotheses, along with other relevant information for each hypothesis.
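Here’s a minimal sketch of continuous recognition that the user can end manually. It uses Chrome’s prefixed constructor (discussed in the Browser Compatibility section) and assumes a hypothetical button with the id stop-button exists on the page:

var recognizer = new webkitSpeechRecognition();

// One-shot is the default; setting continuous to true keeps the
// recognizer listening until stop() is called
recognizer.continuous = true;
recognizer.start();

// Let the user end the continuous recognition manually
document.getElementById('stop-button').addEventListener('click', function() {
  recognizer.stop();
});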

Another interesting feature of the Web Speech API is that it allows you to specify a grammar object. Explaining in detail what a grammar is lies beyond the scope of this article; you can think of it as a set of rules for defining a language. The advantage of using a grammar is that it usually leads to better results because it restricts the set of possible utterances.
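To give you an idea of how a grammar could be attached, here is a sketch that builds on the recognizer from the previous snippet. It uses the prefixed webkitSpeechGrammarList constructor and the JSGF format; treat it as illustrative only, since Chrome’s current implementation accepts a grammar list but doesn’t appear to act on it:

// A simple JSGF grammar that restricts the language to three color names
var grammar = '#JSGF V1.0; grammar colors; public <color> = red | green | blue;';
var grammarList = new webkitSpeechGrammarList();

// The second argument is the grammar's weight, a value between 0 and 1
grammarList.addFromString(grammar, 1);
recognizer.grammars = grammarList;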

This API may not surprise you, given the already implemented x-webkit-speech attribute introduced in Chrome 11. The main differences are that the Web Speech API allows you to see results in real time and to utilize a grammar. You can read more about this attribute by taking a look at How to Use HTML5 Speech Input Fields.

Now that you should have a good overview of what this API is and what it can do, let’s examine its properties and methods.

Methods and Properties

To instantiate a speech recognizer, use the SpeechRecognition() constructor as shown below:

var recognizer = new SpeechRecognition();

Along with the start() and stop() methods used to begin and end a recognition session, this object exposes the following event handlers (a brief example of attaching them follows the list):

  • onstart: Sets a callback that is fired when the recognition service has begun to listen to the audio with the intention of recognizing.
  • onresult: Sets a callback that is fired when the speech recognizer returns a result. The event must use the SpeechRecognitionEvent interface.
  • onerror: Sets a callback that is fired when a speech recognition error occurs. The event must use the SpeechRecognitionError interface.
  • onend: Sets a callback that is fired when the service has disconnected. The event must always be generated when the session ends, no matter what the reason.
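For example, attaching handlers to the recognizer created above is as simple as assigning functions to these properties:

recognizer.onstart = function() {
  console.log('Recognition started');
};

recognizer.onerror = function(event) {
  console.error('Recognition error: ' + event.error);
};

recognizer.onend = function() {
  console.log('Recognition session ended');
};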

In addition to these handlers, we can configure the speech recognition object using the following properties (see the example after the list):

  • continuous: Sets the type of the recognition (one-shot or continuous). If its value is set to true we have a continuous recognition, otherwise the process ends as soon as the user stops talking. By default it’s set to false.
  • lang: Specifies the recognition language. By default it corresponds to the browser language.
  • interimResults: Specifies if we want interim results. If its value is set to true we’ll have access to interim results that we can show to the users to provide feedback. If the value is false, we’ll obtain the results only after the recognition ends. By default it’s set to false.
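For example, the following configures the recognizer for continuous recognition of Italian speech with interim results enabled (the language code is purely illustrative):

recognizer.continuous = true;
recognizer.lang = 'it-IT';
recognizer.interimResults = true;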

As the argument to the result event handler, we receive an object of type SpeechRecognitionEvent. This object contains the following data (put to use in the example after the list):

  • results[i]: An array containing the results of the recognition. Each element corresponds to a recognized portion of speech rather than a single word.
  • resultIndex: The index of the first result that changed in the current recognition event.
  • results[i].isFinal: A Boolean that indicates whether the result is final or interim.
  • results[i][j]: A 2D array containing the alternatives for a recognized result. The first element is the most probable alternative.
  • results[i][j].transcript: The text representation of the recognized speech.
  • results[i][j].confidence: The probability of the result being correct. The value ranges from 0 to 1.
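Putting this structure to work, a minimal onresult handler that logs only final results might look like the following (the full demo below also handles interim results):

recognizer.onresult = function(event) {
  // Start from the first result that changed in this event
  for (var i = event.resultIndex; i < event.results.length; i++) {
    var result = event.results[i];
    if (result.isFinal) {
      // result[0] is the most probable alternative
      console.log(result[0].transcript + ' (confidence: ' + result[0].confidence + ')');
    }
  }
};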

Browser Compatibility

The previous section pointed out that the proposal for the Web Speech API was made in late 2012. So far, the only browser that supports this API is Chrome, starting in version 25, with a very limited subset of the specification. Additionally, Chrome supports this API using the webkit prefix. Therefore, creating a speech recognition object looks like this in Chrome:

var recognizer = new webkitSpeechRecognition();

Demo

This section provides a demo of the Web Speech API in action. The demo page contains one readonly field and three buttons. The field is needed to show the transcription of the recognized speech. The first two buttons start and stop the recognition process, while the third clears the log of actions and error messages. The demo also allows you to choose between final-only and interim results using two radio buttons.

Because only Chrome supports this API, we perform a check, and if it fails we display an error message. Once support is verified, we initialize the speech recognition object so that we don’t have to perform this action every time the user clicks the “Play demo” button. We also attach a handler to start the recognition process. Note that inside the handler we also set the result type (final-only or interim). We set it inside the handler to ensure it reflects the user’s choice (it needs to be refreshed every time a new recognition starts).

A live demo of the code is available here. Oh, and just for fun, try to say a dirty word.

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    <title>Web Speech API Demo</title>
    <style>
      body
      {
        max-width: 500px;
        margin: 2em auto;
        font-size: 20px;
      }

      h1
      {
        text-align: center;
      }

      .buttons-wrapper
      {
        text-align: center;
      }

      .hidden
      {
        display: none;
      }

      #transcription,
      #log
      {
        display: block;
        width: 100%;
        height: 5em;
        overflow-y: scroll;
        border: 1px solid #333333;
        line-height: 1.3em;
      }

      .button-demo
      {
        padding: 0.5em;
        display: inline-block;
        margin: 1em auto;
      }
    </style>
  </head>
  <body>
    <h1>Web Speech API</h1>
    <h2>Transcription</h2>
    <textarea id="transcription" readonly="readonly"></textarea>

    <span>Results:</span>
    <label><input type="radio" name="recognition-type" value="final" checked="checked" /> Final only</label>
    <label><input type="radio" name="recognition-type" value="interim" /> Interim</label>

    <h3>Log</h3>
    <div id="log"></div>

    <div class="buttons-wrapper">
      <button id="button-play-ws" class="button-demo">Play demo</button>
      <button id="button-stop-ws" class="button-demo">Stop demo</button>
      <button id="clear-all" class="button-demo">Clear all</button>
    </div>
    <span id="ws-unsupported" class="hidden">API not supported</span>

    <script>
      // Test browser support
      window.SpeechRecognition = window.SpeechRecognition       ||
                                 window.webkitSpeechRecognition ||
                                 null;

      if (window.SpeechRecognition === null) {
        document.getElementById('ws-unsupported').classList.remove('hidden');
        document.getElementById('button-play-ws').setAttribute('disabled', 'disabled');
        document.getElementById('button-stop-ws').setAttribute('disabled', 'disabled');
      } else {
        var recognizer = new window.SpeechRecognition();
        var transcription = document.getElementById('transcription');
        var log = document.getElementById('log');

        // Recogniser doesn't stop listening even if the user pauses
        recognizer.continuous = true;

        // Display the results of the recognition
        recognizer.onresult = function(event) {
          transcription.textContent = '';

          for (var i = event.resultIndex; i < event.results.length; i++) {
            if (event.results[i].isFinal) {
              transcription.textContent = event.results[i][0].transcript + ' (Confidence: ' + event.results[i][0].confidence + ')';
            } else {
              transcription.textContent += event.results[i][0].transcript;
            }
          }
        };

        // Listen for errors
        recognizer.onerror = function(event) {
          log.innerHTML = 'Recognition error: ' + event.message + '<br />' + log.innerHTML;
        };

        document.getElementById('button-play-ws').addEventListener('click', function() {
          // Set if we need interim results
          recognizer.interimResults = document.querySelector('input[name="recognition-type"][value="interim"]').checked;

          try {
            recognizer.start();
            log.innerHTML = 'Recognition started' + '<br />' + log.innerHTML;
          } catch(ex) {
            log.innerHTML = 'Recognition error: ' + ex.message + '<br />' + log.innerHTML;
          }
        });

        document.getElementById('button-stop-ws').addEventListener('click', function() {
          recognizer.stop();
          log.innerHTML = 'Recognition stopped' + '<br />' + log.innerHTML;
        });

        document.getElementById('clear-all').addEventListener('click', function() {
          transcription.textContent = '';
          log.textContent = '';
        });
      }
    </script>
  </body>
</html>

Conclusion

This article introduced the Web Speech API and explained how it can help improve user experience, especially for those with disabilities. The implementation of this API is at a very early stage, with only Chrome offering a limited set of features. The potential of this API is incredible, so keep an eye on its evolution. As a final note, don’t forget to play with the demo; it’s really entertaining.