Build a Speech-to-text Web App with Whisper, React and Node

    Abiodun Sulaiman

    In this article, we’ll build a speech-to-text application using OpenAI’s Whisper, along with React, Node.js, and FFmpeg. The app will take audio input from the user, transcribe it to text using OpenAI’s Whisper API, and display the resulting transcription. Whisper gives the most accurate speech-to-text transcription I’ve used, even for a non-native English speaker.

    Table of Contents
    1. Introducing Whisper
    2. Prerequisites
    3. Tech Stack
    4. Setting Up the Project
    5. Integrating Whisper
    6. Installing FFmpeg
    7. Trim Audio in the Code
    8. The Frontend
    9. Conclusion

    Introducing Whisper

    OpenAI explains that Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the Web.

    Text is easier to search and store than audio. However, transcribing audio to text can be quite laborious. ASRs like Whisper can detect speech and transcribe the audio to text with a high level of accuracy and very quickly, making it a particularly useful tool.

    Prerequisites

    This article is aimed at developers who are familiar with JavaScript and have a basic understanding of React and Express.

    If you want to build along, you’ll need an API key. You can obtain one by signing up for an account on the OpenAI platform. Once you have an API key, make sure to keep it secure and not share it publicly.

    Tech Stack

    We’ll be building the frontend of this app with Create React App (CRA). All we’ll be doing in the frontend is uploading files, picking time boundaries, making network requests and managing a few states. I chose CRA for simplicity. Feel free to use any frontend library you prefer or even plain old JS. The code should be mostly transferable.

    For the backend, we’ll be using Node.js and Express, just so we can stick with a full JS stack for this app. You can use Fastify or any other alternative in place of Express and you should still be able to follow along.

    Note: in order to keep this article focused on the subject, long blocks of code will be linked to, so we can focus on the real tasks at hand.

    Setting Up the Project

    We start by creating a new folder that will contain both the frontend and backend for the project for organizational purposes. Feel free to choose any other structure you prefer:

    mkdir speech-to-text-app
    cd speech-to-text-app
    

    Next, we initialize a new React application using create-react-app:

    npx create-react-app frontend
    

    Navigate to the new frontend folder and install axios to make network requests, react-dropzone for file uploads, react-select for the time picker dropdowns, and react-toastify for notifications:

    cd frontend
    npm install axios react-dropzone react-select react-toastify
    

    Now, let’s switch back into the main folder and create the backend folder:

    cd ..
    mkdir backend
    cd backend
    

    Next, we initialize a new Node application in our backend directory, while also installing the required libraries:

    npm init -y
    npm install express dotenv cors multer form-data axios fluent-ffmpeg ffmetadata ffmpeg-static
    npm install --save-dev nodemon
    

    In the code above, we’ve installed the following libraries:

    • dotenv: necessary to keep our OpenAI API key away from the source code.
    • cors: to enable cross-origin requests.
    • multer: middleware for uploading our audio files. It adds a .file or .files object to the request object, which we’ll then access in our route handlers.
    • form-data: to programmatically create and submit forms with file uploads and fields to a server.
    • axios: to make network requests to the Whisper endpoint.

    Also, since we’ll be using FFmpeg for audio trimming, we have these libraries:

    • fluent-ffmpeg: this provides a fluent API to work with the FFmpeg tool, which we’ll use for audio trimming.
    • ffmetadata: this is used for reading and writing metadata in media files. We need it to retrieve the audio duration.
    • ffmpeg-static: this provides static FFmpeg binaries for different platforms, and simplifies deploying FFmpeg.

    Our entry file for the Node.js app will be index.js. Create the file inside the backend folder and open it in a code editor. Let’s wire up a basic Express server:

    const express = require('express');
    const cors = require('cors');
    require('dotenv').config(); // load the variables in .env into process.env
    
    const app = express();
    
    app.use(cors());
    app.use(express.json());
    
    app.get('/', (req, res) => {
      res.send('Welcome to the Speech-to-Text API!');
    });
    
    const PORT = process.env.PORT || 3001;
    app.listen(PORT, () => {
      console.log(`Server is running on port ${PORT}`);
    });
    

    Update package.json in the backend folder to include start and dev scripts:

    "scripts": {
      "start": "node index.js",
      "dev": "nodemon index.js",
    }
    

    The code above registers a simple GET route. When we run npm run dev and visit localhost:3001 (or whichever port we’ve set), we should see the welcome text.

    Integrating Whisper

    Now it’s time to add the secret sauce! In this section, we’ll:

    • accept a file upload on a POST route
    • convert the file to a readable stream
    • very importantly, send the file to Whisper for transcription
    • send the response back as JSON

    Let’s now create a .env file at the root of the backend folder to store our API key, and remember to add it to .gitignore:

    OPENAI_API_KEY=YOUR_API_KEY_HERE
    

    First, let’s import some of the libraries we need to handle file uploads, network requests and streaming:

    const multer = require('multer');
    const FormData = require('form-data');
    const { Readable } = require('stream');
    const axios = require('axios');
    
    const upload = multer();
    

    Next, we’ll create a simple utility function to convert the file buffer into a readable stream that we’ll send to Whisper:

    const bufferToStream = (buffer) => {
      return Readable.from(buffer);
    };
    

    We’ll create a new route, /api/transcribe, and use axios (which we already imported above) to make the request to OpenAI.
    
    Create the new route, like so:

    app.post('/api/transcribe', upload.single('file'), async (req, res) => {
      try {
        const audioFile = req.file;
        if (!audioFile) {
          return res.status(400).json({ error: 'No audio file provided' });
        }
    
        const formData = new FormData();
        const audioStream = bufferToStream(audioFile.buffer);
        formData.append('file', audioStream, { filename: 'audio.mp3', contentType: audioFile.mimetype });
        formData.append('model', 'whisper-1');
        formData.append('response_format', 'json');
    
        const config = {
          headers: {
            "Content-Type": `multipart/form-data; boundary=${formData._boundary}`,
            "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
          },
        };
    
        // Call the OpenAI Whisper API to transcribe the audio
        const response = await axios.post('https://api.openai.com/v1/audio/transcriptions', formData, config);
        const transcription = response.data.text;
        res.json({ transcription });
      } catch (error) {
        res.status(500).json({ error: 'Error transcribing audio' });
      }
    });
    

    In the code above, we use the utility function bufferToStream to convert the audio file buffer into a readable stream, then send it over a network request to Whisper and await the response, which is then sent back as a JSON response.

    You can check the docs for more on the request and response for Whisper.
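
    With the backend running, we can quickly sanity-check the endpoint from the command line before touching the frontend. Here’s a quick test using curl, assuming the server is on port 3001 and there’s a local file called sample.mp3 (adjust both to match your setup):
    
    # replace sample.mp3 with the path to one of your own audio files
    curl -X POST http://localhost:3001/api/transcribe \
      -F "file=@sample.mp3"
    
    If everything is wired up correctly, the response should be a small JSON body along the lines of { "transcription": "..." }.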

    Installing FFmpeg

    We’ll add additional functionality below to allow the user to transcribe a part of the audio. To do this, our API endpoint will accept startTime and endTime, after which we’ll trim the audio with ffmpeg.

    Installing FFmpeg for Windows

    To install FFmpeg for Windows, follow the simple steps below:

    1. Visit the FFmpeg official website’s download page here.
    2. Under the Windows icon there are several links. Choose the link that says “Windows Builds”, by gyan.dev.
    3. Download the build that corresponds to our system (32 or 64 bit). Make sure to download the “static” version to get all the libraries included.
    4. Extract the downloaded ZIP file. We can place the extracted folder wherever we prefer.
    5. To use FFmpeg from the command line without having to navigate to its folder, add the FFmpeg bin folder to the system PATH.

    Installing FFmpeg for macOS

    If we’re on macOS, we can install FFmpeg with Homebrew:

    brew install ffmpeg
    

    Installing FFmpeg for Linux

    If we’re on Linux, we can install FFmpeg with apt, dnf or pacman, depending on our Linux distribution. Here’s the command for installing with apt:

    sudo apt update
    sudo apt install ffmpeg
    
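    Whichever platform we’re on, we can confirm that FFmpeg is installed and available on our PATH by checking its version:
    
    ffmpeg -version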

    Trim Audio in the Code

    Why do we need to trim the audio? Say a user has an hour-long audio file and only wants to transcribe from the 15-minute mark to 45-minute mark. With FFmpeg, we can trim to the exact startTime and endTime, before sending the trimmed stream to Whisper for transcription.

    First, we’ll import the following libraries:

    const ffmpeg = require('fluent-ffmpeg');
    const ffmpegPath = require('ffmpeg-static');
    const ffmetadata = require('ffmetadata');
    const fs = require('fs');
    
    ffmpeg.setFfmpegPath(ffmpegPath);
    
    • fluent-ffmpeg is a Node.js module that provides a fluent API for interacting with FFmpeg.
    • ffmetadata will be used to read the metadata of the audio file — specifically, the duration.
    • ffmpeg.setFfmpegPath(ffmpegPath) is used to explicitly set the path to the FFmpeg binary.

    Next, let’s create a utility function to convert time passed as mm:ss into seconds. This can be outside of our app.post route, just like the bufferToStream function:

    /**
     * Convert time string of the format 'mm:ss' into seconds.
     * @param {string} timeString - Time string in the format 'mm:ss'.
     * @return {number} - The time in seconds.
     */
    const parseTimeStringToSeconds = timeString => {
        const [minutes, seconds] = timeString.split(':').map(tm => parseInt(tm));
        return minutes * 60 + seconds;
    }
    
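    For example, parseTimeStringToSeconds('15:30') returns 930, which is the kind of value we can hand to FFmpeg as a start time in seconds.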

    Next, we should update our app.post route to do the following:

    • accept the startTime and endTime
    • calculate the duration
    • deal with basic error handling
    • convert audio buffer to stream
    • trim audio with FFmpeg
    • send the trimmed audio to OpenAI for transcription

    The trimAudio function trims an audio stream between a specified start time and end time, and returns a promise that resolves with the trimmed audio data. If an error occurs at any point in this process, the promise is rejected with that error.

    Let’s break down the function step by step.

    1. Define the trim audio function. The trimAudio function is asynchronous and accepts the audioStream and endTime as arguments. We define temporary filenames for processing the audio:

      const trimAudio = async (audioStream, endTime) => {
          const tempFileName = `temp-${Date.now()}.mp3`;
          const outputFileName = `output-${Date.now()}.mp3`;
      
    2. Write stream to a temporary file. We write the incoming audio stream into a temporary file using fs.createWriteStream(). If there’s an error, the Promise gets rejected:

      return new Promise((resolve, reject) => {
          audioStream.pipe(fs.createWriteStream(tempFileName))
      
    3. Read metadata and set endTime. After the audio stream finishes writing to the temporary file, we read the metadata of the file using ffmetadata.read(). If the provided endTime is longer than the audio duration, we adjust endTime to be the duration of the audio:

      .on('finish', () => {
          ffmetadata.read(tempFileName, (err, metadata) => {
              if (err) return reject(err);
              const duration = parseFloat(metadata.duration);
              if (endTime > duration) endTime = duration;
      
    4. Trim Audio using FFmpeg. We utilize FFmpeg to trim the audio based on the start time (startSeconds) received and duration (timeDuration) calculated earlier. The trimmed audio is written to the output file:

      ffmpeg(tempFileName)
          .setStartTime(startSeconds)
          .setDuration(timeDuration)
          .output(outputFileName)
      
    5. Delete temporary files and resolve promise. After trimming the audio, we delete the temporary file and read the trimmed audio into a buffer. We also delete the output file using the Node.js file system after reading it to the buffer. If everything goes well, the Promise gets resolved with the trimmedAudioBuffer. In case of an error, the Promise gets rejected:

      .on('end', () => {
          fs.unlink(tempFileName, (err) => {
              if (err) console.error('Error deleting temp file:', err);
          });
      
          const trimmedAudioBuffer = fs.readFileSync(outputFileName);
      
          fs.unlink(outputFileName, (err) => {
              if (err) console.error('Error deleting output file:', err);
          });
      
          resolve(trimmedAudioBuffer);
      })
      .on('error', reject)
      .run();
      
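    Putting those fragments together, here’s a rough sketch of how the updated /api/transcribe endpoint might look, with trimAudio defined inside the route handler so it can see startSeconds. Treat it as a sketch assembled from the steps above — names like startSeconds, and exactly where the trim duration is calculated, are assumptions, and the repo version may differ in the details:
    
    // Sketch of the trimmed-transcription endpoint, assembled from the steps above
    app.post('/api/transcribe', upload.single('file'), async (req, res) => {
      try {
        const audioFile = req.file;
        if (!audioFile) {
          return res.status(400).json({ error: 'No audio file provided' });
        }
    
        // startTime and endTime arrive from the frontend as 'mm:ss' strings
        const startSeconds = parseTimeStringToSeconds(req.body.startTime);
        const endSeconds = parseTimeStringToSeconds(req.body.endTime);
    
        // Trim the audio between startSeconds and endTime, resolving with the trimmed buffer
        const trimAudio = async (audioStream, endTime) => {
          const tempFileName = `temp-${Date.now()}.mp3`;
          const outputFileName = `output-${Date.now()}.mp3`;
    
          return new Promise((resolve, reject) => {
            // Write the incoming stream to a temporary file
            audioStream.pipe(fs.createWriteStream(tempFileName))
              .on('error', reject)
              .on('finish', () => {
                // Read the metadata to find the real duration and clamp endTime
                ffmetadata.read(tempFileName, (err, metadata) => {
                  if (err) return reject(err);
                  const duration = parseFloat(metadata.duration);
                  if (endTime > duration) endTime = duration;
                  const timeDuration = endTime - startSeconds;
    
                  // Trim with FFmpeg and write the result to the output file
                  ffmpeg(tempFileName)
                    .setStartTime(startSeconds)
                    .setDuration(timeDuration)
                    .output(outputFileName)
                    .on('end', () => {
                      // Clean up the temp file, read the trimmed audio, clean up the output file
                      fs.unlink(tempFileName, (err) => {
                        if (err) console.error('Error deleting temp file:', err);
                      });
                      const trimmedAudioBuffer = fs.readFileSync(outputFileName);
                      fs.unlink(outputFileName, (err) => {
                        if (err) console.error('Error deleting output file:', err);
                      });
                      resolve(trimmedAudioBuffer);
                    })
                    .on('error', reject)
                    .run();
                });
              });
          });
        };
    
        const audioStream = bufferToStream(audioFile.buffer);
        const trimmedAudioBuffer = await trimAudio(audioStream, endSeconds);
    
        // Send the trimmed audio to Whisper, just like in the earlier version of the route
        const formData = new FormData();
        formData.append('file', bufferToStream(trimmedAudioBuffer), { filename: 'audio.mp3', contentType: audioFile.mimetype });
        formData.append('model', 'whisper-1');
        formData.append('response_format', 'json');
    
        const config = {
          headers: {
            "Content-Type": `multipart/form-data; boundary=${formData._boundary}`,
            "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
          },
        };
    
        const response = await axios.post('https://api.openai.com/v1/audio/transcriptions', formData, config);
        res.json({ transcription: response.data.text });
      } catch (error) {
        res.status(500).json({ error: 'Error transcribing audio' });
      }
    });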

    The full code for the endpoint is available in this GitHub repo.

    The Frontend

    The styling will be done with Tailwind, but I won’t cover setting up Tailwind. You can read about how to set up and use Tailwind here.

    Creating the TimePicker component

    Since our API accepts startTime and endTime, let’s create a TimePicker component with react-select.
    Using react-select adds extra features to the select menu, such as searching the options, but it isn’t critical to this article and can be skipped.

    Let’s break down the TimePicker React component below:

    1. Imports and component declaration. First, we import necessary packages and declare our TimePicker component. The TimePicker component accepts the props id, label, value, onChange, and maxDuration:

      import React, { useState, useEffect, useCallback } from 'react';
      import Select from 'react-select';
      
      const TimePicker = ({ id, label, value, onChange, maxDuration }) => {
      
    2. Parse the value prop. The value prop is expected to be a time string (format HH:MM:SS). Here we split the time into hours, minutes, and seconds:

      const [hours, minutes, seconds] = value.split(':').map((v) => parseInt(v, 10));
      
    3. Calculate maximum values. maxDuration is the maximum time in seconds that can be selected, based on audio duration. It’s converted into hours, minutes, and seconds:

      const validMaxDuration = maxDuration === Infinity ? 0 : maxDuration;
      const maxHours = Math.floor(validMaxDuration / 3600);
      const maxMinutes = Math.floor((validMaxDuration % 3600) / 60);
      const maxSeconds = Math.floor(validMaxDuration % 60);
      
    4. Options for time selects. We create arrays for possible hours, minutes, and seconds options, and state hooks to manage the minute and second options:

      const hoursOptions = Array.from({ length: Math.max(0, maxHours) + 1 }, (_, i) => i);
      const minutesSecondsOptions = Array.from({ length: 60 }, (_, i) => i);
      
      const [minuteOptions, setMinuteOptions] = useState(minutesSecondsOptions);
      const [secondOptions, setSecondOptions] = useState(minutesSecondsOptions);
      
    5. Update value function. This function updates the current value by calling the onChange function passed in as a prop:

      const updateValue = (newHours, newMinutes, newSeconds) => {
          onChange(`${String(newHours).padStart(2, '0')}:${String(newMinutes).padStart(2, '0')}:${String(newSeconds).padStart(2, '0')}`);
      };
      
    6. Update minute and second options function. This function updates the minute and second options depending on the selected hours and minutes:

      const updateMinuteAndSecondOptions = useCallback((newHours, newMinutes) => {
          const minutesSecondsOptions = Array.from({ length: 60 }, (_, i) => i);
          let newMinuteOptions = minutesSecondsOptions;
          let newSecondOptions = minutesSecondsOptions;
          if (newHours === maxHours) {
              newMinuteOptions = Array.from({ length: Math.max(0, maxMinutes) + 1 }, (_, i) => i);
              if (newMinutes === maxMinutes) {
                  newSecondOptions = Array.from({ length: Math.max(0, maxSeconds) + 1 }, (_, i) => i);
              }
          }
          setMinuteOptions(newMinuteOptions);
          setSecondOptions(newSecondOptions);
      }, [maxHours, maxMinutes, maxSeconds]);
      
    7. Effect Hook. This calls updateMinuteAndSecondOptions when hours or minutes change:

      useEffect(() => {
          updateMinuteAndSecondOptions(hours, minutes);
      }, [hours, minutes, updateMinuteAndSecondOptions]);
      
    8. Helper functions. These two helper functions convert time integers to select options and vice versa:

      const toOption = (value) => ({
          value: value,
          label: String(value).padStart(2, '0'),
      });
      const fromOption = (option) => option.value;
      
    9. Render. The render function displays the time picker, which consists of three dropdown menus (hours, minutes, seconds) managed by the react-select library. Changing the value in the select boxes will call updateValue and updateMinuteAndSecondOptions, which were explained above.

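    To make that concrete, here’s a simplified sketch of what the returned JSX might look like (here the effect hook takes care of refreshing the minute and second options) — the real component adds styling and a few extra details:
    
    // Simplified sketch of the TimePicker render — the repo version adds Tailwind classes
    return (
      <div>
        <label htmlFor={id}>{label}</label>
        <Select
          inputId={id}
          value={toOption(hours)}
          options={hoursOptions.map(toOption)}
          onChange={(option) => updateValue(fromOption(option), minutes, seconds)}
        />
        <Select
          value={toOption(minutes)}
          options={minuteOptions.map(toOption)}
          onChange={(option) => updateValue(hours, fromOption(option), seconds)}
        />
        <Select
          value={toOption(seconds)}
          options={secondOptions.map(toOption)}
          onChange={(option) => updateValue(hours, minutes, fromOption(option))}
        />
      </div>
    );
    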
    You can find the full source code of the TimePicker component on GitHub.

    The main component

    Now let’s build the main frontend component by replacing App.js.

    The App component will implement a transcription page with the following functionalities:

    • Define helper functions for time format conversion.
    • Update startTime and endTime based on selection from the TimePicker component.
    • Define a getAudioDuration function that retrieves the duration of the audio file and updates the audioDuration state.
    • Handle file uploads for the audio file to be transcribed.
    • Define a transcribeAudio function that sends the audio file by making an HTTP POST request to our API.
    • Render UI for file upload.
    • Render TimePicker components for selecting startTime and endTime.
    • Display notification messages.
    • Display the transcribed text.

    Let’s break this component down into several smaller sections:

    1. Imports and helper functions. Import necessary modules and define helper functions for time conversions (sketched just after this list):

      import React, { useState, useCallback } from 'react';
      import { useDropzone } from 'react-dropzone'; // for file upload
      import axios from 'axios'; // to make network request
      import TimePicker from './TimePicker'; // our custom TimePicker
      import { toast, ToastContainer } from 'react-toastify'; // for toast notification
      
      // Helper functions (timeToSeconds, secondsToTime, timeToMinutesAndSeconds)
      
    2. Component declaration and state hooks. Declare the TranscriptionPage component and initialize state hooks:

      const TranscriptionPage = () => {
        const [uploading, setUploading] = useState(false);
        const [transcription, setTranscription] = useState('');
        const [audioFile, setAudioFile] = useState(null);
        const [startTime, setStartTime] = useState('00:00:00');
        const [endTime, setEndTime] = useState('00:10:00'); // 10 minutes default endtime
        const [audioDuration, setAudioDuration] = useState(null);
        // ...
      
    3. Event handlers. Define various event handlers — for handling start time change, getting audio duration, handling file drop, and transcribing audio:

      const handleStartTimeChange = (newStartTime) => {
        //...
      };
      
      const getAudioDuration = (file) => {
        //...
      };
      
      const onDrop = useCallback((acceptedFiles) => {
        //...
      }, []);
      
      const transcribeAudio = async () => { // we'll explain this in detail shortly
        //...
      };
      
    4. Use the Dropzone hook. Use the useDropzone hook from the react-dropzone library to handle file drops:

      const { getRootProps, getInputProps, isDragActive, isDragReject } = useDropzone({
        onDrop,
        accept: 'audio/*',
      });
      
    5. Render. Finally, render the component. This includes a dropzone for file upload, TimePicker components for setting start and end times, a button for starting the transcription process, and a display for the resulting transcription.

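    The time-conversion helpers are only referenced above, not shown. Sketched out, they might look something like this — the versions in the repo may differ in detail, but the idea is the same: timeToMinutesAndSeconds folds the hours into the minutes so the backend receives the 'mm:ss' format it expects:
    
    // Sketches of the elided helpers — the repo versions may differ
    // Convert 'HH:MM:SS' into a total number of seconds
    const timeToSeconds = (time) => {
      const [hours, minutes, seconds] = time.split(':').map(Number);
      return hours * 3600 + minutes * 60 + seconds;
    };
    
    // Convert a number of seconds back into an 'HH:MM:SS' string
    const secondsToTime = (totalSeconds) => {
      const hours = Math.floor(totalSeconds / 3600);
      const minutes = Math.floor((totalSeconds % 3600) / 60);
      const seconds = Math.floor(totalSeconds % 60);
      return [hours, minutes, seconds].map((v) => String(v).padStart(2, '0')).join(':');
    };
    
    // Convert 'HH:MM:SS' into 'MM:SS', e.g. '01:15:30' becomes '75:30'
    const timeToMinutesAndSeconds = (time) => {
      const totalSeconds = timeToSeconds(time);
      const minutes = Math.floor(totalSeconds / 60);
      const seconds = totalSeconds % 60;
      return `${minutes}:${String(seconds).padStart(2, '0')}`;
    };
    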
    The transcribeAudio function is an asynchronous function responsible for sending the audio file to a server for transcription. Let’s break it down:

    const transcribeAudio = async () => {
        setUploading(true);
    
        try {
          const formData = new FormData();
          audioFile && formData.append('file', audioFile);
          formData.append('startTime', timeToMinutesAndSeconds(startTime));
          formData.append('endTime', timeToMinutesAndSeconds(endTime));
    
          const response = await axios.post(`http://localhost:3001/api/transcribe`, formData, {
            headers: { 'Content-Type': 'multipart/form-data' },
          });
    
          setTranscription(response.data.transcription);
          toast.success('Transcription successful.')
        } catch (error) {
          toast.error('An error occurred during transcription.');
        } finally {
          setUploading(false);
        }
      };
    

    Here’s a more detailed look:

    1. setUploading(true);. This line sets the uploading state to true, which we use to indicate to the user that the transcription process has started.

    2. const formData = new FormData();. FormData is a web API used to send form data to the server. It allows us to send key–value pairs where the value can be a Blob, File or a string.

    3. The audioFile is appended to the formData object, provided it’s not null (audioFile && formData.append('file', audioFile);). The start and end times are also appended to the formData object, but they’re converted to MM:SS format first.

    4. The axios.post method is used to send the formData to a server endpoint (http://localhost:3001/api/transcribe). Change http://localhost:3001 to your own server’s address if it’s hosted elsewhere. This is done with the await keyword, meaning the function will pause until the Promise is resolved or rejected.

    5. If the request is successful, the response object will contain the transcription result (response.data.transcription). This is then set to the transcription state using the setTranscription function. A successful toast notification is then shown.

    6. If an error occurs during the process, an error toast notification is shown.

    7. In the finally block, regardless of the outcome (success or error), the uploading state is set back to false to allow the user to try again.

    In essence, the transcribeAudio function is responsible for coordinating the entire transcription process, including handling the form data, making the server request, and handling the server response.

    You can find the full source code of the App component on GitHub.

    Conclusion

    We’ve reached the end and now have a full web application that transcribes speech to text with the power of Whisper.

    We could definitely add a lot more functionality, but I’ll let you build the rest on your own. Hopefully we’ve gotten you off to a good start.

    Here’s the full source code: