Transcription Made Easy: Exploring Whisper’s Transcription Features
Amruta Agnihotri
Senior Software Architect
Divya Gupta
Software Engineer
Introduction
Transcription is the process of transforming spoken words into text, making information more accessible, searchable, and easier to manage. In everyday life, transcription simplifies tasks by allowing people to revisit and analyze conversations, meetings, or lectures without relying solely on memory. It is therefore widely used in fields like medicine, law, academia, business, media, and market research for better accessibility, analysis, and documentation. Transcription also powers applications like voice assistants, where it enables voice command recognition, personalized responses, language translation, and accessibility features, making them more intuitive and user-friendly.
When it comes to implementing a transcription solution in your system, there are essentially two approaches you can take. One is integrating with a cloud transcription service such as Amazon Transcribe (AWS), Google Cloud Speech-to-Text, Microsoft Azure Speech to Text, or the OpenAI Whisper API. The other is to deploy and maintain an open-source solution on your own; there are many alternatives in this space as well, such as Vosk, Whisper, DeepSpeech, Athena, and TensorFlowASR, to name a few.

While choosing a transcription solution, one needs to consider multiple factors. A few important ones are:

  • Accuracy, and the amount of training data used, which directly affects it
  • Cost of deployment
  • Supported languages
  • Latency

In this article, we will take a closer look at using OpenAI’s Whisper model, as its accuracy is among the best of the available options so far and it can be used in both flavors: on-premise and API-based.

About OpenAI Whisper
OpenAI Whisper is an advanced speech recognition system developed by OpenAI, designed to transcribe spoken language into written text with high accuracy. It is one of the most accurate automatic speech recognition models available, and it stands out from other tools in the market due to the sheer volume of its training data: it was trained on 680,000 hours of audio collected from the internet. Whisper can be used either through the OpenAI Whisper API directly or by downloading the open-source Whisper library.
Integrating with OpenAI’s Whisper APIs
Integrating with the API-based Whisper model has several advantages (a minimal usage sketch follows the list below):
  • It can scale effortlessly to handle large volumes of transcription tasks, making it suitable for businesses of all sizes.
  • Because transcription runs in OpenAI’s cloud, it works at high speed even from ordinary machines; no dedicated instance is required to host the service, making it easy to implement and maintain.
  • By using the API, businesses can avoid the overhead costs of maintaining in-house infrastructure for speech recognition, opting instead for a pay-as-you-go model that scales with their needs. The cost is $0.006 per 60 seconds of audio as of September 2024, when this article was written.
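A minimal sketch of calling the hosted API with the official openai Python client (version 1.x or later) might look like the following; the file name meeting.mp3 is a hypothetical example, and the OPENAI_API_KEY environment variable is assumed to be set:

    # Minimal sketch: transcribing an audio file via OpenAI's hosted Whisper API.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("meeting.mp3", "rb") as audio_file:  # hypothetical input file
        transcription = client.audio.transcriptions.create(
            model="whisper-1",  # the hosted Whisper model
            file=audio_file,
        )

    print(transcription.text)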
While it’s a convenient choice for quick prototyping, it may not suit every use case: its cloud-based nature involves sharing data with OpenAI, which can be a concern in some scenarios. Additionally, the service’s performance can be affected by external factors such as server downtime and variable API costs, which means you might not have full control over it.
Owning your AI solution becomes critical for ensuring data privacy and control in certain use cases.
Own your Whisper Transcription Solution
Whisper is an open-source model that comes primarily in six variants, named ‘tiny’, ‘base’, ‘small’, ‘medium’, ‘large’, and ‘turbo’. Deploying it yourself offers the following benefits (a local-usage sketch follows the list):
  • The model can be deployed in an environment of your choice, on-premise or on your own cloud.
  • Running the Whisper library locally keeps data under your control, providing stronger security and data privacy.
  • The service can run on an ordinary machine at no extra cost, provided you accept the trade-off: either lower accuracy at high speed or higher accuracy at low speed.
  • A model variant can be chosen to match your specific requirements for transcription accuracy and acceptable response time (i.e., whether you need real-time transcription or batch processing).
  • For example, on an instance with 2 vCPUs and 4 GiB of RAM, transcribing a 6-second raw PCM audio buffer takes 2-3 seconds with the tiny model, whereas the small model takes 11-12 seconds and uses about 2 GB of RAM.
  • Vertical and horizontal scalability of the deployment can be managed as per your needs.
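As a minimal sketch, running the open-source library locally can be as simple as the following, assuming the openai-whisper package is installed (pip install -U openai-whisper) along with ffmpeg; meeting.mp3 is a hypothetical input file:

    # Minimal sketch: transcribing a file with the open-source Whisper library.
    import whisper

    # Choose a variant to trade accuracy against speed and memory:
    # "tiny", "base", "small", "medium", "large", or "turbo".
    model = whisper.load_model("small")

    result = model.transcribe("meeting.mp3")
    print(result["text"])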
Getting the best results with the Whisper Model
To get the best transcription results from the Whisper model, keep the following tips in mind:
  • Ensure the audio clip is preprocessed to eliminate or reduce noise and silence. This not only yields more accurate transcriptions but also speeds up processing, since the model receives only meaningful data. With the cloud Whisper API, it also reduces cost, as pricing is based on the number of seconds of audio processed.
  • Whisper can automatically detect the spoken language, so no explicit specification is needed. However, for optimal results it’s recommended to set the spoken language when possible, especially in noisy environments.
  • Pay attention to the configurable parameters the Whisper model provides. For example, log_prob_threshold determines whether a segment should be considered failed, based on the average log probability of the sampled tokens. Lowering this value makes Whisper more sensitive to quiet sounds (both thresholds appear in the sketch after this list).
  • Another such parameter is no_speech_threshold, which helps Whisper decide whether a segment is silent: if the no-speech probability is higher than this value and the average log probability is below log_prob_threshold, the segment is considered silent.
  • The Whisper model can also translate speech from various languages directly into English. This is particularly useful in multilingual settings, such as meetings where participants speak different languages but you need a unified transcription in English.
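As a rough sketch, here is how these tips map onto the open-source library’s decode options. Note that the reference openai-whisper package spells the first threshold logprob_threshold; the file name, language, and parameter values below are illustrative assumptions, not recommendations:

    # Illustrative sketch: tuning Whisper's decoding options locally.
    import whisper

    model = whisper.load_model("small")

    result = model.transcribe(
        "meeting.mp3",            # hypothetical input file
        language="hi",            # set the spoken language when it is known
        task="translate",         # translate non-English speech into English text
        logprob_threshold=-1.0,   # segments whose average log prob falls below this are treated as failed
        no_speech_threshold=0.6,  # combined with the above to mark a segment as silent
    )
    print(result["text"])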
Authors: Amruta Agnihotri and Divya Gupta. Posted on Oct 10, 2024.