Speech Recognition in Python Using OpenAI Whisper
In this blog, we dive into OpenAI Whisper, a versatile, open-source speech recognition model. We’ll explore its core features, including multilingual transcription, and walk through practical code examples to help you get started with real-world speech-to-text and voice processing tasks.

Table of Contents
1) What is OpenAI Whisper?
2) Installing Whisper in Python
3) Available Whisper Models
4) Python Code Example: Basic Transcription
5) Python Code Example: Live Speech Transcription
6) Python Code Example: Language Detection
7) Licensing and Intended Use
8) Full Source Code on GitHub
1) What is OpenAI Whisper?
It is a general-purpose, Transformer-based speech recognition model, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This includes a wide range of tasks such as:
- Multilingual speech recognition
- Spoken language identification
- Voice activity detection
- Speech translation
It processes input audio by splitting it into 30-second chunks, converting those into log-Mel spectrograms, and passing them through an encoder-decoder architecture. These tasks are jointly represented as a sequence of tokens for the decoder to predict, replacing multiple components of a traditional speech-processing pipeline.
The model is known for its robustness across accents, background noise, and technical language. However, its performance may vary across languages, especially those with limited training data.
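The low-level API mirrors this pipeline directly. Here is a minimal sketch (installation is covered in the next section, and the audio path is a placeholder) that pads or trims an audio file to one 30-second window, builds the log-Mel spectrogram, and decodes it into text:
import whisper
# Load a small checkpoint and an audio file (replace the path with your own)
model = whisper.load_model("base")
audio = whisper.load_audio("<Enter audio file path>")
# Pad or trim the waveform to the 30-second window Whisper operates on
audio = whisper.pad_or_trim(audio)
# Convert the waveform into a log-Mel spectrogram on the model's device
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Let the decoder predict the token sequence and print the resulting text
options = whisper.DecodingOptions(fp16=False)  # fp16=False is the safe choice on CPU
result = whisper.decode(model, mel, options)
print(result.text)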
2) Installing Whisper in Python
It requires Python (3.8–3.11 recommended) and PyTorch.
Note: I am using Python 3.13.2 for this exercise.
python -m pip install -U openai-whisper
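To confirm the installation, you can import the package and list the model names it ships with (a quick sanity check; the exact list depends on the installed version):
import whisper
# Prints the available checkpoint names, e.g. tiny, base, small, medium, large, turbo
print(whisper.available_models())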
Helper libraries:
- The sounddevice package is used to record audio from the microphone.
- numpy is used to store the recorded sound in memory.
python -m pip install sounddevice numpy
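If you are not sure which microphone Python will pick up, sounddevice can list the audio devices it sees (purely a sanity check, not needed for the examples below):
import sounddevice as sd
# List every input/output device so you can confirm a microphone is detected
print(sd.query_devices())
# Show the indices of the default input and output devices
print(sd.default.device)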
FFmpeg: a complete, cross-platform solution to record, convert, and stream audio and video. Whisper requires it to be installed on your system, and it is available from most package managers.
# on Windows using Chocolatey (https://chocolatey.org/)
# choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
# scoop install ffmpeg
# on Ubuntu or Debian
# sudo apt update && sudo apt install ffmpeg
# on Arch Linux
# sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
# brew install ffmpeg
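After installing, it is worth confirming that the ffmpeg executable is actually on your PATH, because Whisper shells out to it when loading audio files. A minimal check using only the Python standard library:
import shutil
# Whisper invokes the ffmpeg binary to decode audio files,
# so it must be discoverable on the system PATH
path = shutil.which("ffmpeg")
if path is None:
    print("ffmpeg not found - install it with your package manager")
else:
    print("ffmpeg found at:", path)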
3) Available Whisper Models
It offers six model sizes with trade-offs between speed, accuracy, and VRAM requirements. English-only (.en) variants often perform better for English inputs.
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39M | tiny.en | tiny | ~1 GB | ~10x |
| base | 74M | base.en | base | ~1 GB | ~7x |
| small | 244M | small.en | small | ~2 GB | ~4x |
| medium | 769M | medium.en | medium | ~5 GB | ~2x |
| large | 1550M | N/A | large | ~10 GB | 1x |
| turbo | 809M | N/A | turbo | ~6 GB | ~8x |
The turbo model is an optimized version of large-v3, offering faster transcription with minimal accuracy trade-off.
See the Model Card for the latest information on the models.
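For predominantly English audio, loading one of the .en checkpoints from the table is usually the better choice. A small sketch (the model name base.en comes from the table above):
import whisper
# English-only checkpoints such as base.en are often more accurate for English audio
model = whisper.load_model("base.en")
# The .en variants are not multilingual
print("Multilingual:", model.is_multilingual)
# Rough parameter count, for comparison with the table above
print("Parameters:", sum(p.numel() for p in model.parameters()))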
4) Python Code Example: Basic Transcription
Here’s how to transcribe audio using Whisper in Python:
import whisper
# Load the medium checkpoint (downloaded automatically on first use)
model = whisper.load_model("medium")
# Transcribe the audio file and print the recognized text
result = model.transcribe("<Enter audio file path>")
print(result["text"])
Internally, it uses a 30-second sliding window to process audio, making autoregressive predictions for each segment.
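transcribe() also accepts decoding options and returns per-segment timestamps alongside the full text. The sketch below (the file path is again a placeholder) skips automatic language detection by naming the language explicitly and prints each segment with its start and end time:
import whisper
model = whisper.load_model("medium")
# Passing language="en" skips language detection; task="translate" would instead
# translate the speech into English
result = model.transcribe("<Enter audio file path>", language="en", fp16=False)
# Each segment carries start/end times in seconds plus its transcribed text
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")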
5) Python Code Example: Live Speech Transcription
This example demonstrates how to record audio live from your microphone using Python and transcribe it on the fly using OpenAI Whisper. It uses the sounddevice library to capture audio and runs the transcription in memory; no audio file is saved.
This is useful for building offline dictation tools, accessibility utilities, or voice-controlled interfaces.
import whisper
import sounddevice as sd
import numpy as np
# Load the Whisper model
model = whisper.load_model("medium")
# Audio settings
samplerate = 16000 # Whisper expects 16000 Hz audio
duration = 5 # seconds to record
print("Speak into your microphone...")
# Record audio from the default microphone
audio = sd.rec(int(samplerate * duration), samplerate=samplerate, channels=1, dtype='float32')
sd.wait()
print("Recording stopped.")
# Flatten the recording to a 1-D mono array
# Whisper expects a numpy array with shape (samples,) and float32 values in the range [-1.0, 1.0]
audio_np = np.squeeze(audio)
print("Audio conversion complete.")
# Transcribe directly without saving to a file
result = model.transcribe(audio_np, fp16=False) # Set fp16=False if you're on CPU
print("Transcription:" + result["text"])
6) Python Code Example: Language Detection
In addition to high-level transcription, it provides low-level access to internal processes such as language detection. This is useful when you need more control over audio processing, for instance, identifying the spoken language before choosing how to handle or translate the input.
The following code:
- Loads an audio file.
- Converts it to a log-Mel spectrogram.
- Detects the spoken language.
import whisper
model = whisper.load_model("medium")
# Load and prepare audio
audio = whisper.load_audio("<Enter audio file path>")
audio = whisper.pad_or_trim(audio)
# Create log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Detect language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
7) Licensing and Intended Use
Whisper is released under the MIT License. While its capabilities are powerful and accessible, it is primarily intended for:
- AI research on model robustness and generalization
- Developers building accessible ASR tools (especially in English)
Note: The model is not suitable for real-time transcription out of the box, and its performance may vary across languages. It also has potential dual-use implications; users should avoid deploying it in sensitive or high-risk decision-making environments without proper evaluation.
8) Full Source Code on GitHub
You can find the complete source code for this project in the GitHub repository below:
