Speech Recognition in Python Using OpenAI Whisper
In this blog, we dive into OpenAI Whisper, a versatile, open-source speech recognition model. We’ll explore its core features, including multilingual transcription, and walk through practical code examples to help you get started with real-world speech-to-text and voice processing tasks.

Table of Contents
1) What is OpenAI Whisper?
2) Installing Whisper in Python
3) Available Whisper Models
4) Python Code Example: Basic Transcription
5) Python Code Example: Live Speech Transcription
6) Python Code Example: Language Detection
7) Licensing and Intended Use
8) Full Source Code on GitHub
1) What is OpenAI Whisper?
It is a general-purpose, Transformer-based speech recognition model, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This includes a wide range of tasks such as:
- Multilingual speech recognition
- Spoken language identification
- Voice activity detection
- Speech translation
It processes input audio by splitting it into 30-second chunks, converting those into log-Mel spectrograms, and passing them through an encoder-decoder architecture. These tasks are jointly represented as a sequence of tokens for the decoder to predict, replacing multiple components of a traditional speech-processing pipeline.
The model is known for its robustness across accents, background noise, and technical language. However, its performance may vary across languages, especially those with limited training data.
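The low-level API mirrors this pipeline directly. Here is a minimal sketch (installation is covered in the next section, and the audio path is a placeholder) that pads or trims an audio file to one 30-second window, builds the log-Mel spectrogram, and decodes it into text:
import whisper
# Load a small checkpoint and an audio file (replace the path with your own)
model = whisper.load_model("base")
audio = whisper.load_audio("<Enter audio file path>")
# Pad or trim the waveform to the 30-second window Whisper operates on
audio = whisper.pad_or_trim(audio)
# Convert the waveform into a log-Mel spectrogram on the model's device
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Let the decoder predict the token sequence and print the resulting text
options = whisper.DecodingOptions(fp16=False)  # fp16=False is the safe choice on CPU
result = whisper.decode(model, mel, options)
print(result.text)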
2) Installing Whisper in Python
It requires Python (3.8–3.11 recommended) and PyTorch.
Note: I am using Python 3.13.2 for this exercise.
python -m pip install -U openai-whisper
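To confirm the installation, you can import the package and list the model names it ships with (a quick sanity check; the exact list depends on the installed version):
import whisper
# Prints the available checkpoint names, e.g. tiny, base, small, medium, large, turbo
print(whisper.available_models())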
Helper libraries:
- The sounddevice package is used to record audio from the microphone.
- numpy is used to store the recorded sound in memory.
python -m pip install sounddevice numpy
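If you are not sure which microphone Python will pick up, sounddevice can list the audio devices it sees (purely a sanity check, not needed for the examples below):
import sounddevice as sd
# List every input/output device so you can confirm a microphone is detected
print(sd.query_devices())
# Show the indices of the default input and output devices
print(sd.default.device)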
FFmpeg: a complete, cross-platform solution to record, convert, and stream audio and video. Whisper requires it to be installed on your system, and it is available from most package managers.
# on Windows using Chocolatey (https://chocolatey.org/)
# choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
# scoop install ffmpeg
# on Ubuntu or Debian
# sudo apt update && sudo apt install ffmpeg
# on Arch Linux
# sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
# brew install ffmpeg
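After installing, it is worth confirming that the ffmpeg executable is actually on your PATH, because Whisper shells out to it when loading audio files. A minimal check using only the Python standard library:
import shutil
# Whisper invokes the ffmpeg binary to decode audio files,
# so it must be discoverable on the system PATH
path = shutil.which("ffmpeg")
if path is None:
    print("ffmpeg not found - install it with your package manager")
else:
    print("ffmpeg found at:", path)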
3) Available Whisper Models
It offers six model sizes with trade-offs between speed, accuracy, and VRAM requirements. English-only (.en) variants often perform better for English inputs.
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39M | tiny.en | tiny | ~1 GB | ~10x |
| base | 74M | base.en | base | ~1 GB | ~7x |
| small | 244M | small.en | small | ~2 GB | ~4x |
| medium | 769M | medium.en | medium | ~5 GB | ~2x |
| large | 1550M | N/A | large | ~10 GB | 1x |
| turbo | 809M | N/A | turbo | ~6 GB | ~8x |
The turbo model is an optimized version of large-v3, offering faster transcription with minimal accuracy trade-off.
See the Model Card for the latest information on the models.
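For predominantly English audio, loading one of the .en checkpoints from the table is usually the better choice. A small sketch (the model name base.en comes from the table above):
import whisper
# English-only checkpoints such as base.en are often more accurate for English audio
model = whisper.load_model("base.en")
# The .en variants are not multilingual
print("Multilingual:", model.is_multilingual)
# Rough parameter count, for comparison with the table above
print("Parameters:", sum(p.numel() for p in model.parameters()))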
4) Python Code Example: Basic Transcription
Here’s how to transcribe audio using Whisper in Python:
import whisper
# Load the medium checkpoint (downloaded automatically on first use)
model = whisper.load_model("medium")
# Transcribe the audio file and print the recognized text
result = model.transcribe("<Enter audio file path>")
print(result["text"])
Internally, it uses a 30-second sliding window to process audio, making autoregressive predictions for each segment.
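transcribe() also accepts decoding options and returns per-segment timestamps alongside the full text. The sketch below (the file path is again a placeholder) skips automatic language detection by naming the language explicitly and prints each segment with its start and end time:
import whisper
model = whisper.load_model("medium")
# Passing language="en" skips language detection; task="translate" would instead
# translate the speech into English
result = model.transcribe("<Enter audio file path>", language="en", fp16=False)
# Each segment carries start/end times in seconds plus its transcribed text
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")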
5) Python Code Example: Live Speech Transcription
This example demonstrates how to record audio live from your microphone using Python and transcribe it on the fly using OpenAI Whisper. It uses the sounddevice library to capture audio and runs the transcription in memory; no audio file is saved.
This is useful for building offline dictation tools, accessibility utilities, or voice-controlled interfaces.
import whisper
import sounddevice as sd
import numpy as np
# Load the Whisper model
model = whisper.load_model("medium")
# Audio settings
samplerate = 16000 # Whisper expects 16000 Hz audio
duration = 5 # seconds to record
print("Speak into your microphone...")
# Record audio from the default microphone
audio = sd.rec(int(samplerate * duration), samplerate=samplerate, channels=1, dtype='float32')
sd.wait()
print("Recording stopped.")
# Flatten the recording to a 1-D mono array
# Whisper expects a numpy array with shape (samples,) and float32 values in the range [-1.0, 1.0]
audio_np = np.squeeze(audio)
print("Audio conversion complete.")
# Transcribe directly without saving to a file
result = model.transcribe(audio_np, fp16=False) # Set fp16=False if you're on CPU
print("Transcription:" + result["text"])
6) Python Code Example: Language Detection
In addition to high-level transcription, it provides low-level access to internal processes such as language detection. This is useful when you need more control over audio processing, for instance, identifying the spoken language before choosing how to handle or translate the input.
The following code:
- Loads an audio file.
- Converts it to a log-Mel spectrogram.
- Detects the spoken language.
import whisper
model = whisper.load_model("medium")
# Load and prepare audio
audio = whisper.load_audio("<Enter audio file path>")
audio = whisper.pad_or_trim(audio)
# Create log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# Detect language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
7) Licensing and Intended Use
Whisper is released under the MIT License. While its capabilities are powerful and accessible, it is primarily intended for:
- AI research on model robustness and generalization
- Developers building accessible ASR tools (especially in English)
Note: The model is not suitable for real-time transcription out of the box, and its performance may vary across languages. It also has potential dual-use implications; users should avoid deploying it in sensitive or high-risk decision-making environments without proper evaluation.
8) Full Source Code on GitHub
You can find the complete source code for this project in the GitHub repository below:
