Transcribing videos or audio files into text unlocks a whole new range of possibilities, from creating searchable notes and summaries to repurposing content for blogs or newsletters. With Whisper from OpenAI, you can convert the spoken words in a video into an accurate text file. In this post, we’ll walk through how to extract a video’s transcript with Whisper, step by step, so you can streamline your workflow and focus on what really matters: the content itself.
How to Install Whisper on macOS with Homebrew
Whisper is a speech recognition tool that allows you to transcribe audio and video into text with high accuracy. If you’re on macOS, you can set it up quickly using Homebrew.
Step 1: Update Homebrew
Refresh Homebrew’s package index so you get the latest versions:
brew update
Step 2: Install Python
Whisper runs on Python, so make sure you have it installed:
brew install python
Step 3: Install FFmpeg
Whisper uses FFmpeg to process audio and video files:
brew install ffmpeg
Step 4: Install Whisper via pip
Finally, install Whisper from PyPI:
pip3 install -U openai-whisper
Note: newer Homebrew Python releases are marked as externally managed (PEP 668) and reject a plain pip install. In that case, install Whisper into a virtual environment or via pipx (pipx install openai-whisper).
Transcribe a Video
Once installed, you can transcribe any video or audio file directly from the terminal. For example:
whisper input.mp4 --model small --language en --output_format txt
This will create a file called input.txt containing the transcript.
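The transcript takes its name from the input file: Whisper swaps the extension for the chosen output format. The naming can be sketched with plain shell string handling (no Whisper needed):

```shell
f="input.mp4"
echo "${f%.*}.txt"   # strips the extension and appends .txt, prints: input.txt
```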
With just these few steps, you can turn your videos into clean text files and start building searchable notes, summaries, or even blog content in no time.
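“Searchable” can start as simply as grep over a folder of transcripts. A minimal sketch, with a stand-in transcripts/ folder and demo file created here so the snippet is self-contained (in practice, Whisper writes the files for you):

```shell
# Create a stand-in transcript for demonstration purposes
mkdir -p transcripts
printf 'Whisper converts speech to text.\n' > transcripts/demo.txt

# List every transcript that mentions a keyword, case-insensitively
grep -ril "whisper" transcripts/   # prints: transcripts/demo.txt
```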
A variation for a German-language video:
whisper "meinVideo.mp4" \
--model medium \
--language de \
--task transcribe \
--fp16 False \
--output_dir ./_Transkripte \
--output_format txt
💡 Depending on your video and the parameters you choose, the process may take some time to complete. Once you have the text file, you can feed it into an LLM to generate a summary or perform any other task you need.
Whisper CLI Cheat Sheet
Here’s a quick reference to the most common Whisper command-line options:
Basic Usage
whisper input.mp4 --model small --language en --output_format txt
Models
--model tiny (fastest, least accurate)
--model base
--model small
--model medium
--model large (most accurate, slowest)
Input & Output
# Format of the transcript.
--output_format [txt|json|srt|vtt|tsv|all]
# txt: plain text
# json: structured data with segments and timestamps
# srt: SubRip subtitle file
# vtt: WebVTT subtitle file
# tsv: tab-separated values (start, end, text)
# Directory for saving transcripts. Default: current folder.
--output_dir path/
# There is no flag for a custom output filename; the transcript is always
# named after the input file (e.g., input.mp4 becomes input.txt).
Language & Translation
# Set input language (ISO code). If omitted, Whisper tries auto-detection.
--language en
# Transcribe speech into the same language.
--task transcribe (default)
# Translate non-English speech directly into English text.
--task translate
Performance & Precision
# Select computation device (CPU or GPU).
--device cpu / --device cuda
# Disable half precision and compute in 32-bit floats (for CPU-only
# machines or GPUs without FP16 support).
--fp16 False
# Sampling temperature; higher values produce more diverse results.
--temperature 0.0
# Number of candidates to consider during sampling (default: 5).
--best_of N
Segments & Timestamps
# Maximum characters per subtitle line (requires --word_timestamps True).
--max_line_width N
# Maximum lines per segment (requires --word_timestamps True).
--max_line_count N
# Comma-separated list of token IDs to suppress; the default "-1"
# suppresses most special characters.
--suppress_tokens -1
More examples
# Transcribe German audio into text
whisper talk.mp3 --model medium --language de --output_format txt
# Create English subtitles (SRT) from a Spanish video
whisper video.mp4 --model large --language es --task translate --output_format srt
# Save transcripts in a custom folder
whisper input.wav --model small --output_dir transcripts/
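The one-off commands above extend naturally to whole folders. A batch sketch, assuming the .mp4 files sit in the current directory; the leading `echo` makes it a dry run that only prints the commands, so remove the `echo` to actually transcribe:

```shell
# Dry run: print the whisper command that would be executed for each .mp4
for f in *.mp4; do
  echo whisper "$f" --model small --language en --output_dir transcripts/ --output_format txt
done
```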