Local audio transcription with mlx-whisper: what actually works

In Tech | 4 minute read | by Constantin Gonzalez
[Screenshot: a terminal running mlx_whisper, with the subtitle line 'Learn the terrain.' repeated across consecutive timestamps from 02:16 to 02:29.]

I’ve been transcribing my YouTube videos locally for a few months now. It took some trial and error to get a setup that actually works reliably, so here’s what I learned.

The problem

I wanted good subtitles for my videos—for accessibility, but also so I could feed the transcripts to Claude for generating chapter markers and summaries. The auto-generated subtitles I saw on YouTube weren’t great: weird punctuation, mangled technical terms. And I didn’t want another SaaS subscription.

So I went down the rabbit hole.

First attempt: OpenAI Whisper

The obvious choice: OpenAI’s Whisper. Open source, well-known, everyone says it’s amazing.

And it is—kind of. The quality is good, but on my MacBook Pro M1 it was painfully slow. We’re talking much slower than real-time, with minutes of loading before it even started processing. It wasn’t using the GPU at all.
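
For reference, my baseline looked something like this (reconstructed from memory using the stock openai-whisper CLI, so treat it as a sketch rather than my exact invocation):

pip install openai-whisper
whisper video.mp4 --model large-v3 --output_format srt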

There had to be something better, so I kept on searching.

The solution: mlx-whisper

mlx-whisper runs Whisper on Apple’s MLX framework, which is optimized for Apple Silicon. The difference was night and day. Same 5-year-old MacBook, but suddenly transcription was faster than real-time. (Yes, there’s still an initial delay the first time it downloads the model weights from the Internet, but it caches them, so no big deal. There’s also a lag for loading the weights into memory, but overall it’s still much, much faster than the original Whisper.)

Installation is dead simple if you use uv:

uv tool install mlx-whisper

That’s it. No ML driver headaches, no dependency hell.
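
If you’re not a uv person, mlx-whisper is also on PyPI, so a plain pip install inside a virtual environment should work just as well:

pip install mlx-whisper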

The Crisis

For a few weeks, life was good. Then I was transcribing a new video with tips on starting a new tech job, and the output looked like this:

Learn the terrain.
Learn the terrain.
Learn the terrain.
Learn the terrain.
Learn the terrain.

The same words, repeating for entire paragraphs. My transcript was useless.

I almost gave up and paid for a SaaS service. But I gave it one more shot and asked Perplexity about the problem. Turns out this is a known Whisper issue: the model conditions each new segment on its previous output to improve accuracy, but sometimes that feedback loop gets stuck repeating itself.

The Fix: One Flag

--condition-on-previous-text False

That’s it. One flag. It tells Whisper not to feed its previous output back in as context, which prevents the repetition loop. Accuracy is still excellent without that context.

My Complete Setup

Here’s the Fish shell function I use every week:

function gen-subtitles
    mlx_whisper \
        --task transcribe \
        --model mlx-community/whisper-large-v3-turbo \
        --language en \
        --output-format srt \
        --word-timestamps True \
        --max-line-width 42 \
        --max-line-count 2 \
        --condition-on-previous-text False \
        "$argv"
end
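
If you type the function in at the prompt rather than editing config files, fish can persist it for you:

funcsave gen-subtitles

That writes it to ~/.config/fish/functions/, so it’s available in every new shell session.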

What each flag does:

  • --task transcribe: Transcribe rather than translate (Whisper can also translate speech into English, though I haven’t tried that yet)
  • --model mlx-community/whisper-large-v3-turbo: Best balance of accuracy and speed I’ve found
  • --language en: Skip language detection, improves accuracy
  • --output-format srt: Standard subtitle format, works everywhere
  • --word-timestamps True: More precise timing
  • --max-line-width 42 and --max-line-count 2: Keeps subtitles readable on screen
  • --condition-on-previous-text False: The hero flag that prevents repetition

Now I just run gen-subtitles video.mp4 and get a clean SRT file in a few minutes.
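
When a backlog piles up, a quick fish loop handles a whole folder:

for f in *.mp4
    gen-subtitles $f
end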

Not 100% perfect, but close

mlx-whisper is really good, but not perfect. I still do a quick pass in vi to fix maybe 5% of the output—a comma that should be a colon, a technical term that got mangled, adding quotes around citations.

But that’s editing, not rewriting. 95% of the work is done well. Much better than paying $20/month or doing it by hand.

Beyond Subtitles

Once you have easy local transcription, you start seeing uses everywhere:

  • Voice memos: Capture ideas while walking, transcribe later into searchable notes
  • Meeting recordings: Transcribe locally, everything stays on your machine
  • Learning from podcasts: Transcribe, then ask Claude to summarize or extract key points (I use Simon Willison’s llm tool for this; see the sketch after this list)
  • Content repurposing: Transcript → chapter markers → show notes → blog post
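
As a sketch of that podcast workflow (assuming you’ve installed llm and configured a model, for example via the llm-anthropic plugin; the file name is just an example):

cat podcast-transcript.srt | llm -s "Summarize this transcript and extract the key points as bullets"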

The transcript turns out to be useful for so much more than just subtitles.

Video

Here’s a short video walkthrough with a bit more personality and some fun drama elements:

[Embedded YouTube video]