Martin Čajka: Beyond subtitles with AI

Martin Čajka / 29.04.2026
How We Built a Custom AI Dubbing Pipeline for 1/10th the Cost

In a growing company, communication is everything. But what happens when your team is a mix of Slovak locals and international colleagues?

At our core, we are a Slovak company. We value the ease and “hominess” of speaking our native tongue in meetings. However, we hit a wall because we also value our non-Slovak-speaking colleagues. We wanted to keep our internal culture intact while ensuring everyone stayed in the loop.

The “Transcribing” Nightmare

Initially, we relied on standard tools like Google Meet’s live transcripts. To put it bluntly, it was a nightmare: they simply don’t work for Slovak-to-English. We could transcribe the recording afterwards and upload the video with subtitles, but when you have to split your attention between a technical screen share and translated subtitles, you lose the most important thing: focus.

We realized that for educational videos, audio is king. We needed a way to provide English-speaking colleagues with a native-sounding experience without forcing Slovak speakers to switch languages mid-presentation.

👉🏻 Curious how we build things at Moxymind? Check out more technical recipes from our Moxyminders.

The Search for a Solution

We started where everyone does: the gold standard. We tested ElevenLabs, and the results were, frankly, incredible. The voice cloning was spooky-accurate, and the “dubbing” effect (syncing) was top-tier.

However, we ran into a math problem. Our use case involves 1 to 2 videos per month, but each video is between 60 and 90 minutes. At that volume, the pricing tier required for ElevenLabs felt overkill for our internal “good enough” requirements. We loved the tech, but the ROI wasn’t there for casual internal education.

The Winner: Our Own AI Pipeline

We decided to see if we could bridge the gap ourselves. By leveraging Voxtral models from Mistral for Speech-to-Text (STT) and Text-to-Speech (TTS), other models for the translation logic, and some “magic” business logic on our side, we built a custom pipeline.

The Tech Stack:

  • STT/TTS: Voxtral models
  • Translation/Logic: Mistral Large 3 + Magistral Medium
  • Voice Cloning: Custom implementation
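At a high level, the pipeline chains three stages: transcribe, translate, synthesize. Here is a minimal orchestration sketch of that idea — the stage internals are stubbed out, and the function names, `Segment` structure, and sample data are our illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds into the source video
    end: float
    text: str     # transcribed, later translated, text

def transcribe(audio_path: str) -> list[Segment]:
    """STT stage -- a Voxtral model in the real pipeline.
    Stubbed here with one hard-coded Slovak segment."""
    return [Segment(0.0, 4.2, "Ahojte, dnes si ukážeme nový deployment.")]

def translate(segments: list[Segment]) -> list[Segment]:
    """Translation stage -- Mistral Large / Magistral in production.
    Stubbed: pretend every segment came back in English."""
    return [
        Segment(s.start, s.end, "Hi everyone, today we'll look at the new deployment.")
        for s in segments
    ]

def synthesize(segments: list[Segment], voice_id: str) -> list[bytes]:
    """TTS + voice-cloning stage. Stubbed with placeholder audio bytes."""
    return [b"<wav>" for _ in segments]

def dub(audio_path: str, voice_id: str) -> list[tuple[Segment, bytes]]:
    """Glue the three stages together, keeping the original timings."""
    segments = translate(transcribe(audio_path))
    clips = synthesize(segments, voice_id)
    return list(zip(segments, clips))
```

The key design point is that each segment keeps its original start/end timestamps all the way through, so the synthesized English clips can be laid back onto the source video's timeline.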

The Results: Is it better?

It depends on how you define “better.”

  • The Voice: We managed to keep the voice cloning feature. Our colleagues still sound like themselves, just in English, and the result is quite comparable with ElevenLabs. Good job, Mistral.
  • The Timing: If ElevenLabs is a 10/10 on lip-syncing and timing, we are sitting at a solid 7/10. There’s a slight “dubbed movie” feel, but for an educational video where the viewer is looking at code or a UI, it is perfectly acceptable.
  • The Price: This is the kicker. Our current operating cost is roughly 10% of the ElevenLabs quote.
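One reason for the slight “dubbed movie” feel is that translated speech rarely matches the original segment length. A simple mitigation is to time-stretch each synthesized clip toward its original slot, clamped so voices stay natural. This is our own illustrative sketch of that idea, not the exact logic in the pipeline:

```python
def stretch_factor(original_len: float, dubbed_len: float,
                   min_rate: float = 0.85, max_rate: float = 1.25) -> float:
    """Playback-rate multiplier that fits a dubbed clip into its
    original time slot. A rate > 1.0 speeds the clip up, < 1.0 slows
    it down; clamping keeps the voice from sounding chipmunked."""
    if original_len <= 0:
        raise ValueError("original segment length must be positive")
    rate = dubbed_len / original_len
    return max(min_rate, min(max_rate, rate))
```

For example, a 5-second English clip dubbing a 4-second Slovak segment gets a rate of 1.25; anything longer hits the clamp and simply spills past its slot, which is where the residual “dubbed” feel comes from.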

Final Thoughts

By building our own pipeline, we solved the “focus” problem for our international colleagues without breaking the bank. We’ve managed to keep our Slovak meetings friendly and natural, while giving our English-speaking teammates the ability to learn in their own language—voices included. And last but not least, we learned a lot and had fun implementing it.

Martin Čajka

I’m a QA guy with a strong focus on both frontend and backend test automation, and with a passion for digging into new technologies and figuring out how AI can support us in what we do. Besides ones and zeros, I love enjoying nature in all its forms.
