
Is there a way to convert dvdsub (image-based) subtitles to srt, for example with mencoder or ffmpeg combined with tesseract?

I'm looking for something command-line based, and I'm OK with having to go through a couple of passes.

I'm less keen on GUI-based tools.

simone

1 Answer


You probably already found a solution, but since this was the first search result for 'ffmpeg ocr dvdsub srt', here's a tool I use.

https://github.com/ruediger/VobSub2SRT

It is not perfect, and its output may require some editing.

I was looking for an ffmpeg feature that does this better than my method, but I found this question instead and remembered the rabbit hole I had to go down, so I hope this helps someone.

Here's my process

For extracting dvdsub from a .mkv

Using mkvextract from mkvtoolnix-cli

mkvextract video.mkv tracks 2:video.idx

  • arg 1 - The filename of the video containing the dvdsub track
  • arg 2 - The extraction mode (tracks)
  • arg 3 - [Track ID of the dvdsub stream]:[Desired filename of extracted files].idx

This example produces a video.idx and a video.sub file.
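
If you don't know which track holds the dvdsub, mkvmerge -i (also part of mkvtoolnix) lists the track IDs. Below is a minimal sketch for picking out the VobSub ones. The function name vobsub_track_ids is made up for illustration, and I'm assuming the identify output contains lines like "Track ID 2: subtitles (VobSub)" (older versions may print S_VOBSUB instead), so check it against what your version actually prints.

```shell
# Hypothetical helper: reads `mkvmerge -i` output on stdin and prints
# the ID of every track reported as a VobSub (dvdsub) subtitle track.
# Assumes lines of the form "Track ID <n>: subtitles (VobSub)".
vobsub_track_ids() {
    sed -n -E 's/^Track ID ([0-9]+): subtitles \((VobSub|S_VOBSUB)\).*$/\1/p'
}

# Usage (needs mkvtoolnix installed):
#   mkvmerge -i video.mkv | vobsub_track_ids
```

The number it prints is what goes before the colon in the mkvextract command above.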

Generating subrip from .idx and .sub files

Using vobsub2srt

vobsub2srt uses tesseract, and I found that tesseract's legacy mode works best.

vobsub2srt --tesseract-oem 0 video

  • arg 1 - Tesseract OCR engine mode option (run tesseract --help-oem for the modes)
  • arg 2 - 0 selects legacy mode
  • arg 3 - Base filename shared by BOTH the .idx and .sub files, WITHOUT extension

This example produces video.srt.
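
If you have several extracted tracks, the same command can be looped over every .idx/.sub pair in a directory. This is only a sketch: the function name convert_all_vobsub and the VOBSUB2SRT variable are my own invention (the variable just lets you swap in echo for a dry run), and the vobsub2srt options are the ones from the command above.

```shell
# Hypothetical helper: run vobsub2srt on every .idx/.sub pair in a directory.
# Set VOBSUB2SRT=echo to print the arguments instead of running the tool.
convert_all_vobsub() {
    dir=${1:-.}
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue      # no .idx files: the glob stays literal
        base=${idx%.idx}               # vobsub2srt wants the name without extension
        if [ ! -e "$base.sub" ]; then
            echo "skipping $idx: no matching .sub file" >&2
            continue
        fi
        "${VOBSUB2SRT:-vobsub2srt}" --tesseract-oem 0 "$base"
    done
}

# Usage:
#   convert_all_vobsub /path/to/extracted/subs
```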

Inspect and edit subrip file

Mistakes I've run into

  • '|' instead of 'I'. Tesseract's legacy mode doesn't seem to make this mistake often.
  • ` instead of '
  • Spacing: when a line starts with '-', there may be no space between '-' and the first word.
  • Missing ' and "
  • 'I' or '|' instead of '['. Legacy mode doesn't seem to make this mistake often.

Edit it

If you're not familiar with subrip files, they are plain text and can simply be opened in a text editor.

grep, vim, and sed are your friends.

Most mistakes from legacy mode can easily be ignored, however.
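
Some of the mistakes listed above are mechanical enough for sed. Here is a rough sketch; the function name fix_ocr_srt is hypothetical, and the rules are heuristics I'm assuming are safe for typical subtitles (for example, that '|' never legitimately appears in the dialogue). It can't restore missing quotes, so still inspect the result.

```shell
# Hypothetical helper: filter an .srt on stdin and fix common OCR slips.
#   1. backtick          -> apostrophe
#   2. leading '|' or 'I' on a line that has a closing ']' but no
#      opening '[' before it -> '['
#   3. any remaining '|' -> 'I'
#   4. leading '-' glued to the first word -> '- word'
# Index and timestamp lines contain none of these patterns, so they pass through.
fix_ocr_srt() {
    sed \
        -e "s/\`/'/g" \
        -e 's/^[|I]\([^][]*]\)/[\1/' \
        -e 's/|/I/g' \
        -e 's/^-\([^ -]\)/- \1/'
}

# Usage:
#   fix_ocr_srt < video.srt > video.fixed.srt
```

Writing to a new file keeps the raw OCR output around for a diff, which makes it easy to spot a rule that fired where it shouldn't have.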

Replacing dvdsub with subrip (srt)

Using ffmpeg

ffmpeg -i video.mkv -i video.srt -c copy -c:s subrip -map 0:v -map 0:a -map 1 final-video.mkv

  • arg 1 & 2 - Input #1 - Video file containing dvdsub
  • arg 3 & 4 - Input #2 - Subrip file
  • arg 5 - Codec option for all streams
  • arg 6 - Copies streams without re-encoding (only video and audio get copied, since the dvdsub track is never mapped)
  • arg 7 - Subtitle codec option (overrides arg 5 for subtitles)
  • arg 8 - Selects subrip as the subtitle codec (may be redundant, but better safe than sorry)
  • arg 9 & 10 - Maps all video streams from the 1st input (1st in the output)
  • arg 11 & 12 - Maps all audio streams from the 1st input (next in the output)
  • arg 13 & 14 - Maps the subtitle stream from the 2nd input (last in the output)
  • arg 15 - Output filename
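
If you do this for more than one file, the command can be wrapped in a small function. This is a sketch, not the only way to do it: the function name replace_dvdsub_with_srt and the FFMPEG variable (set it to echo for a dry run) are made up here, and the ffmpeg arguments are exactly the ones explained above.

```shell
# Hypothetical wrapper around the ffmpeg command above.
# Usage: replace_dvdsub_with_srt INPUT_VIDEO INPUT_SRT OUTPUT_VIDEO
replace_dvdsub_with_srt() {
    in_video=$1
    in_srt=$2
    out_video=$3
    # ffmpeg cannot write to the file it is reading from
    if [ "$in_video" = "$out_video" ]; then
        echo "output must differ from input" >&2
        return 1
    fi
    "${FFMPEG:-ffmpeg}" -i "$in_video" -i "$in_srt" \
        -c copy -c:s subrip \
        -map 0:v -map 0:a -map 1 \
        "$out_video"
}

# Usage:
#   replace_dvdsub_with_srt video.mkv video.srt final-video.mkv
# To check the result, list the subtitle codecs in the output:
#   ffprobe -v error -select_streams s -show_entries stream=codec_name -of csv=p=0 final-video.mkv
```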

And done. I hope there is no character limit on here.