I’m trying to streamline the process of ripping my Blu-ray collection. The biggest bottleneck has always been dealing with subtitles and converting from image-based PGS to text-based SRT. I usually use SubtitleEdit, which does okay, with occasional mistakes. My understanding is that it combines Tesseract with a decent correction library to fix errors.
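
For context, the command-line route below works on a standalone PGS track rather than the whole disc; here’s a minimal sketch of pulling one out of an MKV rip with ffmpeg via Python, where the filenames and the 0:s:0 stream index are placeholders:

```python
import subprocess

# Copy the first subtitle stream (PGS) out of the rip into a standalone .sup
# file that the OCR tools can read. Check the actual track layout with ffprobe
# first; "movie.mkv" and "subs.sup" are placeholders.
subprocess.run(
    ["ffmpeg", "-i", "movie.mkv", "-map", "0:s:0", "-c", "copy", "subs.sup"],
    check=True,
)
```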

I’m trying to find something that works from the command line and found pgs-to-srt. It also uses Tesseract, but apparently without the correction library, and the results are… not good:

Here’s the first two minutes of Love, Actually:

1
00:01:13,991 --> 00:01:16,368
DAVID: Whenever | get gloomy
with the state of the world,

2
00:01:16,451 --> 00:01:19,830
| think about
the arrivals gate
alt [Heathrow airport.

3
00:01:20,38 --> 00:01:21,415
General opinion
Started {to make oul

This is just OCR of plain text on a transparent background. How is it this bad? This is using the Tesseract “best” training data.
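
To see what the raw Tesseract pass produces outside of pgs-to-srt, here’s a minimal sketch assuming pytesseract, the eng model from tessdata_best, and one exported subtitle bitmap; the filename and the --oem/--psm settings are my assumptions, not pgs-to-srt’s actual configuration:

```python
import pytesseract
from PIL import Image

# Hand one subtitle bitmap straight to Tesseract's LSTM engine.
# "frame_0001.png" is a placeholder; point TESSDATA_PREFIX at the
# tessdata_best models to match the "best" training data mentioned above.
text = pytesseract.image_to_string(
    Image.open("frame_0001.png"),
    lang="eng",
    config="--oem 1 --psm 6",  # LSTM engine; assume a uniform block of text
)
print(text)
```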

Edit: I’ve been playing around more with pgs-to-srt (which also uses Tesseract) and discovered that subtitles with black outlines really mess with it. I made some improvements:

https://github.com/wydengyre/pgs-to-srt/pull/348

  • j4k3@lemmy.world · 17 days ago

    I’ve never had great results with Tesseract if the image has compression, so the mixed background sounds like a nightmare. There’s probably some JavaScript stream in there, but good luck accessing it. BR is hot garbage for a standard.

    • ch00f@lemmy.world (OP) · 17 days ago

      That’s the thing: there isn’t a background. The PGS layer is separate, which is why it’s so surprising that the error rate is so high.

      • j4k3@lemmy.world · edited · 16 days ago

        OCR 5 from F-Droid was really good for me 2+ years ago, but when I tried it more recently it was garbage. It really stood out to me back then because, around 5 years ago, I had tried translating a Chinese datasheet for one of the Atmel uC clones, and OCR was not fun at the time.

        Maybe have a look at Hugging Face Spaces and see if anyone has a better methodology set up as an example. Or look at the history of the models and see if one of the older ones is still available.

        • ch00f@lemmy.world (OP) · 17 days ago

          I think I spoke too soon when I said the text didn’t have a background or was otherwise clean. SubtitleEdit always shows it on a white background, but it looks like the text itself actually has a white border, which I’m sure is confusing the OCR. See my other comment for examples.

          I’m going to start by seeing if I can clean up the text, and if not, I’ll look into Hugging Face and whatnot. Thanks for the tips.

  • ch00f@lemmy.world (OP) · 17 days ago

    Found out that pgs-to-srt can export images, so you can see what it’s looking at.

    Starting to make sense why it’s so bad. Wondering if I can add a preprocessor to clean the images up before they hit the OCR.
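
    Not the actual fix from the PR above, just a minimal sketch of the kind of preprocessing that might help, assuming Pillow and an exported RGBA subtitle bitmap: flatten the transparency, keep only the bright text fill so the dark outline drops out, and invert to dark-on-light for Tesseract. The threshold value and filenames are guesses.

    ```python
    from PIL import Image, ImageOps

    def clean_subtitle(path: str) -> Image.Image:
        """Hypothetical cleanup pass for one exported PGS bitmap before OCR."""
        img = Image.open(path).convert("RGBA")

        # Flatten the transparent background onto solid black so the alpha
        # channel doesn't turn into noise when converting to grayscale.
        flat = Image.new("RGBA", img.size, (0, 0, 0, 255))
        flat.alpha_composite(img)
        gray = flat.convert("L")

        # Keep only the bright text fill; the dark outline and the background
        # both fall below the threshold (200 is a guess, not a tuned value).
        mask = gray.point(lambda px: 255 if px > 200 else 0)

        # Tesseract generally does better with dark text on a light background.
        return ImageOps.invert(mask)

    # Usage sketch:
    # clean_subtitle("frame_0001.png").save("frame_0001_clean.png")
    ```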