[Update w/ Winner] Linux OCR software is broken: a review of Linux OCR software failures

Post by loman » Fri Apr 16, 2021 9:40 pm

This particular topic is, once again, not really a science topic, but it is sciency.

I need to OCR a bunch of things, and OCR software in general has been pretty bad throughout the years. And Linux OCR software is bad squared.

A number of software packages (possibly all of them) are trying to be front ends to the Tesseract OCR engine. The thing I don't like about the Tesseract engine is that its documentation always discusses training the software. That sounds to me like the software doesn't actually work. I've seen this over and over, where people invite you to use stuff and then say, "oh, by the way, you don't mind helping to fix it, do you?". That's just wrong. So, I suspect the entire Linux community is using some hippie OCR software that doesn't work in most cases.

uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
LET'S TEST

Testing conditions: Camera images of book pages were assembled into a PDF using Ghostview. The test is to open the file, OCR it as English text, then save the PDF with searchable text.

Testing Machine: MX Linux 19.2, ThinkPad W530, Core i7, w/ max RAM.

Here's the list of software I'm testing (in order of testing):
LIOS
YAGF
OCRMYPDF
gImageReader
PDF Studio Pro (demo)
OCRfeeder



Test Results:
LIOS: FAIL
The interface looks nice and well organized, but the job took far too long and was killed. There was no progress update. Looks like a solid fail.

On second test, it looks like it is trying to extract the images from my PDF file before it updates anything. This is super-annoying. It's a massive design failure in the program. Any large file will just sit there, not allow you to do anything, and you have no idea what's going on. Hey guys, fix your stuff. Otherwise, it's a pleasant looking UI.
YAGF: FAIL
The interface looks primitive. It segfaulted opening the PDF. So, no way.
OCRMYPDF: FAIL
It filled my /tmp device, then failed. What a great program. :roll: At least it cleaned up after itself instead of leaving my disk full. This is supposed to be a command line interface to Tesseract. What I'm seeing is Ghostscript taking up the CPU initially, then Python comes in. Apparently, it's a Python script. There are a large number of command line options. For obvious reasons, you really want a graphical interface for this task. So, this is a non-starter. It also takes up 100% of the CPU. At least LIOS had the good sense to only launch on half the cores.
gImageReader v. 3.3.0: ALMOST WIN
I was getting so accustomed to the cadence of failure that I was surprised by software working at all and not bombing. This particular package has a relatively nice graphical interface. It's not that intuitive how to accomplish the obviously simple workflow of OCR on a PDF file, but it sort of works with a little exploration. It has a continuously updating display of the work as it moves through the pages. It's not fast by any means. The program is doing something incredibly stupid with memory allocation... the memory use varies from 1.9GB to 2.5GB. There's no reason to be allocating and deallocating memory like that. I suspect the slowness is somehow linked to object-oriented design.
So, everything worked in the sense of OCR working. I was so pleased I decided to save, which is when it locked up my system. Needless to say, the output was unreadable. So, a pass on the OCR and a fail on being able to export it to PDF.
PDF Studio: (professional software) ~$130 FAIL
The code is written in Java. Java has memory management issues. It ran out of heap space on an overnight run (that's how slow the OCR is). PDF Studio (Qoppa Software) has a well-done, intuitive interface, even if it does use the M$ ribbon menu. It seems to use all cores of my i7, but doesn't use hyperthreading somehow. Each core is only at 50% utilization, max. It has a real-time progress display, but it annoyingly doesn't respond to repaint requests, so if you move or lower the window, it won't repaint until it feels like it. Also, that means you can no longer control the background process. :( Not a good thing.

Another thing is that it's painfully slow, much slower than the Tesseract software. It failed on an overnight run. Why the hell are people so in love with Java? Memory management fails are a common occurrence with every Java program I've ever used. And YET, people keep trying to bash C++ over the head about its bad memory management. Can you say Cult of Java?
OCRfeeder: Massive Fail!
It managed to slurp all the images into the interface, then when I tried to write the output, it locked up XFCE and pegged the disk usage. Another stunning Java victory over Linux. This memory oopsie brought to you by Java-11.
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
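
The /tmp failure reported for OCRMYPDF above is the kind of thing a tool can check for before it starts. Here is a minimal sketch in Python (standard library only) of such a pre-flight check. The function name temp_space_ok and the safety_factor of 10 are made up for illustration; they are not taken from any of the packages tested, and the real ratio between a PDF and its extracted page images depends on the scan.

```python
import os
import shutil
import tempfile


def temp_space_ok(input_path, safety_factor=10, tmp_dir=None):
    """Return True if the temp directory can hold an estimated working set.

    safety_factor is a guess (hypothetical): page images extracted from a
    PDF are often many times larger than the compressed file itself.
    """
    # Default to the directory Python itself would use for temp files.
    tmp_dir = tmp_dir or tempfile.gettempdir()
    needed = os.path.getsize(input_path) * safety_factor
    free = shutil.disk_usage(tmp_dir).free
    return free >= needed
```

If the check fails, one workaround is to point the TMPDIR environment variable at a larger disk before launching the program. Python's tempfile module honors TMPDIR; whether a given OCR tool does is something to verify in its own documentation.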

The fundamental problem here is lack of experience on the part of the software designers. It's not EVER necessary to gulp the whole document into memory or pre-process the whole document.... EVER. You never know how large the document is going to be. At the very least, you should check to see if the document size is larger than can be conveniently gulped or pre-processed.
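
To make that concrete, here is a rough sketch in Python of working through a document in fixed-size page batches and reporting progress after each one, instead of extracting everything up front. It is not how any of the packages above are implemented. The names page_batches and process_document are invented for this example, and process_range stands in for whatever would actually extract and OCR a range of pages.

```python
def page_batches(page_count, batch_size):
    """Yield (first, last) 1-based inclusive page ranges covering the document."""
    if page_count < 0 or batch_size < 1:
        raise ValueError("page_count must be >= 0 and batch_size >= 1")
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


def process_document(page_count, process_range, batch_size=10, report=print):
    """Process a document one batch at a time, reporting after each batch.

    process_range(first, last) is a hypothetical callback that handles one
    range of pages; only that range needs to be in memory or on disk.
    """
    done = 0
    for first, last in page_batches(page_count, batch_size):
        process_range(first, last)
        done = last
        report(f"pages {first}-{last} done ({done}/{page_count})")
    return done
```

With this shape, memory and temp usage scale with batch_size rather than with the size of the document, and the user sees a progress line after every batch. A GUI could pass its own report callback to update a progress bar.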

Python is intrinsically slow and not suited to image analysis, even though everyone seems to believe these days that hooking into a C-based image library gives Python super powers. And Java :twisted: ..... how is it that the savior language of superior memory management always manages to run out of memory AND lock up my interface so that I have to power cycle to get it to stop? HOW? I was only mildly anti-Java before this experience; now I'm firmly committed to killing off Java, Rust, and every other pretender language. They don't have magic bullets, only magic PR departments.
Last edited by loman on Sun May 09, 2021 8:46 pm, edited 1 time in total.

Post by loman » Sun May 09, 2021 8:45 pm

Master PDF Editor (free edition): Winner
It's quite amazing. It uses Tesseract behind the scenes to perform the OCR (like everybody else), but it has proper controls and automatic language installation, everything in the right place and working. I loaded my large PDF file, selected Document->OCR, installed the English language, and boom, PDF with searchable text. No Crash! I don't know what to say. On anything other than Linux, this wouldn't be a big story.

Licenses for the full version start at $70 and go down with quantity. The full version allows you to edit documents and save as optimized PDFs. Both of these things seem worthwhile. The software's organization is decent, and the performance is optimal (so far as I can tell). There are other features like forms, watermarks, digital signing, but those are available in other packages (probably not as easy, but still possible).

We finally have a win for Linux. :D