I need to OCR a bunch of things, and OCR software in general has been pretty bad throughout the years. And Linux OCR software is bad squared.
A number of software packages (possibly all of them) are front ends to the Tesseract OCR engine. The thing I don't like about Tesseract is that its documentation always discusses training the software. That sounds to me like the software doesn't actually work. I've seen this over and over: people invite you to use something and then say, "oh, by the way, you don't mind helping to fix it, do you?" That's just wrong. So I suspect the entire Linux community is using some hippie OCR software that basically doesn't work in most cases.
Testing conditions: camera images of book pages were assembled into a PDF using Ghostview. The test: open the file, OCR it as English text, then save the PDF with a searchable text layer.
Testing Machine: MX Linux 19.2, Thinkpad W530 core i7 w/ max RAM.
Here's the list of software I'm testing (in order of testing):
YAGF
OCRMYPDF
gImageReader v. 3.3.0
PDF Studio Pro (demo)
OCRfeeder
YAGF: FAIL
The interface looks nice and well organized, but the job took far too long and was killed. There was no progress update. Looks like a solid fail.
On a second test, it looks like it tries to extract all the images from my PDF before updating anything. This is super annoying, and a massive design failure in the program. Any large file will just sit there, you can't do anything, and you have no idea what's going on. Hey guys, fix your stuff. Otherwise, it's a pleasant-looking UI.
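Going silent while extracting every page up front is the avoidable part. A minimal sketch of the page-loop-with-progress pattern (this is not YAGF's actual code; `extract_page` and `ocr_page` are hypothetical stand-ins for the real work):

```python
# Sketch: report progress per page instead of going silent during extraction.
# extract_page() and ocr_page() are hypothetical stand-ins, NOT a real OCR API.

def extract_page(pdf_path, page_no):
    """Stand-in for pulling one page image out of the PDF."""
    return f"{pdf_path}:page{page_no}"

def ocr_page(image):
    """Stand-in for running the OCR engine on one page."""
    return f"text of {image}"

def ocr_document(pdf_path, page_count, on_progress=print):
    """Extract and OCR one page at a time, telling the UI after each page."""
    results = []
    for i in range(page_count):
        image = extract_page(pdf_path, i)          # one page at a time
        results.append(ocr_page(image))
        on_progress(f"page {i + 1}/{page_count} done")  # UI stays informed
    return results

texts = ocr_document("book.pdf", 3, on_progress=lambda msg: None)
```

The point is structural: the user sees movement after every page instead of staring at a frozen window for the whole extraction phase.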
OCRMYPDF: FAIL
The interface looks primitive. It segfaulted opening the PDF. So, no way.
gImageReader v. 3.3.0: ALMOST WIN
It filled my /tmp device, then failed. What a great program. At least it cleaned up after itself instead of leaving my disk full. This is supposed to be a command-line interface to Tesseract. What I'm seeing is Ghostscript taking up the CPU initially, then Python coming in; apparently it's a Python script. There are a large number of command-line options, but for obvious reasons you really want a graphical interface for this task. So, this is a non-starter. It also takes up 100% of the CPU. At least LIOS had the good sense to only launch on half the cores.
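Leaving half the cores free, the way LIOS apparently does, is cheap to implement. A sketch of capping a worker pool in Python (the `fake_ocr` function is a placeholder, not any real engine's API):

```python
# Sketch: limit an OCR worker pool to half the logical cores,
# leaving the rest of the desktop responsive while pages grind through.
import os
from multiprocessing.pool import ThreadPool

def fake_ocr(page):
    """Placeholder for per-page OCR work."""
    return page * 2

# Half the logical cores, but never fewer than one worker.
workers = max(1, (os.cpu_count() or 2) // 2)

with ThreadPool(workers) as pool:
    results = pool.map(fake_ocr, range(8))
```

The same `cpu_count() // 2` cap works with a process pool for CPU-bound engines; the one-liner is the whole trick.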
PDF Studio (professional software, ~$130): FAIL
I was getting so accustomed to the cadence of failure that I was surprised by software working at all and not bombing. This package has a relatively nice graphical interface. It's not that intuitive how to accomplish the obviously simple workflow of OCR on a PDF file, but it sort of works with a little exploration. It has a continuously updating display of the work as it moves through the pages. It's not fast by any means. The program is doing something incredibly stupid with memory allocation: the memory use oscillates between 1.9 GB and 2.5 GB. There's no reason to keep allocating and deallocating memory. I suspect the slowness is somehow linked to the object-oriented design.
So, everything worked in the sense that the OCR worked. I was so pleased I decided to save, which is when it locked up my system. Needless to say, the output was unreadable. So: a pass on the OCR, and a fail on being able to export it to PDF.
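If that 1.9–2.5 GB sawtooth really is a fresh allocation per page, the standard fix is a scratch buffer allocated once and reused. A sketch of the idea (this is my illustration, not Qoppa's code; `render_page_into` is a hypothetical stand-in):

```python
# Sketch: reuse one scratch buffer across pages instead of allocating
# and freeing a fresh page-sized buffer on every iteration.

def render_page_into(buf, page_no):
    """Hypothetical stand-in: fill the buffer with page pixels in place."""
    buf[0] = page_no % 256  # pretend rendering; real code fills the whole buffer
    return buf

PAGE_BYTES = 1024 * 1024          # assumed fixed maximum page size
scratch = bytearray(PAGE_BYTES)   # allocated once, up front

first_bytes = []
for page in range(5):
    render_page_into(scratch, page)   # no new allocation per page
    first_bytes.append(scratch[0])
```

With a fixed upper bound on page size, memory use stays flat for the whole run instead of cycling with the garbage collector.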
OCRfeeder: MASSIVE FAIL
The code is written in Java, and Java has memory-management issues. It ran out of heap space on an overnight run (that's how slow the OCR is). PDF Studio (Qoppa Software) has a well-done, intuitive interface by comparison, even if it does use the M$ ribbon menu. It seems to use all cores of my i7, but somehow doesn't use hyperthreading: each core sits at 50% utilization, max. It has a real-time progress display, but it annoyingly doesn't respond to repaint requests, so if you move or lower the window, it won't repaint until it feels like it. That also means you can no longer control the background process. Not a good thing.
Another thing is that it's painfully slow, much slower than the Tesseract software. It failed on an overnight run. Why the hell are people so in love with Java? Memory-management failures are a common occurrence with every Java program I've ever used. And YET, people keep trying to bash C++ over the head about its bad memory management. Can you say Cult of Java?
It managed to slurp all the images into the interface, then when I tried to write the output, it locked up XFCE and pegged the disk usage. Another stunning Java victory over Linux. This memory oopsie brought to you by Java 11.
The fundamental problem here is lack of experience on the part of the software designers. It is NEVER necessary to gulp the whole document into memory or pre-process the whole document. EVER. You never know how large the document is going to be. At the very least, you should check whether the document is larger than can be conveniently gulped or pre-processed.
Python is intrinsically slow and not suited to image analysis, even though everyone these days seems to believe that hooking into a C-based image library gives Python superpowers. And Java... how is it that the savior language of superior memory management always manages to run out of memory AND lock up my interface so badly that I have to power-cycle to get it to stop? HOW? I was only mildly anti-Java before this experience; now I'm firmly committed to killing off Java, Rust, and every other pretender language. They don't have magic bullets, only magic PR departments.