Blogs on Document Processing and OCR Technology

Document Image Processing Software: The Best Team Has the Best System

Written by Randall Kinard | December 12, 2019

Previously, I wrote to you all about the amazing technology that is Grooper’s OCR Synthesis, and I let you know that OCR is problematic and causes headaches. When we built Grooper, we knew we couldn’t stop at simply using the best OCR; we had to take it further.

This is where the best document image processing software comes in. While there are some off-the-shelf solutions for cleaning up scanned images, we envisioned something more powerful, with easy-to-use pattern matching, Atomic Regular Expression, and the ability to use fuzzy logic.

Why Do The Best Keep Winning?

I think it’s easy to say that over the last 10 or so years the New England Patriots have consistently been the best team in the NFL.

They’ve disrupted the norm, and even made enemies to become the consistent champions they are. To me, it’s obvious why this is the case:
The New England Patriots have implemented the best system for winning in today’s NFL.

Don’t get me wrong, the Patriots have had some great players on their team, but nothing has ever hinged on one specific person at any one specific time. Key players could be injured, yet they keep winning.

Every weather condition that could occur during a game has happened to New England, and they’ve overcome.

Go For The Win - Not Just Data ... Accurate Data

The name of the game is collecting data trapped in your documents. Many folks claim to be able to do this, but there are 32 teams in the NFL and only one gets to win. Grooper is the one to get it done.

OCR is notorious for throwing lots of unknown variables at you: low-quality images, lines and specks, logos, and more. The accuracy of your data suffers as a result.

If only there were a system that let you win, no matter what…

A Good Defense: Start With The Best Document Image

OCR is at the heart of data collection for image-based documents, so to get the best results from OCR, we need to give it the best image possible.

Grooper’s unique and highly configurable document image processing software gives users the ability to always get OCR started off on the right foot. Line, box, and barcode detection and removal (among other features) give our extremely capable OCR Synthesis system the best chance at getting accurate character reads.
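To give a feel for what line removal means, here is a minimal sketch in plain Python. It is not Grooper’s implementation: the function name, the list-of-rows image format, and the `min_length` parameter are all assumptions made for illustration. It simply erases any horizontal run of ink that is too long to be part of a character.

```python
def remove_horizontal_lines(image, min_length):
    """Erase horizontal runs of ink that are at least min_length pixels long.

    image is a list of rows; each pixel is 1 (ink) or 0 (background).
    Returns a cleaned copy and leaves the input untouched.
    """
    cleaned = [row[:] for row in image]
    for row in cleaned:
        start = None
        # The trailing 0 is a sentinel that closes a run touching the right edge.
        for x, pixel in enumerate(row + [0]):
            if pixel and start is None:
                start = x                      # a run of ink begins
            elif not pixel and start is not None:
                if x - start >= min_length:    # long run: treat it as a ruled line
                    row[start:x] = [0] * (x - start)
                start = None
    return cleaned


# A tiny 3-row "scan": text pixels on rows 0 and 2, a ruled line on row 1.
scan = [
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1, 0],
]
cleaned_scan = remove_horizontal_lines(scan, min_length=5)
```

A sketch this simple would also erase the parts of characters that sit on top of a line, so treat it as an illustration of the idea rather than a production cleanup step.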

Without these tools, you’d be fumbling around with bad OCR.

A Good Offense: Attack The Data With Powerful Pattern Matching

Regular Expressions are by no means unique to Grooper, but the way Grooper allows you to easily write these expressions is.

Grooper document image processing software has a simple, easy-to-use interface for quickly iterating and seeing results. It also lets you build objects independently of one another to get tight results, then combine those results in specific and powerful ways to get the exact data you’re after.
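The idea of building small pieces and then combining them can be sketched with Python’s standard `re` module. This is not Grooper’s syntax or interface; the pattern names and the invoice number format below are hypothetical, chosen only to show two independently written patterns being joined into one.

```python
import re

# Two hypothetical building blocks, each small enough to test on its own.
LABEL = r"Invoice\s*(?:No\.?|Number|#)"   # the words that introduce the value
VALUE = r"[A-Z]{2}-\d{5}"                 # the assumed shape of the value itself

# Combine the pieces: the label gives context, and only the value is captured.
INVOICE_NUMBER = re.compile(
    rf"{LABEL}\s*:?\s*(?P<value>{VALUE})", re.IGNORECASE
)


def find_invoice_number(text):
    """Return the invoice number that follows a label, or None if absent."""
    match = INVOICE_NUMBER.search(text)
    return match.group("value") if match else None


found = find_invoice_number("Invoice No: AB-12345 Date 12/12/2019")
```

Because each piece is tight on its own, the combined pattern skips values that merely look right but appear without the label.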

It’s like the spread offense, where multiple receivers give you multiple angles of attack.

Good Coaching: Overcoming The Variability With Fuzzy Regular Expressions

I won’t bore you with explanations of distance algorithms, but most of you have probably experienced fuzzy logic in your life. If you’ve ever typed something into a search engine and it corrected you with a suggestion, that is a type of fuzzy logic. Spell correction as you compose a text is another form.

But, again, Grooper goes further than simple fuzzy logic, and combines it with the power of pattern matching gained from Regular Expression. And so, Fuzzy RegEx is born.

The better you get at writing powerful patterns, the more accurate your results will be. The whole idea is that you’re telling the system, as specifically as possible, what you want to find. When OCR throws junk at you, the system can see past the mess and give you what you need. When it encounters characters in the text that don’t match your pattern, it transforms them into what you need. When it replaces a character, it tells you how confident it is that what it found matches what you want.
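For the curious, here is a rough sketch of matching with a confidence score, written in plain Python. It uses ordinary Levenshtein edit distance against a fixed word rather than a true fuzzy regular expression, and the function names and confidence formula are my assumptions, not Grooper’s actual scoring.

```python
def edit_distance(a, b):
    """Levenshtein distance: each insert, delete, or substitution costs 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def fuzzy_find(expected, ocr_text):
    """Return (expected, found, confidence) for the closest word in ocr_text.

    Confidence is 1.0 for an exact match and drops as more edits are needed.
    """
    best_word, best_confidence = None, 0.0
    for word in ocr_text.split():
        distance = edit_distance(expected, word)
        confidence = 1 - distance / max(len(expected), len(word))
        if confidence > best_confidence:
            best_word, best_confidence = word, confidence
    return expected, best_word, best_confidence


# OCR misread "Invoice" as "lnvo1ce": two characters must be replaced.
corrected, found, confidence = fuzzy_find("Invoice", "lnvo1ce Number 12345")
```

Here the junk word is still found and reported as the expected word, and the two replaced characters out of seven are reflected in a lower confidence.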

Okay, now let’s take this concept even further. If I don’t want to allow false positives in my returned results, I tell Grooper not to accept anything below a confidence threshold. Combine this with the ability to adjust how the system scores matches, so that swaps of arbitrary characters score lower, and you’re in complete control of getting what you want.
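Here is one way the threshold and the adjustable scoring could fit together, again as a plain Python sketch. The look-alike table, the 0.25 cost, and the 0.9 threshold are invented for illustration; I’m assuming that common OCR mix-ups (like the letter O and the digit 0) are cheap to swap while arbitrary swaps pay full price.

```python
# Hypothetical scoring table: common OCR look-alikes are cheap to swap.
LOOKALIKE_COST = 0.25
LOOKALIKES = {
    frozenset("O0"), frozenset("l1"), frozenset("I1"),
    frozenset("S5"), frozenset("B8"),
}


def swap_cost(a, b):
    """Cost of reading character b where a was expected."""
    if a == b:
        return 0.0
    return LOOKALIKE_COST if frozenset((a, b)) in LOOKALIKES else 1.0


def weighted_distance(expected, found):
    """Edit distance where substitutions use swap_cost; inserts/deletes cost 1."""
    previous = [float(j) for j in range(len(found) + 1)]
    for i, char_e in enumerate(expected, start=1):
        current = [float(i)]
        for j, char_f in enumerate(found, start=1):
            current.append(min(
                previous[j] + 1.0,
                current[j - 1] + 1.0,
                previous[j - 1] + swap_cost(char_e, char_f),
            ))
        previous = current
    return previous[-1]


def confidence(expected, found):
    """1.0 for an exact match, lower as the weighted distance grows."""
    longest = max(len(expected), len(found), 1)
    return 1 - weighted_distance(expected, found) / longest


def accept(expected, found, threshold=0.9):
    """Reject anything scoring below the confidence threshold."""
    return confidence(expected, found) >= threshold


lookalike_ok = accept("TOTAL", "T0TAL")   # O read as 0: a cheap swap
arbitrary_ok = accept("TOTAL", "TXTAL")   # O read as X: a full-price swap
```

With these made-up numbers, the look-alike misread clears the threshold and the arbitrary one does not, which is the kind of control over false positives described above.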

Good coaches are made or broken in the intensity and randomness of the 4th quarter of a game, and the choices they make to get around those things. Good data is broken by the random problems presented by OCR, and Fuzzy RegEx lets you get around those problems.

The Sweet Taste Of Victory with the Best Document Image Processing Software

A successful, well-implemented system is the true key to repeated, sustainable success. This does not happen by accident, or without a lot of experience and hard work.