So, you’re here because you need to OCR text documents for full-page search or data extraction. Great!
OCR software is a bit like a video game, and the problems that get in the way of good OCR are the 'villains'. As you'll see, there are ways to beat these villains to get far better OCR text recognition.
The benefits are:
- Cost savings (reducing manpower needed for manual entry)
- Increasing profits (bringing new solutions to market quicker)
First, What Is OCR?
Most people who have heard of OCR think of it as a feature included with another piece of software to perform word searches. OCR is really just an algorithm that recognizes known characters on a page.
Optical Character Recognition, or OCR, has essentially been around since 1913. First used to convert printed characters into telegraph code and to assist the blind, OCR technology has continued to evolve. Grooper OCR technology is the pinnacle of this evolution, creating an intelligent digital awareness of a document. More on that later...
Using OCR technology, a computer compares the patterns made by letters and numbers on scanned documents to a set of characters stored in the software. It can then automate time-consuming tasks like data entry.
If OCR and the process of digitizing text is like a video game, then patterns make up the rules of the game.
We want our computer software to reliably recognize patterns, or pattern matching won’t work, and we won’t fare very well at this game.
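To make the idea of pattern matching concrete, here is a minimal Python sketch. It is a toy illustration, not how Grooper or any particular OCR engine works: the tiny 3x5 bitmaps in `TEMPLATES` and the function names are made up for this example.

```python
# Hypothetical 3x5 bitmaps standing in for the character set stored in OCR software.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def match_score(glyph, template):
    """Return the fraction of pixels on which a scanned glyph and a template agree."""
    pairs = [(g, t) for g_row, t_row in zip(glyph, template) for g, t in zip(g_row, t_row)]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return (best_character, confidence) for one glyph bitmap."""
    scores = {char: match_score(glyph, tpl) for char, tpl in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


clean_t = ["###", ".#.", ".#.", ".#.", ".#."]
noisy_t = ["###", ".#.", "##.", ".#.", ".#."]  # one stray pixel of scan "noise"

print(recognize(clean_t))
print(recognize(noisy_t))
```

Both glyphs come back as "T", but the noisy one scores a lower confidence. The more noise on the page, the closer that score drifts toward the wrong character.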
How to Beat Poor Scan Quality and Low Contrast Images
As OCR works to recognize patterns, many things confuse the technology and cause problems. To give you an accurate digital conversion, OCR needs black-and-white scans at a high enough resolution.
Having a black-and-white scan creates high image contrast, making the job simpler for OCR. When it comes to resolution, a low-resolution scan of a document creates a lot of "noise" around the letters and numbers, confusing OCR.
Low black-and-white contrast and poor resolution are the villains that we must beat. Thankfully, image processing tools have come to the rescue to help us defeat these villains.
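One common image processing step of this kind is binarization: turning a gray scan into pure black and white. The sketch below uses Otsu's method, a well-known way to pick the cutoff automatically. Whether Grooper uses this particular method is an assumption on our part, and the function names are just for illustration.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark ink from light paper."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue
        if n_light == 0:
            break
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance: large when the two groups are far apart.
        between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(image):
    """Turn a grayscale image (list of rows of 0-255 ints) into pure black and white."""
    t = otsu_threshold([p for row in image for p in row])
    return [[0 if p <= t else 255 for p in row] for row in image]


gray_scan = [
    [200, 190, 60],
    [70, 210, 195],
]
print(binarize(gray_scan))
```

The dark "ink" pixels (60 and 70) become pure black and everything else becomes pure white, which is the high-contrast starting point OCR wants.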
Using these tools will give you great black-and-white images, giving you the best starting point for OCR. Later, fuzzy data extraction will overcome poor resolution on PDFs or other image files. But there are many more tough villains ahead of us to defeat. Lucky for you, we have the cheat codes for this game.
The Tools - or “Cheat Codes” - to Overcome OCR Text Villains
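Here are the villains we're fighting:
- Boxed-in text (bounded regions)
- Segments
- Different-sized fonts and free-floating text
- Multi-column layouts
And here is the master cheat code we can use to beat all of these villains: OCR Synthesis.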
Just like any good cheat code, these tools let you break the rules of the game. They allow Grooper to overcome typical problems associated with full-text OCR. In the past, you got only one pass at an entire page of relatively complex symbols to get it right. No longer.
Let’s take a look at each one.
Level 1 Villain: Boxed-In Text (Also known as Bounded Regions)
Cheat Code: Bounded Region Detection
A bounded region is whitespace surrounded by lines or, simply put, text in a box.
Imagine an invoice, purchase order, or delivery receipt. These kinds of documents are made up of text in tables and boxes.
Using Bounded Region Detection, Grooper finds those boxes, looks only at what’s inside them, and gets a great read on the contents.
Grooper just took what has typically been a nightmare for legacy image cleanup software to remove and used it to its advantage. It’s like a judo master who takes an opponent twice their size charging at them and uses that momentum to flip the opponent to the ground…with just a pinky.
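Here is one way the "find the box, read only what's inside" idea could be sketched in Python. This is a toy under loud assumptions: the page is a grid of characters, `#` stands in for a line pixel, and the flood-fill approach is our own illustration rather than Grooper's actual algorithm.

```python
from collections import deque


def find_bounded_regions(grid):
    """Return the text inside each fully enclosed box on a toy "page".

    `grid` is a list of strings where '#' marks a line pixel. A region counts
    as bounded if a flood fill from inside it never reaches the page edge.
    """
    width = max(len(row) for row in grid)
    grid = [row.ljust(width) for row in grid]
    rows, cols = len(grid), width
    seen = [[False] * cols for _ in range(rows)]
    regions = []

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "#" or seen[r][c]:
                continue
            # Flood fill one connected area of non-line cells.
            queue = deque([(r, c)])
            seen[r][c] = True
            cells, touches_edge = [], False
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_edge = True
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and grid[ny][nx] != "#":
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if touches_edge:
                continue  # open area, not a box
            # Rebuild the box contents row by row.
            by_row = {}
            for y, x in sorted(cells):
                by_row.setdefault(y, []).append(grid[y][x])
            regions.append("\n".join("".join(chars).strip() for chars in by_row.values()))
    return regions


page = [
    "Invoice 1234",
    "##########  #########",
    "# QTY: 3 #  # $9.99 #",
    "##########  #########",
]
print(find_bounded_regions(page))
```

The loose "Invoice 1234" text is ignored, and each box hands back only its own contents. The lines that usually get in the way become the map to the data.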
Level 2 Villain: Segments
Cheat Code: Segment Reprocessing
A segment is a small block or line of text on a page. If any segment gets a low OCR confidence score, Grooper uses Segment Reprocessing to run OCR on that segment a second time.
The outcome is better results for each of these troublesome lines.
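The retry logic can be sketched in a few lines of Python. Everything named here is hypothetical: the `Segment` class, the 0.80 threshold, and the two fake engines are stand-ins, and how Grooper's second pass actually differs from the first is not something this sketch claims to know.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    image: str               # stand-in for the cropped pixels of one line of text
    text: str = ""
    confidence: float = 0.0


def reprocess_low_confidence(segments, first_pass, second_pass, threshold=0.80):
    """OCR every segment once, then re-run only the ones that scored poorly."""
    for seg in segments:
        seg.text, seg.confidence = first_pass(seg.image)
        if seg.confidence < threshold:
            text, confidence = second_pass(seg.image)
            # Keep the second read only if it is actually an improvement.
            if confidence > seg.confidence:
                seg.text, seg.confidence = text, confidence
    return segments


# Fake engines for demonstration: an image is "quality|true text".
def fake_first_pass(image):
    quality, truth = image.split("|", 1)
    if quality == "smudged":
        return truth.replace("o", "0").replace("l", "1"), 0.55
    return truth, 0.97


def fake_second_pass(image):
    _, truth = image.split("|", 1)
    return truth, 0.91


segments = [Segment("clean|Invoice 1234"), Segment("smudged|Total: 40 dollars")]
for seg in reprocess_low_confidence(segments, fake_first_pass, fake_second_pass):
    print(seg.text, seg.confidence)
```

The clean line is read once and left alone. Only the smudged line gets a second read, so the extra work is spent where it pays off.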
Level 3 Villains: Different-Sized Fonts and Free-Floating Text
Cheat Code: Iterative OCR
Documents frequently use different-sized fonts and free-floating text that isn't in alignment.
OCR software reads pages from top to bottom and left to right like we do. Therefore, if fonts on the left are of a different size or are out of alignment with text on the right side, typical OCR generates poor results.
Grooper solves this problem with Iterative OCR, which reads the document multiple times. The first time, it will read everything.
The second time, using Iterative OCR, Grooper will drop out what it read well, and then only read what was previously read poorly.
The dropped-out text will no longer interfere with remaining text, resulting in a much better read. This process repeats until the villain has been defeated.
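The read, drop out, and read again loop might look something like this in Python. It is a simplified sketch, not Grooper's implementation: the fake engine, the 0.80 threshold, and the assumption that confidence falls as more text is left on the page are all invented for illustration.

```python
def iterative_ocr(regions, ocr_pass, threshold=0.80, max_passes=5):
    """Read the page repeatedly, dropping out well-read regions after each pass.

    `regions` maps a region id to whatever the engine needs to read it.
    `ocr_pass` takes the regions still on the page and returns
    {region_id: (text, confidence)}.
    """
    accepted = {}
    remaining = dict(regions)
    for _ in range(max_passes):
        if not remaining:
            break
        results = ocr_pass(remaining)
        good = {rid: text for rid, (text, conf) in results.items() if conf >= threshold}
        if not good:
            break  # nothing improved; stop rather than loop forever
        accepted.update(good)
        for rid in good:
            del remaining[rid]  # "drop out" what was read well
    return accepted, remaining


# Fake engine: each region is (true_text, difficulty), and confidence falls
# as more neighboring text is left on the page to interfere.
def fake_ocr_pass(remaining):
    clutter = len(remaining)
    return {rid: (text, 1.0 - difficulty * clutter)
            for rid, (text, difficulty) in remaining.items()}


page_regions = {
    "header": ("ACME Corp", 0.05),
    "total": ("$40.00", 0.08),
    "note": ("Paid in full", 0.12),
}
accepted, unread = iterative_ocr(page_regions, fake_ocr_pass)
print(accepted, unread)
```

In this toy run, the header is accepted on the first pass, the total on the second, and the note on the third. Each pass has less clutter to contend with than the one before.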
Level 4 Villain: Multi-Column Layouts
Cheat Code: Cellular Validation
A page with two or more columns presents a challenge for typical OCR software.
Text in columns may be different font sizes, or the lines may be offset from one another. Many OCR processes will have a total breakdown and produce very inaccurate results.
However, Grooper uses Cellular Validation to create highly customizable OCR regions by splitting a document into specific rows and columns.
Rows and columns are defined by the Grooper user who understands the document’s structure and layout. Grooper then reads each of the individual rows and columns independently, understanding the difference in each section and how each section relates to the overall document.
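As a rough illustration of splitting a page into user-defined rows and columns, here is a small Python sketch. The page is plain text rather than an image, and `split_into_cells`, `row_breaks`, and `col_breaks` are names made up for this example, not Grooper settings.

```python
def split_into_cells(page_lines, row_breaks, col_breaks):
    """Cut a page into a grid of cells and return the text of each cell.

    `row_breaks` are line indexes and `col_breaks` are character offsets,
    both supplied by someone who knows the document's layout.
    """
    width = max(len(line) for line in page_lines)
    row_edges = [0, *row_breaks, len(page_lines)]
    col_edges = [0, *col_breaks, width]
    cells = {}
    for r, (top, bottom) in enumerate(zip(row_edges, row_edges[1:])):
        for c, (left, right) in enumerate(zip(col_edges, col_edges[1:])):
            parts = [line[left:right].strip() for line in page_lines[top:bottom]]
            cells[(r, c)] = " ".join(p for p in parts if p)
    return cells


page = [
    "Ship to:        Bill to:",
    "12 Elm St       PO Box 9",
    "Qty 3           Total $40",
]
cells = split_into_cells(page, row_breaks=[2], col_breaks=[12])
for position, text in cells.items():
    print(position, text)
```

Read straight across, the first two lines would come out as "Ship to: Bill to: 12 Elm St PO Box 9". Read cell by cell, the shipping address and the billing address stay in their own columns.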
The Master Cheat Code: OCR Synthesis
Grooper combines the data produced from Bounded Region Detection, Segment Reprocessing, Iterative OCR, and Cellular Validation into one logical text flow.
Advanced font awareness re-analyzes spaces, tabs and new line feeds during OCR Synthesis.
Grooper ensures that all characters in a document are not just recognized but are assembled together in logical groupings. This is especially useful for check processing.
OCR Synthesis sets Grooper apart from traditional OCR systems by providing a much-improved foundation for accurate and reliable data capture.
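To picture what "one logical text flow" means, here is a simplified Python sketch that merges positioned text fragments into reading order. The `(x, y, text)` fragment format and the `line_tolerance` value are assumptions for illustration only. Grooper's real synthesis, including its font awareness, is certainly more involved.

```python
def synthesize(fragments, line_tolerance=5):
    """Merge positioned text fragments into a single reading-order text flow.

    Each fragment is (x, y, text), where x and y locate its top-left corner.
    Fragments whose y values are within `line_tolerance` share a line.
    """
    lines = []
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1]["y"]) <= line_tolerance:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    return "\n".join(
        " ".join(text for _, text in sorted(line["words"])) for line in lines
    )


# Hypothetical output from the four earlier techniques, in no particular order.
fragments = [
    (300, 12, "#1234"),     # e.g. from a bounded region
    (10, 40, "Total:"),     # e.g. from a reprocessed segment
    (10, 10, "Invoice"),    # e.g. from an iterative pass
    (300, 42, "$40.00"),    # e.g. from a grid cell
]
print(synthesize(fragments))
```

Even though the fragments arrive out of order and sit at slightly different heights, they come out as two tidy lines: "Invoice #1234" and "Total: $40.00".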
Beating the OCR Game
Grooper has the instant-win button, the buzzer-beating 3-pointer, the ability to instantly learn Kung Fu by plugging into the Matrix, the - well, you get the point.
But here’s the rub: few things in life are ever 100%, and perfect OCR results aren't one of them.
Not even with Grooper’s awesome cheat codes can OCR recognize and accurately convert 100% of a document’s text. But with these tools, or cheat codes, we are far closer to perfect text than we were previously.
I wish you the best in your document processing tasks and happy gaming!