What is Document Imaging Software? How it Works in 4 Steps

Written by Tim McMullin | November 16, 2022

Document imaging software converts paper documents into electronic form. That doesn’t completely explain what it is, however.

What is Document Imaging Software?

Document imaging software is a category of solutions that use a set of processes and technologies to convert, or digitize, paper documents into easy-to-use electronic formats such as PDF. The processes include document preparation, document scanning, data validation, and data exporting.

The technologies that make document imaging so effective include: image processing, scanning software, document separation, OCR, document classification, document routing, handwriting recognition, and AI.

The basic software has transformed and become a staple of document automation solutions. We'll go through that in this guide to help you understand what document imaging software is, where it came from, and where it’s going.

How Document Imaging Software Works in 4 Steps

Step 1: Document Preparation — Don't Underestimate It!

To have document imaging, a document has to be scanned. But before that, it has to be “prepared” or “prepped.” One of the most overlooked areas of document scanning is document preparation.

Scanners handle same-size documents very well. But they do not handle a mixed batch of document sizes well, because the paper feed guides are needed to keep the paper straight as it goes through the scanner.

With this in mind, sort similar-size documents together so the scanner runs more efficiently. If mixed-size documents are scanned together, the scanner operator will have to stop more often to re-scan skewed (crooked) pages or to resolve misfeeds and jams.

Simply stated, document preparation is a vital step in a successful document imaging software solution.

Here are 4 questions you need to answer about your documents:

  1. What sizes of paper are your documents?
  2. Are there sticky notes on your documents?
  3. Are there paper clips or staples on your documents?
  4. Are your documents uniform in thickness? For example, are there varying paper weights, such as onion skin and bond paper?

Step 2: Document Scanning — What Really Makes Document Imaging

After documents are prepared, they are ready to be scanned. Document imaging is about getting the data off the document as quickly and easily as possible.

That requires two things:

  1. OCR and
  2. Some sort of manual Key-From-Image (KFI) capability

The document is scanned and automatically saved by the software. The next stage is automated and uses Optical Character Recognition (OCR) to convert the scanned image into usable text.

IMPORTANT NOTE: A system that does not have OCR built into it (or the ability to integrate directly with some sort of OCR engine) is not a document imaging solution.

Two Ways that Document Imaging Uses OCR Software

OCR is used in two ways:

1) The first way is to find snippets of text data on the scanned document. For instance, in invoice scanning, it's common to look for an invoice date, an invoice number, a total, and a few other fields.

There are several different ways this gets done. But the result is key-value pairings of data — where you find a tag (a key) like "invoice date" as a keyword, and then a corresponding piece of data that is in proximity to the key — the actual date of the invoice.

2) The other OCR option is to create a full-text conversion of the entire image. This can be used to create a searchable PDF.

The best systems out there do not differentiate between full-text and zonal (snippet) OCR. They OCR the document, and then focus on analyzing text.
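To make the key-value idea concrete, here is a minimal sketch in Python. It assumes the OCR step has already produced plain text, and it looks for a label (the key) followed closely by something shaped like the expected value. The field names, the label spellings, the value patterns, and the `MAX_GAP` limit are all illustrative assumptions, not how any particular product does it.

```python
import re

# Hypothetical labels and value shapes; real invoices vary widely.
FIELD_PATTERNS = {
    "invoice_number": (r"invoice\s*(?:no\.?|number|#)", r"[A-Z0-9][A-Z0-9-]{2,}"),
    "invoice_date": (r"invoice\s*date", r"\d{1,2}/\d{1,2}/\d{2,4}"),
    "total": (r"\btotal(?:\s*due)?", r"\$?\d[\d,]*\.\d{2}"),
}

MAX_GAP = 20  # assumed limit on non-word characters between key and value


def extract_fields(ocr_text: str) -> dict:
    """Return {field: value} for each key that has a matching value nearby."""
    found = {}
    for field, (key_re, value_re) in FIELD_PATTERNS.items():
        # Key, then a short run of punctuation/whitespace, then the value.
        pattern = re.compile(
            rf"(?:{key_re})\W{{0,{MAX_GAP}}}?({value_re})", re.IGNORECASE
        )
        match = pattern.search(ocr_text)
        if match:
            found[field] = match.group(1)
    return found


sample = (
    "ACME Corp\n"
    "Invoice Number: INV-1001\n"
    "Invoice Date: 11/16/2022\n"
    "Subtotal: $1,000.00\n"
    "Total Due: $1,250.00\n"
)
print(extract_fields(sample))
```

Because this works on the full OCR text rather than on a fixed zone of the page, it also illustrates the "OCR first, analyze text second" approach described above.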

Step 3: Data Validation — What Separates Old Systems from Modern & Intelligent Systems

Once text has been created from the scanned image, it needs to be validated. This is the most important and most time-consuming part of document imaging.

It's important to know that OCR is not a perfect technology. Sometimes, OCR gives you data that is just wrong, but the OCR engine is convinced the data is accurate. A lot of systems have no way to correct bad OCR automatically. They rely on a human using the Key-From-Image (KFI) interface to fix errors.

This is the way data validation has been done historically in document imaging, and it's really what separates older document imaging systems from "intelligent document processing" systems. We'll discuss those in the "what's next" section of this blog.

How Document Imaging Software Can Help with Data Validation

There are various ways document imaging software can help facilitate the validation of OCR'd data:

Validation With External Databases

Validation against external databases is one of the most powerful ways to validate data. Some document imaging software products include a database "lookup" (a SQL query) against a target database. This can be used to validate a piece of OCR'd data, or it can be used to find additional information about the document.

For instance, the system can look up a PO number and retrieve the fields related to that PO. This type of functionality greatly improves the overall accuracy and automation of the document imaging system.
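A database lookup like this can be sketched in a few lines of Python. The example below uses an in-memory SQLite database as a stand-in for the real target system; the `purchase_orders` table, its columns, and the sample row are assumptions made up for illustration.

```python
import sqlite3


def lookup_po(conn: sqlite3.Connection, po_number: str):
    """Return the PO row as a dict, or None if the OCR'd number is unknown."""
    row = conn.execute(
        "SELECT po_number, vendor, amount FROM purchase_orders WHERE po_number = ?",
        (po_number,),
    ).fetchone()
    if row is None:
        return None
    return {"po_number": row[0], "vendor": row[1], "amount": row[2]}


# Stand-in for the real target database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase_orders (po_number TEXT PRIMARY KEY, vendor TEXT, amount REAL)"
)
conn.execute("INSERT INTO purchase_orders VALUES ('PO-4471', 'Acme Supply', 1250.00)")

print(lookup_po(conn, "PO-4471"))  # a PO the database knows about
print(lookup_po(conn, "P0-4471"))  # OCR read the letter O as a zero
```

A hit both confirms the OCR'd value and brings back related fields, such as the vendor. A miss is a signal to send the field to a human for review.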

Validation With Calculations: How to Make it Fast

There are several other ways validation can be assisted as well:

  1. Calculations can be done to validate tabular data
  2. Snippets of code, called "scripts," can be written to validate complex data that includes a checksum value, such as a bank account number or credit card number.
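Both kinds of check can be sketched in Python. The first function applies the Luhn checksum, which is the check-digit scheme used by payment card numbers. The second compares the sum of extracted line items with the stated total. The function names and the sample values are illustrative assumptions.

```python
from decimal import Decimal


def luhn_valid(number: str) -> bool:
    """Check a card-style number against the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def line_items_match_total(line_amounts, stated_total) -> bool:
    """Check that extracted line amounts add up to the extracted total."""
    # Decimal avoids floating-point rounding surprises with money.
    return sum(Decimal(a) for a in line_amounts) == Decimal(stated_total)


print(luhn_valid("4111 1111 1111 1111"))  # well-known test card number
print(line_items_match_total(["100.00", "25.50", "4.50"], "130.00"))
```

Checks like these catch a single misread digit without any human involvement, so the operator only sees the fields that fail.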

The important part is that the validation operator has an intuitive interface to help them quickly and easily find and fix errors.

The validation / verification / indexing stage typically runs on a separate machine from the scan station. This has historically been done to optimize the scanner speed to maximize that hardware investment.

In decentralized environments, a more transactional setup is needed. Most document imaging software today will allow for either type of deployment:

  • For batch scanning, the “assembly line” approach works best
  • For distributed environments, where the volumes of paper per location are smaller, a single system that does scanning and indexing/validation works well

Why a Great Validation Interface is Important for Document Imaging

Validation interfaces are designed for high-speed data entry operators. This means that the index operator should be able to do their work without the assistance of a mouse. There should be numerous keyboard shortcuts that help operators navigate the various fields and screens.

The validation interface should also take advantage of color to help the operator quickly find errors. For instance, green for good, and red for bad, highlighted around a field that needs attention.

Step 4: Export Images and Indexed Data

Once the data has been validated, the last step is to export the images and corresponding index data.

A lot of the other articles on document imaging software refer to document retrieval as part of the same system...but this is not the case.

Why do I say this? Because while there are document management systems that have some rudimentary scanning capabilities built-in, there are no document imaging software systems that do document retrieval. Document imaging software adds value to document management, workflow, and RPA solutions.


How to Find the Right Document Imaging Software

To find the right document imaging software, you need to understand how your organization will use document imaging.

There are a number of questions you need to answer, including:

  • What kind of document volume will be processed?
  • Is this just for one department, or is document imaging a part of an entire enterprise automation strategy?
  • Will departments do their own scanning, or will you form a centralized scanning department, like a center of excellence dedicated to converting documents into data?
  • If there is a desire for centralized capture, how will the documents get to my facility?
  • Can my remote offices do all of the work, some of the work or none of the work?

Once you have these answers, you can start to do research on the types of products that meet your needs.

The next section will highlight technologies used in document imaging software. This is important because not all products use all the technologies listed below.

Once you know what your criteria are, you can then focus on finding products that have the technology that you want and need.


8 Technologies Used in the Best Document Imaging Software

As you may have noticed, document imaging is a collection of technologies designed to help automate the conversion of documents into text and data.

While this list won’t be comprehensive, it should help you understand the different technologies that drive document imaging, so you can find a product that meets your specific requirements.

#1: Image Processing - Turn Off the Noise

Image processing refers to the set of algorithms that are used on a scanned document to clean it up and optimize it for OCR recognition.

Why? Well, OCR does better with less “noise.” Non-text objects such as lines, shaded areas, and specks can be removed to “clean up” a document before it is OCR’d.

If you are scanning paper documents, it is imperative to have a technology that is using modern image processing algorithms. Computer Vision (a branch of Artificial Intelligence) has come a long way in the last two decades. Products need to be using powerful image processing to get the best data from OCR engines.
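As a rough illustration of what "removing specks" means, here is a very small despeckle sketch in pure Python. It assumes the scan has already been reduced to a black-and-white grid (1 for ink, 0 for paper) and removes any ink pixel with no ink neighbor. Production image processing is far more sophisticated than this; the function name and the isolated-pixel rule are assumptions chosen to keep the example short.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated black pixels removed.

    `image` is a list of rows; 1 is a black (ink) pixel and 0 is white.
    A black pixel with no black neighbor in the 8 surrounding cells is
    treated as a speck and turned white.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbor = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbor:
                cleaned[y][x] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],  # lone pixel in the last column is a speck
    [0, 0, 0, 0, 0],
]
for row in despeckle(page):
    print(row)
```

The 2x2 block of ink survives because its pixels touch each other, while the lone pixel is cleared.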

#2: Document Scanning Software - A Great Interface is Important

All document imaging systems should have a quality interface for scanning. In addition to supporting the TWAIN or ISIS drivers covered in the scanner section of this guide, the scanning interface should give a scan operator functionality to manipulate an image after it has been scanned.

The only time a page should have to be rescanned is if it was folded or wrinkled going through the scanner resulting in a bad scan. Modern scanning software scans in color or greyscale and allows an operator to quickly and easily manipulate an image for the best image quality.

IMPORTANT NOTE: This technology should not be in the scanner driver! Instead, it needs to live inside the document scanning software for the best functionality.

The document scanning portion of the product should allow for multiple image sorting, drag-and-drop reordering of images, and ad-hoc image quality validation.

#3: Document Separation - All About the Barcodes

Document imaging software should support barcodes and patch codes at scan time for document separation. While newer technology is available for separation, a lot of documents already have barcodes that can be leveraged for separation.

Document separation remains one of the more difficult parts of a document imaging solution. Production-level software should have multiple ways to separate documents both automatically and manually.
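Separator-based splitting is simple to sketch in Python. The example assumes each scanned page is represented as a small dictionary and that a separator sheet carries a barcode value of "SEP". Both the page structure and the "SEP" value are hypothetical; real systems detect patch codes and barcodes from the image itself.

```python
def split_on_separators(pages, is_separator):
    """Group a flat stream of scanned pages into documents.

    A separator page starts a new document and is itself dropped.
    """
    documents = []
    current = []
    for page in pages:
        if is_separator(page):
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


# Hypothetical scanned batch: page number plus any barcode value found.
batch = [
    {"page": 1, "barcode": "SEP"},
    {"page": 2, "barcode": None},
    {"page": 3, "barcode": None},
    {"page": 4, "barcode": "SEP"},
    {"page": 5, "barcode": None},
]

docs = split_on_separators(batch, lambda p: p["barcode"] == "SEP")
print([[p["page"] for p in doc] for doc in docs])
```

Passing the separator test in as a function keeps the splitting logic the same whether the signal is a barcode, a patch code, or a manual mark from an operator.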

#4: OCR Software — Don't Go Cheap

As mentioned above, document imaging systems must have OCR built into the software.

The OCR engine in question should be a robust, production-level engine, not one that is found free on the Internet like Tesseract or any of the other open-source engines. Those engines just haven’t kept up with the technological developments over the last decade.

Ideally, the document imaging software should have multiple OCR engines. Why? When more than one engine is used, better data recognition and extraction is the result.
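How the results from several engines get combined depends on the product. One simple possibility, shown here purely as an assumption, is a majority vote: if most engines agree on a value, accept it, and otherwise flag the field for a human. The function name and the sample readings are made up for illustration.

```python
from collections import Counter


def vote_on_field(readings):
    """Pick the most common reading and say whether it needs human review.

    `readings` holds one string per OCR engine for the same field.
    Returns (value, needs_review).
    """
    cleaned = [text.strip() for text in readings if text and text.strip()]
    if not cleaned:
        return None, True
    best, best_count = Counter(cleaned).most_common(1)[0]
    # Require a strict majority of all engines, not just of non-empty results.
    needs_review = best_count * 2 <= len(readings)
    return best, needs_review


print(vote_on_field(["INV-1001", "INV-1001", "INV-I001"]))  # two of three agree
print(vote_on_field(["1250.00", "1280.00"]))                # engines disagree
```

With only one engine there is nothing to compare against, which is one way to see why a second engine can improve the result.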

Additionally, modern document imaging software should also integrate with newer cloud OCR engines, such as:

  • Microsoft
  • Google
  • Other large cloud providers

These engines are the cutting edge of quality, and often can yield results that the older, more traditional OCR engines cannot produce.

#5: Document Classification - Barcodes & Text Analysis

Before a document can be extracted, it has to be classified.

Document imaging systems are not able to automatically tell the difference between document types on their own. That is a specific step that has to happen after OCR has been done.

In order to extract data from an invoice, the system has to:

  • First know that the document is an invoice. It can do this by knowing how many pages it is.
  • It knows how many pages a document is through document separation, which relies on proper classification.

Traditionally, separation was done by barcodes and patch codes that were placed on top of a document to be scanned (explained above). This has always been a highly reliable way to separate and classify documents (the barcode can be used for document classification as well), but it added significant labor in the document preparation stage.

For example: If you have 1,000 pages made up of documents that are 2-3 pages long, that's roughly 333 to 500 documents, so you’d have to insert up to 500 barcode separator sheets into the stack.

It was a very time consuming process, but it was still faster than scanning everything and manually sorting through documents after they were scanned.

Is Barcode / Patch Code Separation Still Used in Today's Technology?

Today, classification techniques vary, but can use text analysis or image analysis to perform complex separation and classification.
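As a toy example of text-based classification, the Python sketch below scores the OCR text of a document against a keyword list for each document type and picks the best match. The document types, keyword lists, and the `min_hits` threshold are assumptions for illustration; real products use much richer text and image features.

```python
import re

# Hypothetical keyword lists for two document types.
KEYWORDS = {
    "invoice": {"invoice", "remit", "due", "total"},
    "purchase_order": {"purchase", "order", "ship", "buyer"},
}


def classify(ocr_text: str, min_hits: int = 2):
    """Return the best-matching document type, or None if nothing fits."""
    words = set(re.findall(r"[a-z]+", ocr_text.lower()))
    scores = {doc_type: len(words & kws) for doc_type, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Too few keyword hits means we should not guess.
    return best if scores[best] >= min_hits else None


print(classify("INVOICE\nRemit to: Acme\nTotal due: $10.00"))
print(classify("Purchase Order 4471\nShip to: Warehouse 9"))
print(classify("Dear Sir, thank you for your letter."))
```

Returning `None` for weak matches matters in practice: an unclassified document can go to an operator, while a wrongly classified one sends bad data downstream.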

Still, you want to make sure your modern document imaging software can still perform barcode and patch code separation, because a lot of documents already have these pieces of data on them, and you can take advantage of them pretty easily.

Several “modern” document imaging solutions have removed barcode and patch code detection at scan time. This is not recommended.

#6: Document Routing - Many Options are Available

Documents are passed from one “process” to the next while going through all the different processes we’ve discussed.

But when documents and data are exported...where does the data go?

In years past, documents went into a document management system, which is basically a glorified file cabinet.

But today, the options are as varied as organizations themselves. For this reason, the document export or routing needs to be completely flexible. The export might be to a traditional document management system, or content services system.

But these days it will probably go to a workflow or an RPA bot for additional steps.

Document imaging systems have to support the full range of:

  • Potential repositories
  • Workflow systems
  • Cloud-based systems

The data does not live in the document imaging system. It needs to go into a system of record, or a workflow of some sort for further processing.
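A flexible export step can be as simple as a routing table plus a neutral data format. The Python sketch below assumes a mapping from document type to destination and packages the image reference and index data as JSON. The destination names, the `ROUTES` table, and the payload layout are all hypothetical; the actual hand-off depends on the target system.

```python
import json

# Hypothetical routing table: document type -> destination system.
ROUTES = {
    "invoice": "ap_workflow",
    "contract": "document_management",
}
DEFAULT_ROUTE = "manual_review"


def build_export(doc_type: str, image_path: str, index_data: dict) -> dict:
    """Pair a destination with a JSON payload of the image and its index data."""
    return {
        "destination": ROUTES.get(doc_type, DEFAULT_ROUTE),
        "payload": json.dumps(
            {"image": image_path, "index": index_data}, sort_keys=True
        ),
    }


export = build_export("invoice", "scan_0001.tif", {"invoice_number": "INV-1001"})
print(export["destination"])
print(export["payload"])
```

Keeping the routing rules in data rather than in code makes it easier to add a new repository or workflow later without touching the capture process.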

#7: Handwriting Recognition - Huge Improvements Being Made

Handwriting recognition is one of the areas that has seen the most improvement over the last five years. It is now feasible to get free-form handwriting, even cursive writing, extracted and converted to text.

This is especially useful if you need to process hand-filled forms, applications, or surveys. While this process is very much like OCR, most OCR engines do a poor job on handwriting extraction. This technology is commonly called Intelligent Character Recognition, or ICR.

IMPORTANT NOTE: If you have handwritten documents as part of your requirements, you will need to find a system that uses either Microsoft or Google’s engines. These are cloud-based engines, which can cause some privacy issues, but both vendors work with you to mitigate any potential issues.

Know that the validation requirements (humans) are higher for hand-written documents due to the extreme variation of the way people write. You will have to validate more data in a handwritten document process.

#8: Artificial Intelligence - A Warning

There are a lot of vendors using “AI” as a buzzword. Buyer beware!

IMPORTANT NOTE: Any product that is leaning heavily on “AI” or “Unsupervised Machine Learning” (a branch of AI) should be very cautiously evaluated.

As of 2022, AI has not progressed anywhere near the level of human comprehension for document processing. Systems advertised to “get smarter over time” have been proven to do the opposite.

We encourage you to ask the vendor to explain their implementation of AI to you and how it helps them get better accuracy. If they cannot easily do this, you should understand the reasons — this technology is not ready yet.

Many think document imaging software will never be able to read, classify and extract your documents automatically without pre-configuration.


So What is “Intelligent Document Processing”?

Intelligent document processing is the new name for document imaging software. As systems have become more complex and more documents became digital rather than paper, the processing style for documents has changed.

Intelligent document processing focuses much more on the ability to add AI-based technologies to traditional document imaging. Most intelligent document processing systems are also document imaging software products, but not all.

Is IDP Worse at Processing Paper Documents?

Because of this, many intelligent document processing systems are particularly weak when processing paper documents. This is because they don’t have robust image processing tools to help clean up scanned documents for optimal OCR results. Intelligent document processing systems are also heavily focused on unstructured documents rather than more standard document types.

The critical questions to ask while searching for a document imaging software vendor are the ones that are relevant to your documents and data:

  • Do you have paper scanning needs?
  • Do you need to integrate with email?
  • Where else do your documents come from?
  • What internal systems do you have that hold data that can be used for automatic validation?
  • What target systems do the documents and data need to be routed to for storage and/or future use?

Of course, Grooper is our Intelligent Document Processing product that we built from 36 years of document capture experience. If you’d like to know more about document imaging or intelligent document processing, give us a call today.


How Document Imaging Software Works with Scanners

Document imaging software needs to have a wide range of support for scanners. There are a few ways software can "speak" directly to a document scanner. These are called "drivers" and must be installed with the scanner.

This is probably the area that has changed the most over the last 30 years. Scanners used to require proprietary PC boards to run at their full-rated speed. Kofax, Xonics, and a few others used to make these types of boards.

They were more than just a dumb connection. They had processing chips embedded in them to handle imaging-specific tasks such as compression and image cleanup. They were needed because standard computers of the time couldn’t keep up with a production-level scanner and do additional processing.

3 Ways That Today's Scanners Work with Document Imaging Software

But those days are long gone now. Scanners are connected today via ethernet, Wi-Fi, or USB. But they still need drivers.

There are three basic ways to drive a scanner:

  1. Mac and Windows PCs have built-in drivers for most scanners, but these drivers are not production-worthy.

    Like many generic drivers, they are limited in functionality. Some don't even allow you to change settings. This is a real problem for a production-level document imaging software solution.

  2. ISIS drivers are one of the two recommended options (TWAIN is the other). They were purpose-built in the 1990s to run high-speed document scanners.

    While the specification is technically open, scanner manufacturers have to write to this specification for their scanner to be supported by software that supports ISIS.

    ISIS drivers are the best drivers to get the most out of your scanner, as the specification was designed to only support scanners.

  3. The last type of driver is the TWAIN driver. While TWAIN has come a long way in the last decade, its core flaw is that it was not designed solely as a scanner interface. It also supports digital cameras.

    This is bad because key scanner functionality is often hidden from the end user. While TWAIN offers better control over scanners than a built-in Windows or Mac driver, it often still does not expose the scanner's full functionality.