Document imaging software converts paper documents into electronic form. That doesn’t completely explain what it is, however.
The technologies that make document imaging so effective include: image processing, scanning software, document separation, OCR, document classification, document routing, handwriting recognition, and AI.
The basic software has transformed and become a staple of document automation solutions. We'll go through that in this guide to help you understand what document imaging software is, where it came from, and where it’s going.
Scanners handle same-size documents very well. But they do not handle a mixed batch of document sizes well, because the paper feed guides are needed to keep the paper straight as it goes through the scanner.
With this in mind, it is best to sort similar-size documents together to make the scanner more efficient. If mixed-size documents are scanned, the scanner operator will have to stop scanning more often to re-scan any skewed (crooked) documents, or to resolve misfeeds and jams.
Simply stated, document preparation is a vital step in a successful document imaging software solution.
Here are 4 questions you need to answer about your documents:
That requires two things:
The document is scanned and automatically saved by the software. The next stage is automated and uses Optical Character Recognition (OCR) to convert the scanned image into usable text.
IMPORTANT NOTE: A system that does not have OCR built into it (or the ability to integrate directly with some sort of OCR engine) is not a document imaging solution.
1) The first way is to find snippets of text data on the scanned document. For instance, in invoice scanning, it's common to look for an invoice date, an invoice number, a total, and a few other fields.
There are several different ways this gets done. But the result is key-value pairings of data: you find a tag (a key) like "invoice date" as a keyword, and then a corresponding piece of data in proximity to the key, such as the actual date of the invoice.
2) The other OCR option is to create a full-text conversion of the entire image. This can be used to create a searchable PDF.
The best systems out there do not differentiate between full-text and zonal (snippet) OCR. They OCR the document, and then focus on analyzing text.
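To make the "OCR everything, then analyze the text" idea concrete, here is a minimal sketch of key-value extraction over full-text OCR output. It assumes the OCR engine has already returned plain text. The field names, label wording, and date format in `FIELD_PATTERNS` are hypothetical and chosen only for illustration; a production system would use far more robust matching than a few regular expressions.

```python
import re

# Hypothetical key patterns: each looks for a label (the key) and captures
# the value that appears right after it on the page.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:number|no\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(
        r"invoice\s*date\s*[:\-]?\s*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE
    ),
    "total": re.compile(
        r"\btotal\s*(?:due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_fields(ocr_text):
    """Return the first value found next to each key, or None if the key is absent."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        results[name] = match.group(1) if match else None
    return results


# Stand-in for the text an OCR engine might return for a scanned invoice.
sample_text = """ACME Supply Co.
Invoice Number: INV-10023
Invoice Date: 03/14/2022
Subtotal: $1,150.00
Total: $1,234.50
"""

fields = extract_fields(sample_text)
print(fields)
```

Because the same full text feeds both the searchable PDF and the field extraction, there is no need to run a separate zonal OCR pass in this sketch.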
It's important to know that OCR is not a perfect technology. Sometimes, OCR gives you data that is just wrong, but the OCR engine is convinced the data is accurate. A lot of systems have no way to correct bad OCR automatically. They rely on the human key from image (KFI) interface to fix errors.
This is the way data validation has been done historically in document imaging, and it's really what separates older document imaging systems from "intelligent document processing" systems. We'll discuss those in the "what's next" section of this blog.
There are various ways document imaging software can help facilitate the validation of OCR'd data:
For instance, the system can look up a PO number and retrieve the fields related to it. This type of functionality greatly improves the overall accuracy and automation of the document imaging system.
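A PO lookup of this kind might look roughly like the sketch below. The `PURCHASE_ORDERS` dictionary is a hypothetical stand-in for whatever ERP or database actually holds the purchase orders, and the field names are assumptions made for the example.

```python
# Hypothetical purchase-order records; in practice these would come from
# an ERP system or database rather than a hard-coded dictionary.
PURCHASE_ORDERS = {
    "PO-4471": {"vendor": "ACME Supply Co.", "amount": "1234.50"},
    "PO-4472": {"vendor": "Northwind Traders", "amount": "980.00"},
}


def validate_with_po_lookup(extracted):
    """Fill in missing fields and cross-check OCR values against the PO record."""
    record = PURCHASE_ORDERS.get(extracted.get("po_number", ""))
    if record is None:
        # Unknown PO number: send the document to a human operator.
        return {"status": "needs_review", "mismatches": [], "fields": dict(extracted)}

    fields = dict(extracted)
    mismatches = []
    for name, expected in record.items():
        if fields.get(name) is None:
            fields[name] = expected   # fill a field OCR did not capture
        elif fields[name] != expected:
            mismatches.append(name)   # OCR value disagrees with the record

    status = "needs_review" if mismatches else "ok"
    return {"status": status, "mismatches": mismatches, "fields": fields}


good = validate_with_po_lookup({"po_number": "PO-4471", "amount": "1234.50"})
bad = validate_with_po_lookup({"po_number": "PO-4471", "amount": "1284.50"})
print(good["status"], good["fields"]["vendor"])
print(bad["status"], bad["mismatches"])
```

Only the documents flagged `needs_review` would need to reach the validation operator in this setup.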
The important part is that the validation operator has an intuitive interface to help them quickly and easily find and fix errors.
The validation / verification / indexing stage typically runs on a separate machine from the scan station. This has historically been done to optimize the scanner speed to maximize that hardware investment.
In de-centralized environments, a more transactional setup is needed. Most document imaging software today will allow for either type of deployment:
The validation interface should also take advantage of color to help the operator quickly find errors. For instance, green for good, and red for bad, highlighted around a field that needs attention.
A lot of the other articles on document imaging software refer to document retrieval as part of the same system...but this is not the case.
Why do I say this? Because while there are document management systems that have some rudimentary scanning capabilities built-in, there are no document imaging software systems that do document retrieval. Document imaging software adds value to document management, workflow, and RPA solutions.
There are a number of questions you need to answer, including:
Once you have these answers, you can start to do research on the types of products that meet your needs.
The next section will highlight technologies used in document imaging software. This is important because not all products use all the technologies listed below.
Once you know what your criteria are, you can then focus on finding products that have the technology that you want and need.
While this list won’t be comprehensive, it should help you understand the different technologies that drive document imaging, so you can find a product that meets your specific requirements.
Why? Well, OCR does better with less “noise.” Non-text objects such as lines, shaded areas, and specks can be removed to “clean up” a document before it is OCR’d.
If you are scanning paper documents, it is imperative to have a technology that is using modern image processing algorithms. Computer Vision (a branch of Artificial Intelligence) has come a long way in the last two decades. Products need to be using powerful image processing to get the best data from OCR engines.
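As a toy illustration of what "removing specks" means, the sketch below clears isolated black pixels from a tiny black-and-white image represented as nested lists. This is a deliberately simplified assumption: real products work on full-resolution images with much more sophisticated algorithms, and the `despeckle` function here is not taken from any particular product.

```python
def despeckle(image):
    """Remove isolated black pixels (1s) that have no black neighbors.

    `image` is a list of rows, each a list of 0 (white) or 1 (black).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # copy so the input is left untouched

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            # Look at the up-to-8 surrounding pixels, staying inside the image.
            has_neighbor = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbor:
                cleaned[y][x] = 0  # a lone dot is treated as scanner noise
    return cleaned


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # isolated speck
    [0, 0, 0, 1, 1],  # part of a character stroke
    [0, 0, 0, 1, 0],
]
clean = despeckle(noisy)
for row in clean:
    print(row)
```

The speck disappears while the connected stroke survives, which is the kind of cleanup that gives an OCR engine less noise to misread.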
The only time a page should have to be rescanned is if it was folded or wrinkled going through the scanner resulting in a bad scan. Modern scanning software scans in color or greyscale and allows an operator to quickly and easily manipulate an image for the best image quality.
IMPORTANT NOTE: This technology should not be in the scanner driver! Instead, it needs to live inside the document scanning software for the best functionality.
The document scanning portion of the product should allow for multiple image sorting, drag-and-drop reordering of images, and ad-hoc image quality validation.
Document separation remains one of the more difficult parts of a document imaging solution. Production-level software should have multiple ways to separate documents both automatically and manually.
The OCR engine in question should be a robust, production-level engine, not one that is found free on the Internet like Tesseract or any of the other open-source engines. Those engines just haven’t kept up with the technological developments over the last decade.
Ideally, the document imaging software should have multiple OCR engines. Why? When more than one engine is used, better data recognition and extraction is the result.
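One simple way to combine several engines is to let them vote on each field. The sketch below assumes each engine has already returned its own reading of the same field; the `vote_on_field` helper and the sample readings are hypothetical, and real products may weigh engines differently.

```python
from collections import Counter


def vote_on_field(candidates):
    """Pick the value most engines agree on and report whether it has a majority."""
    values = [c for c in candidates if c]  # ignore engines that returned nothing
    if not values:
        return None, False
    value, count = Counter(values).most_common(1)[0]
    agreed = count > len(values) / 2
    return value, agreed


# Three hypothetical engine readings of the same invoice number.
# The second engine confused zeros with the letter O.
readings = ["INV-10023", "INV-1OO23", "INV-10023"]
value, agreed = vote_on_field(readings)
print(value, agreed)
```

When the engines disagree and no value wins a majority, the field can be flagged for a human to check rather than trusted blindly.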
Additionally, modern document imaging software should also integrate with newer cloud OCR engines, such as:
These engines are the cutting edge of quality, and can often yield results that the older, more traditional OCR engines cannot produce.
Document imaging systems are not able to automatically tell the difference between document types on their own. That is a specific step that has to happen after OCR has been done.
In order to extract data from an invoice, the system has to:
Traditionally, separation was done by barcodes and patch codes that were placed on top of a document to be scanned (explained above). This has always been a highly reliable way to separate and classify documents (the barcode can be used for document classification as well), but it added significant labor in the document preparation stage.
For example: If you have 1,000 pages made up of documents that are 2-3 pages long, you’d have to insert up to 500 pages with barcodes into that batch.
It was a very time consuming process, but it was still faster than scanning everything and manually sorting through documents after they were scanned.
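Once the separator sheets are in place, the software side of barcode separation is simple. The sketch below assumes barcode or patch code detection has already happened and that each separator sheet shows up in the batch as the hypothetical marker `SEPARATOR`; everything else is a regular page.

```python
# Hypothetical marker for a page on which a separator barcode or patch code
# was detected at scan time.
SEPARATOR = "SEPARATOR"


def split_on_separators(pages):
    """Group pages into documents; each separator starts a new document and is dropped."""
    documents = []
    current = None
    for page in pages:
        if page == SEPARATOR:
            current = []
            documents.append(current)
        else:
            if current is None:
                # Pages before the first separator still form a document.
                current = []
                documents.append(current)
            current.append(page)
    # Drop empty documents caused by back-to-back separator sheets.
    return [doc for doc in documents if doc]


batch = [SEPARATOR, "a1", "a2", SEPARATOR, "b1", "b2", "b3", SEPARATOR, "c1", "c2"]
documents = split_on_separators(batch)
print(documents)
```

The code is trivial; the cost of this approach is the manual labor of inserting a sheet in front of every document.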
Today, classification techniques vary, but can use text analysis or image analysis to perform complex separation and classification.
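To give a feel for text-based classification, here is a minimal keyword-scoring sketch. The document types, keyword lists, and `min_score` threshold are all hypothetical, and commercial systems use considerably richer text and image analysis than simple keyword counts.

```python
# Hypothetical keyword lists per document type.
KEYWORDS = {
    "invoice": ["invoice", "amount due", "remit to"],
    "purchase_order": ["purchase order", "ship to", "po number"],
    "contract": ["agreement", "hereinafter", "party"],
}


def classify(ocr_text, min_score=2):
    """Return the document type whose keywords appear most often, or 'unknown'."""
    text = ocr_text.lower()
    scores = {
        doc_type: sum(text.count(keyword) for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best_type = max(scores, key=scores.get)
    if scores[best_type] < min_score:
        return "unknown"  # not enough evidence; leave it for a human
    return best_type


print(classify("INVOICE\nInvoice Number: 10023\nAmount Due: $1,234.50\nRemit to: ACME"))
print(classify("Purchase Order\nPO Number: 4471\nShip to: Dock 4"))
print(classify("Hello world"))
```

A change in the classified type from one page to the next can also serve as a separation signal, which is how classification and separation end up working together.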
Still, you want to make sure your modern document imaging software can perform barcode and patch code separation, because a lot of documents already have these pieces of data on them, and you can take advantage of them pretty easily.
Several “modern” document imaging solutions have removed barcode and patch code detection at scan time. This is not recommended.
But when documents and data are exported...where does the data go?
In years past, documents went into a document management system, which is basically a glorified file cabinet.
But today, the options are as varied as organizations themselves. For this reason, the document export or routing needs to be completely flexible. The export might be to a traditional document management system, or content services system.
But these days it will probably go to a workflow or an RPA bot for additional steps.
Document imaging systems have to support the full range of:
The data does not live in the document imaging system. It needs to go into a system of record, or a workflow of some sort for further processing.
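Flexible routing can be pictured as a small table that maps each document type to one or more destinations. In the sketch below, the exporter functions are stand-ins that only return strings; the destination names, routing rules, and document fields are hypothetical, and a real export would call the target system's own interface.

```python
import json


def to_document_management(doc):
    # Stand-in for a call to a document management or content services system.
    return f"DMS <- {doc['id']}"


def to_workflow(doc):
    # Stand-in for handing the extracted data to a workflow or RPA bot as JSON.
    return "WORKFLOW <- " + json.dumps(doc["fields"], sort_keys=True)


EXPORTERS = {"dms": to_document_management, "workflow": to_workflow}

# Hypothetical routing rules: document type -> ordered list of destinations.
ROUTES = {
    "invoice": ["dms", "workflow"],
    "contract": ["dms"],
}
DEFAULT_ROUTE = ["dms"]


def export(doc):
    """Send a document to every destination configured for its type."""
    destinations = ROUTES.get(doc["type"], DEFAULT_ROUTE)
    return [EXPORTERS[name](doc) for name in destinations]


invoice = {"id": "doc-1", "type": "invoice", "fields": {"total": "1234.50"}}
results = export(invoice)
for line in results:
    print(line)
```

Adding a new destination means registering one more function and updating the routing table, without touching the capture side of the system.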
This is especially useful if you need to process hand-filled forms, applications, or surveys. While this process is very much like OCR, most OCR engines do a poor job on handwriting extraction. This technology is commonly called Intelligent Character Recognition, or ICR.
IMPORTANT NOTE: If you have handwritten documents as part of your requirements, you will need to find a system that uses either Microsoft’s or Google’s engines. These are cloud-based engines, which can cause some privacy issues, but both vendors work with you to mitigate any potential issues.
Know that the human validation requirements are higher for handwritten documents due to the extreme variation in the way people write. You will have to validate more data in a handwritten document process.
IMPORTANT NOTE: Any product that is leaning heavily on “AI” or “Unsupervised Machine Learning” (a branch of AI) should be very cautiously evaluated.
As of 2022, AI has not progressed anywhere near the level of human comprehension for document processing. Systems advertised to “get smarter over time” have been proven to do the opposite.
We encourage you to ask the vendor to explain their implementation of AI to you and how it helps them get better accuracy. If they cannot easily do this, you should understand the reason: this technology is not ready yet.
Many think document imaging software will never be able to read, classify and extract your documents automatically without pre-configuration.
Intelligent document processing is the new name for document imaging software. As systems have become more complex and more documents became digital rather than paper, the processing style for documents has changed.
Intelligent document processing focuses much more on the ability to add AI-based technologies to traditional document imaging. Most intelligent document processing systems are also document imaging software products, but not all.
Because of this, many intelligent document processing systems are particularly weak when processing paper documents. This is because they don’t have robust image processing tools to help clean up scanned documents for optimal OCR results. Intelligent document processing systems are also heavily focused on unstructured documents rather than more standard document types.
The critical questions to ask while searching for a document imaging software vendor are the ones that are relevant to your documents and data:
This is probably the area that has changed the most over the last 30 years. Scanners used to require proprietary PC boards to run at their full-rated speed. Kofax, Xonics, and a few others used to make these types of boards.
They were more than just a dumb connection: they had processing chips embedded in them to handle imaging-specific tasks such as compression and image cleanup. Standard computers of the time couldn’t keep up with a production-level scanner and do additional processing on their own.
There are three basic ways to drive a scanner: