What is Document Processing Software?

Written by Tim McMullin | July 26, 2022

Document processing software is designed to reduce or eliminate the manual entry of data from a document into a business process or system. The goal is to capture the critical data on a business document so that the business process associated with that document can move forward.

Document processing software was originally built for paper documents that were converted to electronic documents with document scanners.

Today, many documents start out electronic, so the initial scanning step is not always required. Extracting the important data is still a menial task, though, and it can be automated in almost all situations.

OCR technology should not be confused with document processing. OCR is a component of document processing, but it is just one piece of the bigger picture. It is needed to convert scanned documents into text, or to extract text from electronic files that never contained a text layer, like an image-only PDF, a scanned image, or even a digital photograph.

The 4 Stages of Document Processing

The stages of document processing vary greatly depending on the software package, but they can be generally broken down into four categories:

  1. Capture / Ingest

    • The first process is capturing the document, either through document scanners, email import, or different types of file imports.

      This stage is where documents are collected into batches or handled one-off in an ad-hoc process. Either way, the file or files are moved from whatever form or location they were originally in into the document processing software.
  2. Processing: Cleanup, OCR, Classification, and Extraction

    • Quite a few automated steps happen in the processing stage. This is where document image cleanup and OCR are done, along with any document separation that wasn’t done at import, classification, and data extraction.

      Data is validated against external databases in this step as well. Anything that can run unattended will run in this stage.

  3. Validation / Verification

    • The next step is validating and verifying the data. A lot of the automated validation should be done in the previous step.

      But in this step, the “human in the loop” reviews broken business rules: missing or bad data and any validation errors. Any business rule that requires a human to validate or verify data is handled in this step.

  4. Export

    • The last stage of document processing involves exporting data. Document processing systems are largely middleware products, which means they move data from its original location to either a business workflow or its final resting place. The data and the associated document are exported in the desired format to an external system.
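The four stages above can be sketched as a small pipeline. This is only an illustration, not how any particular product is built: the `Document` class, the function names, and the `ocr_results` lookup (which stands in for a real OCR engine) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    source: str                                   # scanner file, email attachment, file import...
    text: str = ""                                # filled in by the processing stage
    fields: dict = field(default_factory=dict)    # extracted data
    errors: list = field(default_factory=list)    # broken business rules


def capture(sources):
    """Stage 1: collect incoming files into a batch."""
    return [Document(source=s) for s in sources]


def process(doc, ocr_results):
    """Stage 2: unattended work. OCR is faked with a lookup table here."""
    doc.text = ocr_results[doc.source]
    for line in doc.text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            doc.fields[key.strip().lower()] = value.strip()
    return doc


def validate(doc, required):
    """Stage 3: flag missing data so a human can review it."""
    doc.errors = [name for name in required if not doc.fields.get(name)]
    return doc


def export(doc):
    """Stage 4: hand the data to the next system in a simple structure."""
    return {"source": doc.source, "fields": doc.fields, "needs_review": bool(doc.errors)}


# Hypothetical OCR output for two captured files.
ocr_results = {
    "scan_001.tif": "Invoice Number: 1001\nTotal: 250.00",
    "mail_002.pdf": "Invoice Number: 1002",
}
required = ["invoice number", "total"]
results = [export(validate(process(d, ocr_results), required)) for d in capture(ocr_results)]
```

In this sketch the second document is missing a total, so it is the only one flagged for the human in the loop.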

How Does Document Structure Type Affect Processing?

This is where document processing gets tricky. Documents fall into three categories:

  1. Structured
  2. Semi-structured
  3. Unstructured

The less structure there is, the more difficult it is to automatically pull data from a document.

Structured Documents

Structured documents are forms. For a specific kind of form, the format does not change from one document to another. Think of a tax form: the 1040 or 1040-EZ stays the same for at least one entire tax year. The UB-92 and HCFA-1500 forms used in health claims processing are other examples of structured documents.

It is generally easy to extract data from a structured document because you know exactly where the data to be extracted is located. For instance, a Social Security Number will always be in the same location, so extraction can be fine-tuned to get high levels of accuracy.
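Fixed-location extraction can be sketched in a few lines. To keep the example self-contained, the "page" here is a list of text lines and each zone is a row plus a column range; a real system would more likely use pixel coordinates on the scanned image. The zone table, field names, and sample form are all made up for illustration.

```python
# Each zone: (row index, start column, end column). Hypothetical layout.
ZONES = {
    "name": (0, 6, 26),
    "ssn": (1, 6, 17),
}


def extract_zones(page_lines, zones):
    """Read each field from its fixed position on the page."""
    extracted = {}
    for field_name, (row, start, end) in zones.items():
        extracted[field_name] = page_lines[row][start:end].strip()
    return extracted


# A made-up form. Every copy of this form has the same layout.
form = [
    "Name: Jane Q. Public        ",
    "SSN:  000-12-3456",
]
data = extract_zones(form, ZONES)
# data == {"name": "Jane Q. Public", "ssn": "000-12-3456"}
```

Because the zones never move, the same table works for every copy of the form, which is why this approach can be tuned to high accuracy.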

Semi-Structured Documents

Semi-structured documents have some structure, but it varies from document to document. Invoices are the classic example: the data is generally the same on every invoice, but the formatting changes with almost every vendor, in terms of:

  • Location of data
  • Lines or boxes around data

Semi-structured documents are more difficult to extract data from because of this variation. As a result, structured extraction techniques like static zones fail on semi-structured documents. Techniques that analyze the data itself (not the physical location of the data) are needed for successful extraction.
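One common way to analyze the data rather than its location is to search for a field's label and read the value next to it. The sketch below uses regular expressions for that. The label patterns, field names, and the two sample invoices are assumptions for the example, and a production system would need far more label variants than this.

```python
import re

# Hypothetical label variants for two invoice fields.
LABELS = {
    "invoice_number": r"(?:invoice\s*(?:no\.?|number|#)|inv\s*#)",
    "total": r"(?:total\s*due|amount\s*due|total)",
}

# After the label: optional colon or dash, optional dollar sign, then the value.
VALUE = r"\s*[:\-]?\s*\$?\s*([A-Za-z0-9][\w.,\-]*)"


def extract_by_label(text, labels=LABELS):
    """Find each field by its label, wherever it appears on the page."""
    found = {}
    for field_name, label in labels.items():
        match = re.search(label + VALUE, text, re.IGNORECASE)
        if match:
            found[field_name] = match.group(1)
    return found


# Same data, two different vendor layouts.
vendor_a = "ACME Corp\nInvoice No: A-1001\nDate 2022-07-26\nTotal Due: $1,250.00"
vendor_b = "Amount Due 980.50      INV# 77812\nGlobex Ltd"

fields_a = extract_by_label(vendor_a)
fields_b = extract_by_label(vendor_b)
```

Both layouts are handled by the same rules, even though the fields sit in different places and use different labels.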

The variations between documents are what cause problems. Again, think of invoices: the number of vendors you have dictates the potential complexity of the extraction effort. If every variation needs its own set of extraction rules (which is common), it can take a long time to create a solution.

Many approaches handle the variations with the 80/20 rule: start with the highest-volume 20% of documents, which represent 80% of the total dollar amount of the company’s payables. Modern semi-structured systems should scale so that each new variation of a document gets easier to add, rather than requiring the same effort every time.
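As a rough illustration of the 80/20 idea, the sketch below ranks vendors by payables and picks the smallest group that covers a target share of the total dollar amount. The function name, the vendor names, and the spend figures are invented for the example.

```python
def vendors_to_automate_first(spend_by_vendor, target_share=0.80):
    """Pick the highest-spend vendors until they cover the target share of payables."""
    total = sum(spend_by_vendor.values())
    chosen, running = [], 0.0
    ranked = sorted(spend_by_vendor.items(), key=lambda item: item[1], reverse=True)
    for vendor, spend in ranked:
        if running >= target_share * total:
            break
        chosen.append(vendor)
        running += spend
    return chosen, running / total


# Hypothetical annual payables per vendor.
spend = {
    "Vendor A": 50000,
    "Vendor B": 32000,
    "Vendor C": 6000,
    "Vendor D": 5000,
    "Vendor E": 4000,
    "Vendor F": 3000,
}
first_wave, covered = vendors_to_automate_first(spend)
# first_wave == ["Vendor A", "Vendor B"], covering 82% of the dollar amount
```

With these made-up numbers, building extraction rules for two of the six vendors covers most of the dollar amount, and the remaining vendors can be added later.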

Unstructured Documents

Unstructured documents are the last and most difficult category of documents. These are typically:

  • Contracts
  • Letters
  • Any other documents that have no inherent structure

Extracting data from unstructured documents requires different methods than structured or even semi-structured extraction. Modern systems allow methods to be combined for greater flexibility, but two capabilities in particular need to be considered:

  1. First, in order to extract data from unstructured documents, the software has to be able to detect individual paragraphs.
  2. Natural Language Processing techniques can then be applied to each paragraph to determine which category it fits into. Extraction depends on both abilities: finding the paragraphs and categorizing them.
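The two steps above can be sketched as follows. Note that simple keyword matching is used here as a stand-in for real Natural Language Processing, purely to keep the example short and runnable. The categories, keywords, and contract text are invented.

```python
import re

# Hypothetical categories and keywords. A real system would use an NLP model.
CATEGORY_KEYWORDS = {
    "termination": {"terminate", "termination", "notice"},
    "payment": {"payment", "invoice", "fee", "fees"},
}


def split_paragraphs(text):
    """Step 1: detect paragraphs (here, blocks separated by blank lines)."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def categorize(paragraph):
    """Step 2: assign the category whose keywords appear most often."""
    words = set(re.findall(r"[a-z]+", paragraph.lower()))
    scores = {cat: len(words & keywords) for cat, keywords in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


contract = (
    "Either party may terminate this Agreement\n"
    "with thirty (30) days written notice.\n"
    "\n"
    "Client shall pay all fees within 45 days\n"
    "of the invoice date.\n"
    "\n"
    "This Agreement is governed by the laws of Ohio."
)
categories = [categorize(p) for p in split_paragraphs(contract)]
# categories == ["termination", "payment", "other"]
```

Once a paragraph has a category, extraction rules for that category only need to run against that paragraph instead of the whole document.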

The other challenge with unstructured documents is in the way document processing software has historically been designed. Structure was always important, as it indicated the type of document being processed in a lot of cases.

How to Analyze Unstructured Documents

With unstructured documents, there needs to be a method to analyze text that has no structure, as it flows through the document. For instance, in structured and semi-structured documents, related data typically isn’t separated by distance the way it is in a paragraph.

As a sentence flows from left to right (in Latin-based writing systems), it also wraps from top to bottom, so related pieces of data can end up at opposite ends of a paragraph.

Key-value pair extraction (commonly used in structured and semi-structured extraction) must be able to ignore the physical location of data, so that it can match a key and a value that are separated by distance and by characters that are not part of the extraction.
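A minimal sketch of location-independent key-value extraction: the paragraph is flattened so line wraps no longer matter, then the value is matched at any distance (up to a limit) after the key. The function name, the `max_gap` parameter, and the sample clause are assumptions for the example.

```python
import re


def extract_after_key(paragraph, key, value_pattern, max_gap=200):
    """Find the first value that follows the key, skipping unrelated text in between."""
    flat = " ".join(paragraph.split())  # line breaks and extra spaces are ignored
    pattern = re.escape(key) + r".{0," + str(max_gap) + r"}?(" + value_pattern + r")"
    match = re.search(pattern, flat, re.IGNORECASE)
    return match.group(1) if match else None


# The key ("terminate") and the value ("30 days") sit on different lines.
clause = (
    "Either party may terminate this Agreement, for any reason or no\n"
    "reason, upon written notice delivered to the other party no fewer\n"
    "than 30 days before the effective date."
)
notice_period = extract_after_key(clause, "terminate", r"\d+\s+days")
# notice_period == "30 days"
```

The `max_gap` limit is a design choice in this sketch: it keeps the key from being paired with a value that appears much later and belongs to something else.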

How Does Data Validation Help with Document Processing?

Data validation is the most important step of a document processing software system. A misfiled paper document can eventually be tracked down, but a document stored with the wrong data in a document processing solution is effectively gone forever. It will most likely never be found.

All of the automation in the previous steps leads up to validation. The less human involvement is needed, the lower the overall total cost of ownership (TCO) of the system. A more efficient, less costly system is the goal of an AI document processing system.
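Automated validation usually comes down to business rules that either pass or get routed to a person. The sketch below shows three such rules for an invoice record. The field names, the rules themselves, and the vendor list (standing in for an external database) are all hypothetical.

```python
from datetime import datetime


def check_record(record, known_vendors):
    """Return a list of broken business rules; an empty list means no human is needed."""
    problems = []

    # Rule 1: the vendor must exist in an external list (a database in practice).
    if record.get("vendor") not in known_vendors:
        problems.append("vendor not found in vendor list")

    # Rule 2: the invoice date must be present and well formed.
    try:
        datetime.strptime(record.get("invoice_date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("invoice_date is missing or not YYYY-MM-DD")

    # Rule 3: the line items must add up to the stated total.
    line_sum = round(sum(record.get("line_items", [])), 2)
    if line_sum != round(record.get("total", 0.0), 2):
        problems.append("line items do not add up to total")

    return problems


known_vendors = {"ACME Corp", "Globex Ltd"}
good = {"vendor": "ACME Corp", "invoice_date": "2022-07-26",
        "line_items": [100.0, 150.5], "total": 250.5}
bad = {"vendor": "Unknown Co", "invoice_date": "07/26/2022",
       "line_items": [100.0, 150.5], "total": 205.5}

good_problems = check_record(good, known_vendors)   # no broken rules
bad_problems = check_record(bad, known_vendors)     # three broken rules
```

Only the record with broken rules would be shown to an operator; the clean one can go straight to export.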

What's Most Important for Great Data Validation?

A key factor in validation is the user interface. The best document processing software helps a data entry operator work faster. A user interface built for efficient data validation includes:

  • Keyboard shortcuts for most operations, so the data entry operator never has to touch a mouse
  • The document image being validated on one side of the screen, and the index data that has been extracted on the other side of the screen
  • Highlighting on the document of the data that requires validation
  • Common colors for good and bad that make it easy for an operator to spot errors (green and red, for instance)

Validation stations come in many flavors:

  • The best software allows access through a standard web browser and does not charge “per user”.
  • Older, legacy systems charge for the number of validation users or even the stations that connect to the document processing server. With the adoption of web servers in document processing systems, this issue should be eliminated.

How to Export Newly Processed Data to Line-of-Business Systems

The final step in a document processing software system is the export of the extracted and validated data and the resulting file. The data can be exported in various easily consumable formats, like JSON, XML, or even simple CSV files. The document itself is usually exported as PDF now that the national archives have adopted a standard around PDF.

But TIFF files are still common in a lot of document management systems and have some advantages over PDF files. Color files could also be exported in common picture formats like JPG or GIF, but those formats have challenges with long-term viability. PDF and TIFF formats can support color as well.
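On the data side of the export (as opposed to the document file), writing JSON or CSV takes very little code. The sketch below uses only the Python standard library. The record layout and field names are assumptions for the example.

```python
import csv
import io
import json


def to_json(records):
    """Serialize extracted records as JSON text."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records):
    """Serialize extracted records as CSV text with a header row."""
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical validated data, with a pointer to the exported document file.
records = [
    {"invoice_number": "A-1001", "total": "1250.00", "file": "A-1001.pdf"},
    {"invoice_number": "77812", "total": "980.50", "file": "77812.pdf"},
]
json_text = to_json(records)
csv_text = to_csv(records)
```

Keeping the document's file name in each record is one simple way to keep the data and the associated document together when they are exported separately.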

Data might also be exported directly to another business system, like a database or an ERP system. RPA, workflow, and other line-of-business systems are also very common target systems.

The biggest benefit of a document processing system is that it gives much-needed data to the next process. As a result, a workflow or an RPA engine has the data it needs to take its next step without human intervention.

What About Intelligent Document Processing Software?

Intelligent Document Processing (IDP) software is basically a newer version of document processing software.

IDP software products focus on the automation part of document processing, usually bringing in concepts of:

  • Artificial Intelligence (AI)
  • Machine Learning (ML)
  • Natural Language Processing (NLP)