Document processing software is designed to reduce or eliminate the manual entry of data from a document into a business process or system. The goal is to pull the critical data out of a business document so that the business process associated with that document can move forward.
Today, most documents are already electronic, so the initial scanning step is not always required. Extracting the important data, however, is still a menial task that can be automated in almost all situations.
The stages of document processing vary greatly depending on the software package, but they can generally be broken down into four categories:
Data is also validated against external databases in this stage. Anything that can run unattended runs here.
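For instance, an unattended check of extracted fields against an external reference might look like the minimal sketch below; the vendor master lookup, field names, and rules are assumptions for illustration, not any particular product's API.

```python
# Minimal sketch: unattended validation of extracted fields against an
# external reference. vendor_master stands in for a real database or ERP
# lookup; the field names and rules are illustrative assumptions.
vendor_master = {"ACME-001": "Acme Supply Co.", "GLOBO-042": "Globex Corp."}

def auto_validate(fields: dict) -> list[str]:
    """Return the rule violations that automation alone cannot resolve."""
    errors = []
    if fields.get("vendor_id") not in vendor_master:
        errors.append("vendor_id not found in vendor master")
    try:
        if float(fields.get("invoice_total", "")) <= 0:
            errors.append("invoice_total must be positive")
    except ValueError:
        errors.append("invoice_total is not a number")
    return errors

print(auto_validate({"vendor_id": "ACME-001", "invoice_total": "1280.50"}))  # []
```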
The next step is validating and verifying the data. Most of the automated validation should already have happened in the previous step.
In this step, the “human in the loop” reviews broken business rules - missing or bad data and any validation errors. Any business rules that require a human to validate or verify data are handled in this step.
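Picking up where the automated checks leave off, here is a small hedged sketch of how failing documents might be routed to a human review queue; the queue structure and function names are assumptions for illustration.

```python
# Illustrative sketch: documents with broken rules go to a human review
# queue, while clean documents pass straight through. The structure is an
# assumption, not a specific product's workflow API.
review_queue: list[dict] = []

def route(fields: dict, errors: list[str]) -> str:
    """Route a document based on the errors found by the automated checks."""
    if errors:
        review_queue.append({"fields": fields, "errors": errors})
        return "needs_review"
    return "auto_approved"

print(route({"vendor_id": "ACME-001", "invoice_total": "1280.50"}, []))
print(route({"vendor_id": "UNKNOWN"}, ["vendor_id not found in vendor master"]))
```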
The last stage of document processing involves exporting data. Document processing systems are largely middleware products, which means they move data from its original location to either a business workflow or its final resting place. The data and the associated document are exported in the desired format to an external system.
The less structure a document has, the more difficult it is to automatically pull data from it.
It is generally easy to extract data from a structured document because you know exactly where the data that needs to be extracted is located. For instance, a Social Security Number will always appear in the same spot on a fixed form, so extraction can be fine-tuned to achieve high levels of accuracy.
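A minimal sketch of what that zone-based, location-driven extraction can look like; the OCR word format and the zone coordinates are invented values for a hypothetical form layout.

```python
# Minimal sketch: static-zone extraction from a structured form.
# Assumes OCR output as (text, x, y, width, height) tuples; the zone
# coordinates are made-up values for a hypothetical form layout.
ocr_words = [
    ("John", 80, 120, 40, 14),
    ("123-45-6789", 410, 120, 110, 14),  # value printed inside the SSN box
]

ZONES = {"ssn": (400, 110, 130, 30)}  # (x, y, width, height) of the SSN field

def in_zone(word, zone):
    _text, x, y, _w, _h = word
    zx, zy, zw, zh = zone
    return zx <= x <= zx + zw and zy <= y <= zy + zh

def extract_zone(field):
    return " ".join(w[0] for w in ocr_words if in_zone(w, ZONES[field]))

print(extract_zone("ssn"))  # 123-45-6789
```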
Semi-structured documents are more difficult to extract data from because of the varying nature of the data. As a result, structured extraction techniques like static zones fail with semi-structured documents. Different techniques that analyze the data itself (rather than its physical location) are needed for successful extraction.
But it is the variations between documents that cause problems. Again, think of invoices: how many vendors you have dictates the potential complexity of the extraction effort. If every variation needs its own set of extraction rules (which is common), building a solution can take a long time.
Many approaches break the variations down using the 80/20 rule - tackling the highest-volume 20% of documents, which typically represent 80% of the total dollar amount of the company’s payables. Modern semi-structured systems should scale so that handling each new variation of a document gets easier, rather than requiring a linear amount of effort.
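To make “analyzing the data rather than its location” concrete, here is a small hedged sketch of pattern-based extraction from a semi-structured document; the labels, regular expressions, and field names are assumptions for illustration.

```python
import re

# Illustrative sketch: pattern-based extraction for a semi-structured
# document such as an invoice. Values are found by their labels and data
# patterns, not by a fixed position on the page.
text = """
Globex Corp.            Invoice No: INV-20931
Ship to: 742 Evergreen Terrace
Amount Due:   $1,280.50
"""

patterns = {
    "invoice_number": re.compile(r"Invoice\s*(?:No|Number)[:#]?\s*(\S+)", re.I),
    "total": re.compile(r"(?:Amount Due|Total)[:\s]*\$?([\d,]+\.\d{2})", re.I),
}

extracted = {field: (m.group(1) if (m := rx.search(text)) else None)
             for field, rx in patterns.items()}
print(extracted)  # {'invoice_number': 'INV-20931', 'total': '1,280.50'}
```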
Extracting data from unstructured documents requires different methods than structured or even semi-structured extraction. Modern systems allow these methods to be combined for greater flexibility, but there are several different methods that need to be considered:
The other challenge with unstructured documents lies in the way document processing software has historically been designed. Structure was always important, because in many cases it indicated the type of document being processed.
But with unstructured documents, there needs to be a way to analyze text without structure, as it flows through the document. In a structured or semi-structured environment, for instance, data is not typically separated by distance the way it is within a paragraph.
As a sentence flows from left to right (in Latin-based scripts), it also wraps down the page (top to bottom), so related pieces of data can end up at opposite ends of a paragraph.
Key-value pair extraction (commonly used in structured and semi-structured extraction) must therefore be able to ignore the physical location of data, so that a value can be extracted even when it is separated from its key by distance and by characters that are not part of the extraction.
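One way to picture key-value extraction over flowing text is the hedged sketch below: find the key anywhere in the stream, then look past the intervening words for a value that matches the expected pattern. The key names, patterns, and window size are assumptions for illustration.

```python
import re

# Illustrative sketch: key-value extraction over flowing, unstructured
# text. The value may be separated from its key by words that are not
# part of the extraction, so we scan a window after the key instead of
# relying on a fixed position on the page.
paragraph = ("Per our agreement, the purchase order referenced above, "
             "number 4500012345 as confirmed by your office, should be "
             "paid no later than March 3, 2025.")

def extract_value(text: str, key: str, value_pattern: str, window: int = 80):
    """Find `key`, then the first `value_pattern` match within `window` characters."""
    key_match = re.search(key, text, re.I)
    if not key_match:
        return None
    tail = text[key_match.end(): key_match.end() + window]
    value = re.search(value_pattern, tail)
    return value.group(0) if value else None

print(extract_value(paragraph, r"purchase order", r"\b\d{10}\b"))  # 4500012345
```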
Data validation is the most important step in a document processing system. Unlike a misfiled paper document, which might eventually turn up, a document indexed with wrong data in a document processing solution is effectively gone forever. It will most likely never be found.
All of the automation in the previous steps leads up to validation. The less human involvement needed, the lower the overall total cost of ownership (TCO) of the system. The system becomes more efficient and costs less, which is the goal of an AI document processing system.
A key factor in validation is the user interface. The best document processing software helps a data entry operator work faster. A great user interface for the best data validation efficiency includes:
The final step in a document processing software system is the export of the extracted and validated data and the resulting file. The data can be delivered in various easily consumable formats, like JSON, XML, or even simple CSV files. The document format is usually PDF, now that national archives have adopted PDF-based archival standards such as PDF/A.
But TIFF files are still common in a lot of document management systems and have some advantages over PDF files. Color files could also be exported in common picture formats like JPG or GIF, but those formats have challenges with long-term viability. PDF and TIFF formats can support color as well.
Data might also be exported directly to another business system, like a database or an ERP system. RPA, workflow, and other line-of-business systems are also very common target systems.
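As a small hedged sketch of the export step, here is one way the extracted and validated fields might be written out as JSON and CSV for a downstream system; the field names and file paths are assumptions for illustration, not tied to any particular product.

```python
import csv
import json

# Illustrative sketch: write validated fields to consumable formats.
# Field names and output paths are assumptions for the example.
record = {"vendor_id": "ACME-001", "invoice_number": "INV-20931",
          "invoice_total": "1280.50", "document": "invoice_20931.pdf"}

with open("export.json", "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)

with open("export.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(record.keys()))
    writer.writeheader()
    writer.writerow(record)
```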
The biggest benefit of a document processing system is that it delivers much-needed data to the next process. As a result, a workflow or an RPA engine has the data it needs to take its next step without human intervention.
IDP software products focus on the automation part of document processing, usually bringing in concepts of: