Surprisingly, many document processing systems can’t handle everyday document data projects. Yours included!
So what should you look for in the best processing software? Here are 7 important technologies.
Electronic document processing is the practice of helping people and data management systems get data out of documents that originate in electronic form.
Effectively managing a steady flow of incoming electronic files can quickly become an overwhelming, time-consuming task. Files may arrive in a variety of formats, such as:
These documents may also lack required index values and other vital information needed for processing and retention.
Document Capture has come a long way in thirty years - most things have.
Back then, there were very few “document processing” systems, and the ones that existed were purpose-built to process static forms - think claims processing.
Around the turn of the century, electronic document processing started to become more widespread and expand into semi-structured forms as well, like invoices.
What changed? The traditional capture (archival) product was merged with a very good Optical Character Recognition (OCR) product that was able to do variable key-value pairs.
This is because, in the document scanning world, no two scans are the same. If you scan the same page twice, you have two unique documents - they are not identical. Small variations in document position, roller motor speed, and light produce a slightly different image of the same document every time it is scanned.
Early systems fixed this problem by registering to a known point on the page and adjusting all other zones accordingly. It was kind of like getting a survey on your property - they start with a known point and measure from there.
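Here's a minimal sketch of that registration idea in Python. The Zone class, the register_zones function, and all of the coordinates are made up for illustration, and it only handles a simple shift - it assumes no skew or scaling, which a real system would also have to correct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangular region on the page, in pixels (hypothetical structure)."""
    name: str
    x: int
    y: int
    width: int
    height: int


def register_zones(zones, template_anchor, found_anchor):
    """Shift every template zone by the offset of the anchor on this scan."""
    dx = found_anchor[0] - template_anchor[0]
    dy = found_anchor[1] - template_anchor[1]
    return [Zone(z.name, z.x + dx, z.y + dy, z.width, z.height) for z in zones]


# Template: the anchor mark sits at (100, 50) and the date field at (400, 120).
template_zones = [Zone("claim_date", 400, 120, 180, 30)]

# On this particular scan the anchor was detected 7 px right and 4 px down.
adjusted = register_zones(template_zones, (100, 50), (107, 54))
print(adjusted[0])
```

Every zone moves by the same amount as the known point, just like the surveyor measuring everything from one stake.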
Years ago, Document Capture systems were designed to grab a small amount of data. They got just enough data to file the document away in an ECM (quite literally an electronic file cabinet back in those days) so you could find it again.
Think of it like an assembly line for your document - Capture, Process, Route, or some variation of that theme. Those early systems were really designed with the idea that they needed to use as little compute power as possible because compute was expensive.
Kofax started out with its own controller boards that were installed in PCs to be able to scan at the rated speed of the scanner.
In fact:
It wasn’t until the Pentium 90 that a PC could process OCR as fast as those purpose-built boards.
Back then, certain systems were decent at splitting up scanned batches and distributing the load across multiple OCR engines, but slow OCR was the bottleneck.
That job would have required more than 100 servers just to OCR their 4 million pages per day. This was before Virtual Machines were in vogue, so this meant 100 physical servers. We never went any further with that deal, or that text mining technology, because slow OCR made projects crawl along.
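To put those numbers in perspective, here's a quick back-of-the-envelope calculation. It uses only the two figures above and assumes, for simplicity, that every server runs around the clock and handles one page at a time. Since the job needed more than 100 servers, the real per-page time would have been at least this long.

```python
PAGES_PER_DAY = 4_000_000        # from the text
SERVERS = 100                    # the text says "more than 100", so a lower bound
SECONDS_PER_DAY = 24 * 60 * 60   # assumes servers run around the clock

pages_per_server_per_day = PAGES_PER_DAY / SERVERS
seconds_per_page = SECONDS_PER_DAY / pages_per_server_per_day

print(f"{pages_per_server_per_day:,.0f} pages per server per day")
print(f"{seconds_per_page:.2f} seconds of OCR time per page")
```

Under those assumptions, each server gets through 40,000 pages a day, which works out to a little over two seconds per page.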
We have several customers today using Grooper as part of their product offering. They can’t meet their customers’ service-level agreements without Grooper OCR’ing and processing thousands of documents per minute.
And with this rather lengthy introduction, I want to give you the seven things your IDP system (or electronic document processing system) needs in order to take advantage of modern tech.
This guide will give you the baseline you need to ensure the system you’re looking at is current and state-of-the-art.
Even a 1% improvement in your error rate can have dramatic results in your processing and cost savings!
Good luck, and happy processing!
All of these systems are middleware systems. What I mean by that is that they are not the system of record. Instead, they feed those systems, or other intermediate systems (like Case Management), until the document hits its final resting place.
You need to be able to customize the system as needed and to build new solutions from old ones. Even in a low-code environment, you should be able to add functionality through a built-in scripting integration when you need to.
But flexibility goes beyond that - everything about the system should be as flexible as possible. Think about how you can solve tomorrow’s problems with tech that was built today.
I started in this industry in 1994. Looking back, I wish I could get back all the time I wasted resending a full batch through just to test one single little thing.
Now I can:
This is what you need to be fully efficient in an IDP solution. There’s never a “you have to clear all the batches out of the system” moment. Never again.
The only successful systems in production today are based on supervised machine learning. The systems that tout an AI that learns as you use it are misleading at best, and at worst they just don’t work in production.
Supervised machine learning allows a trained product administrator to “train” or “teach” the system how to treat a specific document. The problem is the word “train.” We tend to think that’s a simple one-time process, but it’s not.
It’s the iterative approach called ‘Textual Disambiguation’. This means that you form a hypothesis, test it against your documents, adjust, and repeat. You want a system that facilitates this test-fix-test approach as much as possible.
Having to scan a batch all the way through a process to see if it worked is 30-year-old technology. You should be able to test down to every single layer of your system - a single extractor for a date, for instance.
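Here's a rough sketch of what testing a single layer can look like, using plain Python rather than any particular product's API. The extract_dates function and the sample strings are hypothetical, and the pattern only covers MM/DD/YYYY dates.

```python
import re
from datetime import date

# Hypothetical extractor: matches US-style dates such as 03/15/2021.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def extract_dates(text):
    """Return every valid MM/DD/YYYY date found in the text."""
    found = []
    for month, day, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Looks like a date but is not one (for example 13/45/2021).
            continue
    return found


# Test the extractor alone against a few sample strings, no batch required.
samples = {
    "Effective 03/15/2021 through 03/14/2022": [date(2021, 3, 15), date(2022, 3, 14)],
    "Invoice 13/45/2021 is not a date": [],
    "No dates here": [],
}
for text, expected in samples.items():
    result = extract_dates(text)
    status = "ok" if result == expected else "FAIL"
    print(f"{status}: {text!r} -> {result}")
```

When a sample fails, you adjust the pattern and rerun just this check. That's the test-fix-test loop, without pushing a whole batch through the pipeline.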
NLP has come to the forefront of all our minds because we interact with it daily. Siri, Alexa and any other voice-recognition systems are powered by NLP.
Instead, what you want is a system that borrows the concepts of NLP to apply to document processing. For instance, if I’m looking for an effective date for a contract, I can train Grooper what a valid effective date looks like. But this also means I have to have the ability to recognize paragraphs first.
Systems that haven’t done this pre-work are left struggling with finding chunks of data. In fact, they can’t find the right chunk because it’s all one big chunk to them.
A modern system will allow for data sectioning, which is the ability to limit the scope of what’s sent to the processing engine. All of these systems do better with less data, not more. If there’s only one date, it’s easier to tell which is the right date.
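To make the idea concrete, here's a small Python sketch of data sectioning. It's an illustration under simple assumptions, not how any particular product does it: paragraphs are separated by blank lines, the right paragraph is the one that mentions a keyword, and the function names and sample contract are invented.

```python
import re

DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def split_paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def find_effective_date(text, keyword="effective date"):
    """Look for dates only inside paragraphs that mention the keyword."""
    for paragraph in split_paragraphs(text):
        if keyword in paragraph.lower():
            dates = DATE_PATTERN.findall(paragraph)
            if dates:
                return dates[0]
    return None


contract = """This Agreement is signed on 01/05/2021 by both parties.

The Effective Date of this Agreement is 02/01/2021.

Either party may terminate after 02/01/2024 with written notice."""

print(DATE_PATTERN.findall(contract))   # every date in the whole document
print(find_effective_date(contract))    # only the date in the scoped section
```

Run against the whole document, the date pattern returns three candidates. Scoped to the one paragraph that matters, there's only one date left to choose from.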
Grooper is currently processing billions of items per day for one of our largest customers.
Billions. Per day.
When you approach the billion milestone, scale is a whole different issue. If your system is sending data across the bus of the server, it just can’t scale.
Now, you may say you don’t need a system that processes a billion items per day. That’s fair enough. But if we’re talking about modernization, you want to make sure that the faster the hardware, the faster your processing will be, and that can only be fully realized with in-memory processing.
The ability for systems to recognize images has become a cottage industry over the last decade. But, those systems were designed to spot features from pictures, not business documents.
However, a modern system will borrow from Computer Vision (CV) when needed to add extra value, or another set of “eyes” to really help ensure that business data is valid before sending it on to the next process. Similarly, a lot of systems out there are using the same image processing algorithms.
These algorithms, like deskew, line removal, and more, are critical to getting clean images, which in turn produce even better OCR. This means better data in, and better data out.
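As a toy illustration of one of these cleanup steps, here's a bare-bones line removal in plain Python. It assumes the page is already a black-and-white bitmap (1 for black, 0 for white), and the function name and threshold are made up. A production algorithm also has to preserve characters that touch or cross the line, which this sketch does not attempt.

```python
def remove_horizontal_lines(bitmap, min_run=6):
    """Clear horizontal runs of black pixels (1s) at least min_run long."""
    cleaned = [row[:] for row in bitmap]  # work on a copy of the page
    for row in cleaned:
        start = None
        # Append a sentinel 0 so a run touching the right edge is closed.
        for x, pixel in enumerate(row + [0]):
            if pixel and start is None:
                start = x
            elif not pixel and start is not None:
                if x - start >= min_run:
                    row[start:x] = [0] * (x - start)
                start = None
    return cleaned


page = [
    [0, 1, 1, 0, 0, 1, 0, 0],  # short strokes, like parts of characters
    [1, 1, 1, 1, 1, 1, 1, 1],  # a ruled line across the full width
    [0, 1, 0, 0, 1, 1, 0, 0],
]
cleaned = remove_horizontal_lines(page)
for row in cleaned:
    print("".join("#" if pixel else "." for pixel in row))
```

The long ruled line disappears while the short strokes stay, so the OCR engine sees the characters without the form's lines running through them.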
Make sure the system you are evaluating is using something fully functional and not Open Source. Those algorithms are old and outdated, and rife with security issues that vendors have great difficulty keeping up with. A tell-tale way to determine this is to look at the number of image cleanup offerings a system has. If the answer is 10 or fewer, it’s out of date.
This last one gets me. When I first started working in this industry, I was proud to be part of an industry-leading company. But, with each acquisition I was part of, I saw something disturbing.
In most cases, the acquisition resulted in not much more than a new logo and name. I’ve seen this now for almost 30 years and I can say that I don’t think I’ve ever seen a product get better after an acquisition.
I’ve seen all the consolidation in this industry first-hand. Grooper was built as a direct result of the lack of innovation and technology in this industry.
Our CEO says this is the hardest thing he’s ever done. We didn’t go down this route because we wanted to, but because we had to in order to meet our clients’ needs. You need a system that: