
What is Electronic Document Processing? How to Boost Your System

Written by Tim McMullin | May 9, 2022

Surprisingly, many document processing systems can’t handle everyday document data projects. Yours included!

So what should you look for in the best processing software? Here are 7 important technologies.

What is Electronic Document Processing?

Electronic document processing is the practice of getting data out of documents that originate in electronic form, for the benefit of both humans and data management systems.

A large share of today’s documents originate electronically. While this removes the need for scanning, many of them still need some type of processing, and almost all require structured document management and retention.

Effectively managing a steady flow of incoming electronic files can quickly become an overwhelming, time-consuming task. Files may arrive in a variety of formats, such as:

  • PDF
  • Word
  • Excel
  • Outlook messages (.msg)
  • Text files
  • And more

These documents may also lack required index values and other vital information needed for processing and retention.


A Brief History of Electronic Document Processing

Document Capture has come a long way in thirty years - but then, most things have.

What we now call Electronic or Intelligent Document Processing grew directly out of pure document capture. The first iteration of electronic document processing (whatever you prefer to call it) primarily handled archiving.

There were very few “document processing” systems, and the ones that existed were purpose-built to process static forms - think claims processing.

Around the turn of the century, electronic document processing started to become more widespread and expand into semi-structured forms as well, like invoices.

Ch-Ch-Ch-Changes

What changed? The traditional capture (archival) product was merged with a very good Optical Character Recognition (OCR) product that could extract variable key-value pairs.

This technology still required “anchors,” though. That means I had to have something static on the form to “register” the X,Y coordinates of the document, or even a single field, so that the OCR could account for “float” - the drift in position from scan to scan.

This is because, in the document scanning world, no two scans are the same. If you scan the same page twice, you get two unique documents - they are not identical. Small variations in document position, roller motor speed, and lighting produce a different version of the same document every time it is scanned.

Early systems fixed this problem by registering to a known point on the page and adjusting all other zones accordingly. It was kind of like getting a survey on your property - they start with a known point and measure from there.
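To make the idea concrete, here’s a minimal sketch of anchor-based registration. The anchor position, zone coordinates, and field names are all hypothetical - real capture systems found the anchor by matching a static mark (a logo, a corner box) on the scanned image - but the zone-shifting arithmetic is the whole trick:

```python
# A minimal sketch of anchor-based registration. The anchor position,
# zones, and offsets are hypothetical; real systems found the anchor by
# matching a static mark on the scanned image.

# Where the anchor sits on the master form template, in pixels.
TEMPLATE_ANCHOR = (120, 80)

# OCR zones defined against the template: (left, top, width, height).
TEMPLATE_ZONES = {
    "claim_number": (500, 75, 200, 30),
    "patient_name": (150, 220, 300, 30),
}

def register_zones(detected_anchor):
    """Shift every zone by however far the anchor drifted on this scan."""
    dx = detected_anchor[0] - TEMPLATE_ANCHOR[0]
    dy = detected_anchor[1] - TEMPLATE_ANCHOR[1]
    return {
        name: (left + dx, top + dy, w, h)
        for name, (left, top, w, h) in TEMPLATE_ZONES.items()
    }

# On this scan the anchor landed 7px right and 4px down of nominal,
# so every OCR zone floats with it.
print(register_zones(detected_anchor=(127, 84))["claim_number"])
# (507, 79, 200, 30)
```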

Little Power = Little Data

Years ago, Document Capture systems were designed to grab a small amount of data - just enough to file the document away in an ECM (enterprise content management) system, which in those days was quite literally an electronic file cabinet, so you could find it again.

The big benefit of those systems was their powerful Key-From-Image capabilities. The systems were designed for high-speed index operators. In fact, the right operators could get 20,000 keystrokes an hour and never touch a mouse.

Think of it like an assembly line for your document - Capture, Process, Route, or some variation of that theme. Those early systems were really designed with the idea that they needed to use as little compute power as possible because compute was expensive.

Kofax started out with its own controller boards, installed in PCs, so that scanning could run at the scanner’s rated speed.

In fact:

  • A single board cost $3,000
  • OCR was done on a series of dedicated boards also installed in a PC
  • One model from Calera had four boards in a single PC

It wasn’t until the Pentium 90 that a PC could process OCR as fast as those purpose-built boards.

Slow & Steady Doesn’t Win the Electronic Document Processing Race

Back then, certain systems were decent at splitting up scanned batches and distributing the load across multiple OCR engines, but slow OCR was the bottleneck.

I remember talking to Countrywide Mortgage (remember them?) about a new text mining technology that had just been brought to market. They said, “We’d have to OCR everything?”

That job would have required more than 100 servers just to OCR their 4 million pages per day. This was before Virtual Machines were in vogue, so that meant 100 physical servers. We never went any further with that deal - or that text mining technology - because slow OCR made projects crawl along.

Speed Kills in Electronic Document Processing

Fast OCR is probably the biggest difference from the early days to now. Today, my two-year-old laptop can process 6,000-8,000 images an hour. This drastic change has brought about the ability to automate processing at the front-end of a process instead of just archival scanning.

We have several customers today using Grooper as part of their product offering. They can’t meet their customers’ service-level agreements without Grooper OCRing and processing thousands of documents per minute.

What is this Electronic Document Processing Guide for?

With that rather lengthy introduction out of the way, I want to give you the seven things your IDP (intelligent document processing) system - or electronic document processing system - needs in order to take advantage of modern tech.

This guide will give you the baseline you need to ensure the system you’re looking at is current and state-of-the-art.

Just a 1% improvement in your error rate will have a dramatic effect on your throughput and cost savings!
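To see why, here’s some back-of-the-envelope math. Every number below is an illustrative assumption, not a benchmark - plug in your own volumes:

```python
# Back-of-the-envelope error-rate math. Every number here is an
# illustrative assumption, not a measurement.
docs_per_day = 10_000          # assumed daily volume
minutes_per_exception = 3      # assumed manual review time per failed doc

for error_rate in (0.05, 0.04):  # a one-point improvement: 5% -> 4%
    exceptions = docs_per_day * error_rate
    hours = exceptions * minutes_per_exception / 60
    print(f"{error_rate:.0%} errors -> {exceptions:.0f} exceptions, "
          f"{hours:.0f} review hours per day")

# 5% errors -> 500 exceptions, 25 review hours per day
# 4% errors -> 400 exceptions, 20 review hours per day
# One point saves ~5 operator-hours every single day at this volume.
```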

Good luck, and happy processing!

7 Ways to Modernize Your Electronic Document Processing

#1: Flexibility to Solve Tomorrow's Document Problems

All of these systems are middleware systems. What I mean by that is that they are not the system of record. Instead, they feed those systems, or other intermediate systems (like Case Management), until the document reaches its final resting place.

The number one thing your system needs is the flexibility to carry you into the next big thing in computing. How do you know if you have a system like that? Well, it’s open.

You need to be able to customize it as needed and to build new from old. Even in a low-code environment, you should be able to add functionality through a built-in scripting integration when you need to.

But flexibility goes beyond that - everything about the system should be as flexible as possible. Think about how you can solve tomorrow’s problems with tech that was built today.


#2: Change Document Batches in Process

I started in this industry in 1994. Looking back, I wish I could get back all the time I wasted resending a full batch through just to test one single little thing.

I wasted time on silly things like adding logging to text files and pop-up status windows, just so I could see what was going on in the system. Fast forward many years to my first experience with Grooper, and I was stunned that BIS had solved this problem.

Now I can:

  • Test anything at any time, at any level - a single field, a single regular expression (regex), a single page, a full batch, etc. (see the sketch after this list)
  • Seamlessly move batches into and out of production
  • Make a copy of a batch with a single click and work on the copy without affecting the model
  • Push the new model seamlessly back into production when I’m ready, and even update batches in progress as needed
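Grooper exposes this through its own testing tools; here’s the same idea as a plain Python sketch, with a hypothetical date pattern and samples, just to show what field-level testing looks like when you don’t have to push a whole batch through:

```python
import re

# A hypothetical date extractor - exactly the kind of single, low-level
# piece you should be able to test in isolation, with no batch involved.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

samples = [
    ("Effective Date: 05/09/2022", "05/09/2022"),  # should match
    ("Dated this 9th day of May", None),           # should not match
]

for text, expected in samples:
    match = DATE_PATTERN.search(text)
    found = match.group(0) if match else None
    print("PASS" if found == expected else "FAIL", "-", repr(text), "->", found)
```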

This is what you need to be fully efficient in an IDP solution. There’s never a “you have to clear all the batches out of the system” moment. Never again.

 

#3: Supervised Machine Learning

The only successful systems in production today are based on supervised machine learning. The systems that tout an AI that learns as you use it are misleading at best - and at worst, they just don’t work in production.

Even the leaders in AI research have recently said that big data alone will not solve the AI problem - only cleaner source data and the ability for each organization to build its own AI document processing system will. That’s exactly what a good supervised machine learning system delivers.

It allows a trained product administrator to “train” or “teach” the system how to treat a specific document. The problem is the word “train.” We tend to think that’s a simple one-time process, but it’s not.

Instead, training is an iterative approach called ‘Textual Disambiguation’: you form a hypothesis, test it against your documents, adjust, and repeat. You want a system that facilitates this test-fix-test approach as much as possible.

Having to scan a batch all the way through a process to see if it worked is 30-year-old technology. You should be able to test down to every single layer of your system - a single extractor for a date, for instance.
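Here’s a sketch of that test-fix-test loop in plain Python. The labeled samples and candidate patterns are hypothetical, but the shape of the iteration is the point: score a hypothesis, refine it, score again:

```python
import re

# A sketch of the test-fix-test loop: score a candidate extractor against
# labeled samples, adjust the hypothesis, and score again. The samples
# and patterns are hypothetical.
labeled = [
    ("Effective Date: 05/09/2022", "05/09/2022"),
    ("EFFECTIVE 5/9/22", "5/9/22"),
    ("Expires: 01/01/2030", None),  # an expiration date, not an effective date
]

def score(pattern):
    """Return the fraction of labeled samples the hypothesis gets right."""
    hits = 0
    for text, expected in labeled:
        match = re.search(pattern, text)
        found = match.group(1) if match else None
        hits += (found == expected)
    return hits / len(labeled)

# Hypothesis 1: any date is the effective date. Fails on the expiration sample.
print(round(score(r"(\d{1,2}/\d{1,2}/\d{2,4})"), 2))                       # 0.67
# Hypothesis 2: only a date near the word "effective". Passes all three.
print(round(score(r"(?i)effective\D{0,10}(\d{1,2}/\d{1,2}/\d{2,4})"), 2))  # 1.0
```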

#4: Natural Language Processing (NLP) Built for Documents

NLP has come to the forefront of all our minds because we interact with it daily. Siri, Alexa, and other voice assistants are powered by NLP.

But that is what those systems were designed for. IDP products fail when they take a standard NLP library and pack it into a new release - it’s the wrong tool for the job.

Instead, what you want is a system that borrows the concepts of NLP and applies them to document processing. For instance, if I’m looking for the effective date of a contract, I can train Grooper on what a valid effective date looks like. But that also means the system has to be able to recognize paragraphs first.

Systems that haven’t done this pre-work are left struggling with finding chunks of data. In fact, they can’t find the right chunk because it’s all one big chunk to them.

A modern system will allow for data sectioning - the ability to limit the scope of what’s sent to the processing engine. All of these systems do better with less data, not more: if a section contains only one date, it’s easy to tell which date is the right one (see the sketch below).
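Here’s a minimal sketch of that idea. The contract text is invented, and real sectioning is driven by trained layout and paragraph detection rather than blank lines, but it shows how scoping shrinks the problem:

```python
import re

# A sketch of data sectioning: split the document into paragraphs, keep
# only the section that talks about the field we want, and send just
# that slice to the extractor. The contract text is hypothetical.
contract = """This Agreement is made between Acme Corp and Widget LLC.

This Agreement is effective as of 05/09/2022 and supersedes all
prior agreements.

Payment is due within 30 days. Late fees accrue from 06/01/2022."""

paragraphs = contract.split("\n\n")
in_scope = [p for p in paragraphs if "effective" in p.lower()]

# Only one date survives the sectioning, so there is nothing to disambiguate.
for section in in_scope:
    print(re.findall(r"\d{1,2}/\d{1,2}/\d{4}", section))  # ['05/09/2022']
```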

#5: In-Memory Processing

Grooper is currently processing billions of items per day for one of our largest customers.

Billions. Per day.

This isn’t an OCR process; it’s a text processing process. But to accomplish that with relatively modest server specs, we had to make sure nothing got in the way.

When you approach the billion-item milestone, scale is a whole different issue. If your system is pushing intermediate data across the server’s bus - out to disk and back - between steps, it just can’t scale.

Now, you may say you don’t need a system that processes a billion items per day. That’s fair enough. But if we’re talking about modernization, you want to make sure that the faster the hardware, the faster your processing will be - and that can only be fully realized with in-memory processing.
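As a toy illustration of the difference - the classify and extract steps below are placeholders; the only thing that matters is where the intermediate results live between steps:

```python
# A toy illustration. classify and extract are placeholder steps; what
# matters is where intermediate results live between them.

def classify(text):
    return "invoice"  # placeholder classification step

def extract(text):
    return {"total": "100.00"}  # placeholder extraction step

# Disk-staged pipeline: every step round-trips through storage, so the
# server's I/O bus, not its CPU, sets the throughput ceiling.
def process_staged(doc_id, text):
    with open(f"/tmp/{doc_id}.txt", "w") as f:
        f.write(text)                  # stage the input
    with open(f"/tmp/{doc_id}.txt") as f:
        doc_type = classify(f.read())  # read it back just to classify
    with open(f"/tmp/{doc_id}.type", "w") as f:
        f.write(doc_type)              # stage the intermediate result
    # ...and so on, one round-trip per step.

# In-memory pipeline: intermediates are plain objects handed to the next
# step, so faster CPUs and more cores translate directly into throughput.
def process_in_memory(text):
    if classify(text) == "invoice":
        return extract(text)
    return None

print(process_in_memory("Invoice total: 100.00"))  # {'total': '100.00'}
```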

 

#6: Computer Vision and Updated / Improved Image Cleanup Tools

The ability for systems to recognize images has become a cottage industry over the last decade. But those systems were designed to spot features in photographs, not business documents.

One or two of those engines have side projects that genuinely have advanced the science of getting text out of pictures - but that’s all they do; they’re not focused on business documents.

However, a modern system will borrow from Computer Vision (CV) when needed to add extra value - another set of “eyes” to help ensure that business data is valid before sending it on to the next process. Meanwhile, a lot of systems out there are using the same image processing algorithms.

These algorithms, like deskew, line removal, and more, are critical to getting clean images, which results in even better OCR. Better data in, better data out.
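As an illustration of what one of these tools does, here’s a minimal deskew sketch using OpenCV’s Python bindings. It’s the classic textbook recipe, not anyone’s production algorithm, and angle conventions vary between OpenCV versions:

```python
import cv2
import numpy as np

def deskew(image):
    """Estimate page skew from the ink pixels and rotate it back to level.
    A textbook illustration only - production cleanup tools are far more
    robust, and minAreaRect angle conventions vary by OpenCV version."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Invert so ink is white, then Otsu-threshold to isolate it.
    binary = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
    # The tightest rotated rectangle around all ink pixels gives the skew.
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90  # map (0, 90] to a small correction around zero
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

# page = cv2.imread("scan.png")
# cv2.imwrite("scan_deskewed.png", deskew(page))
```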

Make sure the system you’re evaluating is using something fully functioning and not Open Source. Those algorithms are old and outdated, and rife with security issues that vendors have great difficulty keeping patched. A tell-tale way to check is to look at the number of image cleanup offerings a system has. If the answer is 10 or fewer, it’s out of date.

 

#7: Software That Continues to Be Innovated

This last one gets me. When I first started working in this industry, I was proud to be part of an industry-leading company. But, with each acquisition I was part of, I saw something disturbing.

Sure, we all made a lot of money every time the company sold, and I’ve been part of that type of thing several times now. But what I didn’t like was that the product innovation didn’t keep up. The money was there, but it wasn’t pushed back into R&D - it was pushed into marketing.

In most cases, the result of the acquisition was not much more than a new logo and name. I’ve seen this now for almost 30 years, and I don’t think I’ve ever seen a product get better after an acquisition.

I’ve seen all the consolidation in this industry first-hand. Grooper was built as a direct result of the lack of innovation and technology in this industry.

Our CEO says this is the hardest thing he’s ever done. We didn’t go down this route because we wanted to - we had to in order to meet our clients’ needs. You need a system that:

  • Continues to innovate and add new, compelling, client-requested features to the product
  • Finds new ways to save you time and money while processing these items
  • Continually pushes the envelope

Automation is cutting edge, so why would anything less do for your automation project?