5 Reasons Why Document Data Integration is Causing Massive Disruption

by Jesse Spencer | February 12, 2020

When you think about artificial intelligence and data integration, what's the first thing that comes to mind? Maybe deep learning neural nets crunching away at big data sets? Or aggregating data from dozens or hundreds of repositories and streaming it into analytics or business intelligence platforms? Or maybe predictive healthcare and extending human life?

I'm willing to bet you didn't think about data integration from documents and text files. It's got to be one of the most obscure use cases for data integration. And yet, believe it or not, it's extremely relevant and getting new attention.

But why?


This article discusses 5 important technology innovations disrupting document data integration.

Many industries rely on documents for important workflows. In healthcare, large text files carry explanations of benefits, retrospective denials arrive by email, claims pile up, and the list goes on and on. In manufacturing and supply chain, there's an endless stream of logistics documents. Legal runs on paper, and so does the oilfield. Education processes transcripts at a dizzying rate. Financial institutions process everything from mortgage and loan documents to personal checks. And you know that government loves its forms...

Organizations in nearly every industry are saturated with paper and store millions of archived records, and all of that data is a gold mine.

But why the renewed focus on documents? The reason is that technology has finally caught up. For decades, data contained on paper has been extremely difficult to integrate. While it's true that tools have existed for setting up rigid templates that "know" where certain data is on a document, their use is extremely limited.

In the real world, these templates have caused a lot of suffering because of how fragile they are. If a word or number falls just outside of where the template is looking, another template must be created to find it. This is hardly scalable.
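
To make that fragility concrete, here's a minimal sketch of how zonal template extraction works. The zone coordinates and field positions are hypothetical; the point is that the template reads whatever falls inside a fixed bounding box, so a small layout shift makes the field vanish.

```python
# A minimal sketch of why rigid zonal templates are fragile. The zone
# coordinates and field positions below are hypothetical; template tools
# read whatever text falls inside a fixed bounding box.
TEMPLATE = {"invoice_total": (400, 700, 550, 730)}  # (x1, y1, x2, y2) zone

def extract_field(words, zone):
    """Return the words whose positions fall inside the fixed zone."""
    x1, y1, x2, y2 = zone
    return [w["text"] for w in words if x1 <= w["x"] <= x2 and y1 <= w["y"] <= y2]

# Page A matches the template; page B's layout shifted down by 60 pixels,
# so the same field silently disappears and a new template is needed.
page_a = [{"text": "$1,240.00", "x": 420, "y": 710}]
page_b = [{"text": "$1,240.00", "x": 420, "y": 770}]
print(extract_field(page_a, TEMPLATE["invoice_total"]))  # ['$1,240.00']
print(extract_field(page_b, TEMPLATE["invoice_total"]))  # []
```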

Five Technology Innovations Have Disrupted Document Data Integration

1. New Computer Vision Techniques

Computer vision (CV) is the technology responsible for making scanned documents machine-readable. Non-text artifacts on a document are no problem for a human to read past: we understand that a hole punch is not a word, and that stamps, lines, barcodes, and images are all just there to support the intent of the document. Machines need help making that distinction.

Wait, what about optical character recognition?

Glad you asked! OCR is only as good as the document image it runs on. Modern analytics and business intelligence platforms (and neural nets) all require very accurate (and labeled) data. Traditional OCR's low accuracy doesn't produce acceptable data. This is one of the reasons document data integration has been difficult to achieve.

New CV algorithms paired with advanced hardware acceleration enable near-100% OCR accuracy with both new and traditional OCR engines.
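
As an illustration, here's a minimal sketch of the kind of image clean-up that happens before OCR, assuming the OpenCV (cv2) and pytesseract libraries with Tesseract installed. The input file name is hypothetical, and real pipelines add steps like deskewing, despeckling, and line or stamp removal tuned to their documents.

```python
# A minimal sketch of pre-OCR image clean-up, assuming OpenCV (cv2) and
# pytesseract are installed and Tesseract is on the PATH. "scan.png" is a
# hypothetical input file.
import cv2
import pytesseract

# Load the scanned page in grayscale.
img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Reduce speckle noise, then binarize with an adaptive threshold so text
# stands out even when lighting or scan quality varies across the page.
denoised = cv2.medianBlur(img, 3)
binary = cv2.adaptiveThreshold(
    denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
)

# Run OCR on the cleaned image instead of the raw scan.
text = pytesseract.image_to_string(binary)
print(text)
```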

2. Visualized Machine Learning

A new approach to machine learning and classification sheds light on the complicated algorithms doing the heavy lifting. Solutions leveraging this technology provide a user interface that reveals what a model has learned from its training data in a way that is easy to understand, giving people a window into otherwise hidden algorithms.

The design philosophy behind this approach is that users understand their data better than anyone else, and that showing them how the A.I. is operating is both easier and produces better results than a "dark" machine learning model. This kind of transparency rests on the knowledge that a subject matter expert will always be able to make better decisions about data than "hidden" A.I.
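
As a rough illustration of that kind of transparency (not any particular vendor's framework), here's a sketch that trains a tiny text classifier with scikit-learn and surfaces the terms it weights most heavily for each document type, so a subject matter expert can sanity-check what the model has learned. The training snippets and labels are purely illustrative.

```python
# A minimal sketch of making a trained text model inspectable, assuming
# scikit-learn. The tiny training set and labels are illustrative only;
# the point is surfacing the terms the model weights per class so a
# subject matter expert can review them.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "explanation of benefits claim adjustment patient responsibility",
    "bill of lading shipment carrier freight weight",
    "promissory note principal interest rate borrower",
]
labels = ["eob", "logistics", "loan"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
model = LogisticRegression().fit(X, labels)

# For each class, show the highest-weighted terms -- a crude "window"
# into the model rather than a black box.
terms = np.array(vectorizer.get_feature_names_out())
for cls, coefs in zip(model.classes_, model.coef_):
    top = terms[np.argsort(coefs)[-3:][::-1]]
    print(cls, "->", ", ".join(top))
```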

3. OCR Enhancements

As previously mentioned, traditional OCR engines need help for maximum performance. Several key OCR innovations are at the core of document data integration platforms.

4. Fuzzy Regular Expressions

Regular expressions (RegEx) have roots in 1950s formal language theory and have been used to process text for decades. Modern data science tools have enabled a new kind of RegEx that allows for less literal character matches. Fuzzy RegEx enables true machine reading by providing a more organic understanding of text. This innovation works by "fuzzy matching" results against lexicons and external data sources using weighted accuracy thresholds. Machines can now return results that are "close to" what a user is searching for, which is extremely valuable for discovering data.
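
Here's a minimal sketch of the idea, assuming the third-party Python regex package (which extends the standard re module with approximate matching) and difflib from the standard library. The OCR'd line and the payer lexicon are hypothetical.

```python
# A minimal sketch of fuzzy matching over noisy OCR output, assuming the
# third-party "regex" package plus difflib from the standard library.
import difflib
import regex

ocr_line = "Expl4nation of Benef1ts  Payer: Aetna Llfe Insurance"

# Allow up to two character errors when looking for the phrase -- a literal
# re.search would miss it because of the OCR substitutions.
match = regex.search(r"(?:Explanation of Benefits){e<=2}", ocr_line)
print(match.group() if match else "no match")

# Snap a noisy payer name to a known lexicon using a similarity cutoff,
# keeping only candidates above the threshold.
payer_lexicon = ["Aetna Life Insurance", "Cigna Health", "United Healthcare"]
noisy_payer = "Aetna Llfe Insurance"
print(difflib.get_close_matches(noisy_payer, payer_lexicon, n=1, cutoff=0.8))
```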

5. Classification Engines

Automating document classification is a critical step for accurate data integration. In many real-world scenarios, documents are not stored in the proper sequence or separated by type. Humans have no problem looking at a document and understanding the context of the information. If we expect a machine to read and integrate data from documents, it needs a similar understanding of each document's intent.

Classification engines use machine learning or rules-based logic to recognize and assign a document type to a page, or a group of pages in a document.
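
Here's a minimal sketch of a hybrid engine along those lines: rules-based keyword checks handle the unambiguous cases, and a trained model (assumed to be a scikit-learn-style pipeline fitted elsewhere on labeled pages) handles the rest. The document types and keywords are illustrative.

```python
# A minimal sketch of a hybrid classification engine: rules first, then an
# optional machine-learning fallback. Document types and keywords are
# illustrative only.
RULES = {
    "eob": ["explanation of benefits", "patient responsibility"],
    "bill_of_lading": ["bill of lading", "carrier", "freight"],
}

def classify_page(page_text, fallback_model=None):
    """Assign a document type to one page of OCR'd text."""
    lowered = page_text.lower()
    # Rules-based pass: any keyword hit decides the type immediately.
    for doc_type, keywords in RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    # Machine-learning pass: defer to a trained classifier if provided.
    if fallback_model is not None:
        return fallback_model.predict([page_text])[0]
    return "unknown"

print(classify_page("EXPLANATION OF BENEFITS  Claim #123 ..."))  # -> "eob"
```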


