When Document Capture Intelligence Isn't So Intelligent

Written by Brad Blood | December 24, 2019

So your machine learning model seems accurate based on what you've fed it.

But what will happen when you feed it something new? And will the results be explainable?

You've heard of "black box" algorithms - mysterious AI that gives you results, but no idea of how those results were obtained.

So... what's going on behind the scenes? And what would it look like if you were to shine a light into the black box of your document capture system?

How Document Capture Really Works

Weightings play a huge role. In intelligent document processing, where you need to classify documents or identify contract clauses, provisional statements, or particular attributes, results are all tied to the frequency of features such as words or groups of words.

Want to know why a result was presented? Look at the weightings. As a machine learning model is trained, it assigns a numerical value to each feature, and those values determine what a document is and what's contained within it.
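To make the idea concrete, here is a minimal sketch of frequency-based weightings in Python. It is an illustration only, not how Grooper or any particular product works: the function names (train_weights, classify, explain), the smoothed log-ratio scoring, and the four tiny training documents are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_weights(labeled_docs):
    """Return {label: {word: weight}} from (text, label) training pairs.

    A weight is the log of how much more frequent a word is in one
    document type than in the training set as a whole (add-one smoothed).
    Positive means the word is evidence for that type; negative, against.
    """
    counts = defaultdict(Counter)  # word counts per label
    overall = Counter()            # word counts across every label
    for text, label in labeled_docs:
        words = text.lower().split()
        counts[label].update(words)
        overall.update(words)

    vocab_size = len(overall)
    overall_total = sum(overall.values()) + vocab_size
    weights = {}
    for label, counter in counts.items():
        label_total = sum(counter.values()) + vocab_size
        weights[label] = {
            word: math.log(((counter[word] + 1) / label_total)
                           / ((overall[word] + 1) / overall_total))
            for word in overall
        }
    return weights


def classify(text, weights):
    """Sum the weights of known words per label; return (best_label, scores)."""
    words = text.lower().split()
    scores = {
        label: sum(w[word] for word in words if word in w)
        for label, w in weights.items()
    }
    return max(scores, key=scores.get), scores


def explain(text, label, weights):
    """List each distinct known word with its weight for `label`, strongest first."""
    seen = set(text.lower().split())
    known = [(word, weights[label][word]) for word in seen if word in weights[label]]
    return sorted(known, key=lambda pair: pair[1], reverse=True)


# Hypothetical training set, invented for this sketch.
training_docs = [
    ("invoice number total amount due", "invoice"),
    ("invoice total payment due date", "invoice"),
    ("agreement between parties governing law", "contract"),
    ("contract termination clause governing law", "contract"),
]

weights = train_weights(training_docs)
new_doc = "total amount due on invoice"
label, scores = classify(new_doc, weights)
print(label, scores)
print(explain(new_doc, label, weights))
```

Because the weights are an ordinary dictionary, answering "why was this result presented?" is just a matter of printing them. A black box keeps the equivalent of that dictionary hidden.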

Here's a brief video that shows the benefit of exposing weightings in the Grooper software platform:

If this were a black box solution, you would be presented with the option to train (maybe), but certainly no options to see the behind-the-scenes weightings.

After training the machine learning model with a sample set of documents, you would expect good results on new documents the model has never seen.

But if your results were not acceptable, what would you do?

Sometimes too much training is too much of a good thing. Without seeing the training results, it is easy to over-train and end up with a model that fits its samples but performs poorly on new documents (under-training is also bad!).
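One common way to spot over-training, sketched below in Python, is to track accuracy on a held-out set of documents the model never trains on and watch for the point where it stops improving. The helper names (best_round, is_overtrained) and the accuracy figures are hypothetical, made up purely for illustration. They are not measurements from any real system.

```python
def best_round(validation_accuracy):
    """Index of the training round with the highest held-out accuracy.

    On ties, the earliest round wins.
    """
    return max(range(len(validation_accuracy)),
               key=lambda i: validation_accuracy[i])


def is_overtrained(train_accuracy, validation_accuracy):
    """True if held-out accuracy has fallen from its peak while
    training accuracy has not dropped since that peak."""
    peak = best_round(validation_accuracy)
    return (validation_accuracy[-1] < validation_accuracy[peak]
            and train_accuracy[-1] >= train_accuracy[peak])


# Made-up accuracy per training round, for illustration only.
train_accuracy = [0.70, 0.82, 0.90, 0.96, 0.99]
validation_accuracy = [0.68, 0.79, 0.85, 0.83, 0.78]

print("best round:", best_round(validation_accuracy))
print("over-trained:", is_overtrained(train_accuracy, validation_accuracy))
```

In these invented numbers, accuracy on the training samples keeps climbing while accuracy on unseen documents peaks and then falls. That widening gap is the signal you cannot see when training results are hidden.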

Why Some Document Capture Software is Very Weak

If you have purpose-built (a.k.a. black box) software performing document analysis, and wonder why its capabilities are so limited - the answer is simple: the software is only designed to work within a constrained set of parameters.

This means it's limited to specific document types, and will extract only the data it was explicitly trained on.

For some use cases, black box algorithms function just fine. However, as needs change and business outcomes demand deeper answers from new (or existing) datasets, finding a new tool may be in order.

What To Do if Your Software is Lackluster

Look for a solution with transparent AI - one that shows the weightings and results of the actual machine learning training.

This will give you the upper hand in intelligent document classification, and in extracting exactly the information you're looking for.