Advanced Imaging


Advanced Imaging Magazine

Updated: July 8th, 2008 05:26 PM CDT

The Transformation of Document Capture



May 2004

By Sameer Samat, Kofax CTO

Today, the document capture market is undergoing a metamorphosis. This change is driven by compelling new ways to use imaging within processes that are vitally important to businesses. For some time, imaging, as related to document capture, has been viewed primarily as a mechanism to archive content. In the last few years, several capture vendors have successfully moved document imaging from a largely archival application to a core piece of infrastructure, powering transaction-oriented business processes that originate in paper. Examples of such applications include processing invoices, proofs of delivery, and mortgage applications, as well as the full digitization of corporate mailrooms. The shift in focus of document imaging to these new business processes has created a need for applications to become more intelligent with respect to processing content, especially unstructured content.

There is a paradigm shift in the market to move beyond archival applications for document imaging systems. Systems need to automatically route documents to an appropriate workflow where information is extracted, as opposed to simply storing the document for future retrieval. In this context, it is becoming increasingly important for recognition technologies to extend traditional structured image processing techniques (registration points, zones, etc.) to analyze the text content inside these documents. Unstructured content, by definition, provides insufficient geometric information to accurately and efficiently perform document-type identification, sorting, and routing with image recognition techniques developed for structured forms processing. As imaging moves beyond archival purposes, applications will increasingly rely on text mining technologies, such as Text Classification, to provide the needed intelligence to automate the routing, identification, and sorting of unstructured content.

Even though Text Classification technology is extremely helpful in addressing the need for more intelligent document processing, there are a number of considerations and pitfalls to be aware of when implementing a solution. First, it is necessary to understand the basic steps in the process of Text Classification. Once a working knowledge of the basic steps has been established, a closer look at a number of options for configuring the "rules" or "intelligence" of the classification software reveals several potential pitfalls in deployment.

Overview of Text Classification

There are two basic high-level components to any text-classification system. First, there is a set of target categories to which a user would like to assign documents. These categories may be organized in a taxonomy structure with parent-child relationships, or they may be laid out in a flat non-hierarchical fashion. Second, there needs to be some "intelligence" contained in a classification "engine" that can determine the proper category for each document. Such intelligence can take the form of a fully manual process of people reading every document, or a fully automated process based upon a statistical analysis of the text contents of the document.
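To make the two components concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular capture product: the category names, the training snippets, and the choice of a multinomial naive Bayes model with add-one smoothing are all hypothetical stand-ins for the "set of target categories" and the statistical "engine" described above.

```python
import math
import re
from collections import Counter, defaultdict

# Component 1: target categories. This hypothetical taxonomy maps each
# category to its parent; None marks a top-level category. A flat layout
# would simply map every category to None.
TAXONOMY = {
    "finance": None,
    "invoice": "finance",
    "mortgage_application": "finance",
    "logistics": None,
    "proof_of_delivery": "logistics",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def category_path(category):
    """Return the taxonomy path from the top-level category down to `category`."""
    path = []
    while category is not None:
        path.append(category)
        category = TAXONOMY[category]
    return list(reversed(path))


# Component 2: the classification "engine". Assumption: a multinomial
# naive Bayes model stands in for the unspecified statistical analysis.
class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocabulary = set()

    def train(self, text, category):
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus add-one-smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training text, standing in for OCR output of sample documents.
engine = NaiveBayesClassifier()
engine.train("invoice number amount due payment terms net thirty", "invoice")
engine.train("invoice total tax amount remit payment", "invoice")
engine.train("delivered received by signature carrier shipment", "proof_of_delivery")
engine.train("borrower loan property mortgage interest rate applicant",
             "mortgage_application")

label = engine.classify("Please remit payment for the amount due on this invoice")
print(label, category_path(label))
```

In this sketch the engine only ever assigns leaf categories, and the taxonomy is consulted afterwards to recover the parent. A production system would need far more training text per category and some notion of confidence before routing a document automatically.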

The process of text classification can be described as performing the following general steps:
