The document discusses intelligent document and data capture. It defines document capture as converting paper documents to electronic images, while data capture extracts data from business forms. The five steps of the capture process are described as capture, classify/organize, extract, validate, and deliver. Technologies discussed for capture include optical character recognition (OCR), barcodes, handwriting recognition, and data mining. Future directions highlighted are increased cloud computing, security, data mining/classification, and mobility.
1. Learn What Is Intelligent Document and Data Capture
and Get Started
The Paperless Office…
Chasing the Impossible?
2. In a now-famous (or infamous) 1975
BusinessWeek article titled “The Office
of the Future,” technologists described
“The Paperless Office.”
3. “Vincent E. Giuliano of Arthur D. Little,
Inc., figures that the use of paper in
business for records and correspondence
should be declining by 1980, ‘and by 1990,
most record-handling will be electronic.’”
4. I think we can all agree that we’re
not there yet.
5. How about we agree that what we really
want is “The Nearly Paperless Office”?
6. The first part of any Document or
Content Management System is capture.
8. To keep it simple, let’s stick with the definition from AIIM
(Association for Information and Image Management),
a nonprofit serving information and image
professionals.
9. “Document capture and data capture are not the
same thing. Document capture is the conversion of a
paper document into an electronic image of that
document. Data capture extracts data from a
business form”.
14. So what is the capture process?
There are many models, from broad three-
step processes to more specific five-step
processes.
Let’s go with the five-step.
15. 1. Capture
Paper Sources: Captured with scanners or MFP devices.
Electronic Sources: Network directories, emails,
electronic forms, print streams,
faxes…anything made of 1s and 0s.
17. 2. Classify/Organize/Categorize
Identifying what the document or information is in
order to correctly process and deliver the document
and extract the information.
Invoice, Contract, Tax Form, Patient Record, ?
How should it be processed? Where should it be
routed and stored?
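The classify-then-route idea can be sketched with simple keyword rules. This is a minimal illustration, not how any particular capture product works: the keyword lists in `RULES` and the destination folders in `ROUTES` are hypothetical, and real classifiers typically use layout, barcodes, or trained models as well.

```python
# Hypothetical keyword rules per document type (assumed for illustration).
RULES = {
    "invoice": ["invoice", "amount due", "bill to"],
    "contract": ["agreement", "hereinafter", "party"],
    "tax_form": ["taxpayer", "withholding", "irs"],
    "patient_record": ["patient", "diagnosis", "date of birth"],
}

# Hypothetical destinations; unknown documents go to a person.
ROUTES = {
    "invoice": "accounts_payable/",
    "contract": "legal/",
    "tax_form": "finance/tax/",
    "patient_record": "medical_records/",
    "unknown": "manual_review/",
}


def classify(text):
    """Return the document type whose keywords appear most often, else 'unknown'."""
    lowered = text.lower()
    # Crude substring matching: counts how many keywords of each type occur.
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in RULES.items()
    }
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    return best_type if best_score > 0 else "unknown"


def route(text):
    """Map recognized text to the folder where the document should be stored."""
    return ROUTES[classify(text)]
```

Anything that matches no rule falls through to `manual_review/`, which mirrors the "?" case: a document the system cannot identify still needs a defined path.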
18. 3. Extract or Mine
Capturing data for the index or other purposes.
May be data such as
customer number, freight
tracking number, invoice
number, supplier name
etc.
Or, full-text indexing may
be required, where all
text on the documents
is captured. See What
is Document Indexing.
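To make full-text indexing concrete, here is a minimal sketch of an inverted index: every word in the recognized text points back to the documents that contain it. The function names and the sample file names are assumptions for illustration; production search engines add stemming, ranking, and phrase handling.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it.

    `documents` is a dict of {document id: recognized text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

By contrast, field-level indexing stores only a few chosen values (customer number, invoice number, supplier name) rather than every word.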
19. 4. Validate
Using technology or manual inspection to ensure that
a document is classified and processed correctly.
20. 4. Validate
With technology this may mean automatically validating
against data sources or employing business rules.
For instance, if an inventory item number should consist of three letters
followed by five digits, any document not
following that scheme may be tagged for manual inspection
before further processing is done.
PEN21096
CAP36581
INV98453
PA568793
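The rule above (three letters followed by five digits) maps directly onto a regular expression. The sketch below is one way to express it; the function name `needs_manual_review` is an assumption, and the pattern accepts both upper- and lowercase letters, which the slide does not specify.

```python
import re

# Three letters followed by exactly five digits, nothing else.
ITEM_PATTERN = re.compile(r"[A-Za-z]{3}[0-9]{5}")


def needs_manual_review(item_number):
    """Return True when the value does not follow the expected scheme."""
    return ITEM_PATTERN.fullmatch(item_number) is None


items = ["PEN21096", "CAP36581", "INV98453", "PA568793"]
flagged = [item for item in items if needs_manual_review(item)]
print(flagged)  # ['PA568793']
```

Only `PA568793` is flagged: it has two letters and six digits, so it would be held for inspection while the other three continue through the process.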
21. 5. Deliver or Integrate
…to or with a search and retrieval or content
management system.
Obviously, without a way to
locate documents or data, a capture
system is useless.
22. 5. Deliver or Integrate
Henry Schein, Dentrix, Dentrix Enterprise, Dentrix Ascend, Easy Dental, Viive, DentalVision, axiUm
Often index information is sent to the document
management system via an XML or CSV file where it can be
made immediately available to the user.
Systems such as SharePoint, Epic, Laserfiche and other
ECM, EMR and EHR systems have various ways of accepting
data feeds.
CSV or XML → Filenet, Laserfiche, Documentum, MyMedicalRecords, Eaglesoft, Allscripts, Epic, Dentrix
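As a rough sketch of the delivery step, the same index records can be written out as CSV or as XML using only the Python standard library. The field names in `FIELDS` and the XML element names are hypothetical: each document management system defines its own import format, so a real feed would follow that system's documentation.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical index fields (assumed for illustration).
FIELDS = ["file_name", "document_type", "invoice_number", "supplier_name"]


def to_csv(records):
    """Serialize index records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records):
    """Serialize index records as XML, one <document> element per record."""
    root = ET.Element("documents")
    for record in records:
        doc = ET.SubElement(root, "document")
        for field in FIELDS:
            ET.SubElement(doc, field).text = record[field]
    return ET.tostring(root, encoding="unicode")
```

Writing either string to a file in a folder the target system imports from is a common hand-off, though the exact mechanism varies by product.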
26. Use Barcodes to …
• Split Files
• Classify Documents
• Route Files
• Index
• Name Files
• Bookmark PDFs
Learn more at What Can Barcodes Do For Me?
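Splitting and naming files from barcodes can be sketched as follows. The example assumes a barcode reader has already decoded each page and produced one value per page (`None` where no barcode was found); that input format and the function name are assumptions, and no real barcode library is called here.

```python
def split_on_barcodes(page_barcodes):
    """Group page indexes into documents.

    Each page carrying a barcode starts a new document, and the barcode
    value is used as the file name. Pages before the first barcode are
    collected into a document named 'unidentified.pdf'.
    """
    documents = []
    for page_index, barcode in enumerate(page_barcodes):
        if barcode is not None:
            # A barcode page opens a new document named after its value.
            documents.append({"name": f"{barcode}.pdf", "pages": [page_index]})
        elif documents:
            # A plain page belongs to the most recently opened document.
            documents[-1]["pages"].append(page_index)
        else:
            documents.append({"name": "unidentified.pdf", "pages": [page_index]})
    return documents
```

For a five-page scan decoded as `["INV98453", None, None, "CAP36581", None]`, this yields two documents: pages 0 to 2 as `INV98453.pdf` and pages 3 to 4 as `CAP36581.pdf`.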
27. OCR is another mature data capture technology to...
• Digitize text images so that they can be electronically
edited, searched, and stored
• Make image-based files fully text-searchable or extract
data from a zone for indexing
• Identify document areas for automatic OCR capture
(zonal OCR)
• Drag-and-drop highlighted document text which is
automatically OCR'd and dropped into index fields (drag
and drop OCR or rubber band OCR)
• Use extracted data to split, name, route, validate, etc.
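Zonal OCR can be illustrated without an OCR engine by assuming the engine has already returned each recognized word with its position on the page. The `Word` and `Zone` types and the coordinates below are hypothetical; the sketch only shows the zone lookup, not the recognition itself.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A recognized word and the page coordinates of its top-left corner."""
    text: str
    x: int
    y: int


class Zone(NamedTuple):
    """A rectangular page area defined in the same coordinates as Word."""
    left: int
    top: int
    right: int
    bottom: int


def text_in_zone(words, zone):
    """Return the words whose position falls inside the zone, in reading order."""
    inside = [
        word for word in words
        if zone.left <= word.x <= zone.right and zone.top <= word.y <= zone.bottom
    ]
    # Sort top-to-bottom, then left-to-right.
    inside.sort(key=lambda word: (word.y, word.x))
    return " ".join(word.text for word in inside)
```

A template for a given form would define one zone per index field, so the value found in the "invoice number" zone can be used to name, route, or validate the file.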
28. Other Recognition Technologies For Data
Capture
ICR (Intelligent Character
Recognition)
• Handwriting recognition
• Not as accurate as OCR, limited role in some capture systems
OMR (Optical Mark Recognition)
• Capturing human-marked data from document forms such as
surveys and tests.
• Like ICR, lower accuracy, limited application within data capture
Forms Recognition
• Uses BCR, OCR, ICR and OMR in a structured data capture format
• Typically templates are designed to instruct the capture software
where to look for information and how to process the information
29. Data or Text Mining
(Often using Regular Expressions (regex))
A fast and powerful method to search, extract and
replace specific data found within scanned documents.
30. • Essentially a special text string for
describing a search pattern.
• Extremely flexible and patterns can be
constructed to match almost anything.
• Use data identified with regex to
classify, split, name and route files.
Learn more at Using Regular Expressions for Automated Data Capture and Extraction.
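A short sketch of regex-based extraction: the pattern below looks for an invoice number in recognized text and uses it to name the file. The pattern, the `INV` prefix, and the fallback name are assumptions chosen for illustration; real documents usually need patterns tuned to their own wording.

```python
import re

# "Invoice", an optional label (No, No., Number, #), an optional colon,
# then the digits we want to capture.
INVOICE_NUMBER = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)?\s*:?\s*(\d+)",
    re.IGNORECASE,
)


def extract_invoice_number(text):
    """Return the first invoice number found in the text, or None."""
    match = INVOICE_NUMBER.search(text)
    return match.group(1) if match else None


def file_name_for(text):
    """Name the file after its invoice number, or mark it unclassified."""
    number = extract_invoice_number(text)
    return f"INV{number}.pdf" if number else "unclassified.pdf"
```

The same extracted value could equally drive classification (a match means "invoice") or routing, which is why one pattern often serves several steps of the capture process.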
31. Batch Document Processing
…simply processing a large volume of
documents, generally scanned into a few files
or one file, using intelligent
capture software.
Some products process folders of
documents on demand or “watch”
folders for files to process.
Learn more at What is Batch Document Processing?
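Both modes, on demand and "watched", can be sketched with the standard library. This is a simplified polling approach under stated assumptions: the folder layout, the `handler` callback, and the `max_cycles` parameter are hypothetical, and commercial products may rely on operating-system file notifications instead of polling.

```python
import time
from pathlib import Path


def process_folder(inbox, processed, handler):
    """On-demand mode: run handler on each PDF in inbox, then move it away.

    Moving the file to `processed` ensures it is not picked up twice.
    Returns the names of the files that were handled.
    """
    inbox, processed = Path(inbox), Path(processed)
    processed.mkdir(parents=True, exist_ok=True)
    handled = []
    for path in sorted(inbox.glob("*.pdf")):
        handler(path)
        path.rename(processed / path.name)
        handled.append(path.name)
    return handled


def watch(inbox, processed, handler, interval_seconds=5.0, max_cycles=None):
    """Watched mode: poll the inbox repeatedly.

    Runs forever when max_cycles is None; a number limits the loop,
    which is mainly useful for testing.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        process_folder(inbox, processed, handler)
        cycles += 1
        time.sleep(interval_seconds)
```

The `handler` is where the earlier steps would plug in: classify, extract, validate, and deliver each file as it arrives.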
32. Image Enhancement
• Adaptive thresholding
• Deskew
• Despeckle
• Remove blank pages or
separator sheets
• Auto rotate
• Remove lines
To improve usability and increase accuracy of OCR and other
recognition technologies, image enhancement is required.
Learn more at Improving OCR Accuracy with Cleanup and Enhancement.
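Two of the cleanup steps listed above, despeckle and blank page removal, can be sketched on a tiny black-and-white image held as a list of rows, where 1 is a dark pixel and 0 is white. The representation, the isolated-pixel rule, and the `max_ink_ratio` threshold are simplifying assumptions; imaging toolkits use more robust filters on real scans.

```python
def despeckle(image):
    """Clear dark pixels that have no dark pixel among their 8 neighbours."""
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            has_dark_neighbour = any(
                image[nr][nc] == 1
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
                if (nr, nc) != (r, c)
            )
            if not has_dark_neighbour:
                cleaned[r][c] = 0  # isolated speck: treat as noise
    return cleaned


def is_blank_page(image, max_ink_ratio=0.01):
    """Treat a page as blank when at most max_ink_ratio of its pixels are dark."""
    total_pixels = sum(len(row) for row in image)
    dark_pixels = sum(sum(row) for row in image)
    return dark_pixels / total_pixels <= max_ink_ratio
```

Running `despeckle` before `is_blank_page` matters: a blank page with a few specks of scanner noise would otherwise be kept and passed on to OCR.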
34. Cloud Computing
Increased cloud computing will bring easily
accessible resources and repositories for
documents.
See Docs in the Clouds.
“The use of cloud computing is growing,
and by 2016 this growth will increase to
become the bulk of new IT spend.”
Gartner, Inc. Oct. 2013
35. Security Focus
Couple the increasing number of documents being
stored with the growing ways to access them, and
security concerns will continue to increase.
36. Improved Data Mining and
Classification
The increased use of data mining and better
classification will increase OCR demands and
lower the use of barcodes and separator pages.
37. Increased Mobility
Increased mobility demands in business impact
all information technology. Users want all
information available from all platforms, no
matter when or where.