ABBYY Compreno
Driving Impact from
Unstructured Information
Analytics
<NAME>
<DATE>
ABBYY Worldwide
Global
16 offices with more than 1,250 employees
in Europe, USA, Asia, Australia and Russia
Innovative
27% of revenue invested in R&D,
more than 400 developers and scientists
Reliable
Connected
Trusted partner to over 1000 companies in
more than 150 countries around the world
Successful
More than 40 million software users process more than
9 billion pages per year with ABBYY products
Enabling
Recognise, capture, (translate), analyse –
we transform information into action
Strong and independent core technology that
evolves with the needs of the digital revolution
Digital Universe
2.5 exabytes of data generated every day = 2.5 million terabytes = 2.5 × 10^18 bytes
(source: Northwestern University, 2016)
→ The majority (approx. 80%) is unstructured
1.4 × 10^14 Word pages
3.5 × 10^13 PPT slides
2 × 10^13 PDF pages (image & text)
2 × 10^14 emails
4 × 10^13 scanned pages
3 × 10^13 images (.tiff)
1.4 × 10^16 .txt files
(source for average file sizes: netdocuments.com, 2016)
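The unit conversion and the item counts above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal (SI) units, and the per-item sizes it prints are back-calculated from the slide's own counts rather than taken from the cited source.

```python
# Back-of-the-envelope check of the slide's figures (decimal SI units assumed).
EXABYTE = 10**18
TERABYTE = 10**12

daily_bytes = 2.5 * EXABYTE
daily_terabytes = daily_bytes / TERABYTE  # 2.5 million TB

# Item counts quoted on the slide for one day's data volume.
equivalents = {
    "Word pages": 1.4e14,
    "PPT slides": 3.5e13,
    "PDF pages": 2e13,
    "emails": 2e14,
    "scanned pages": 4e13,
    "TIFF images": 3e13,
    ".txt files": 1.4e16,
}


def implied_size_kb(item_count: float) -> float:
    """Average size per item in kB, implied by dividing the daily volume by the count."""
    return daily_bytes / item_count / 1000


print(f"{daily_terabytes:,.0f} TB per day")
for name, count in equivalents.items():
    print(f"{name}: about {implied_size_kb(count):.2f} kB each")
```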
Reports, brochures, datasheets, presentations,
research documents, service documents,
pricelists, process descriptions, project
descriptions, product feature specifications,
customer communication, accident/security
reports, contracts, email, web texts, articles in
magazines, complete intranets …
Unstructured Content I
What do unstructured documents have in common?
● They are composed in natural language
What is the problem with natural language?
● Complex to analyse and summarise
● Has a structure, but not a standardized one (different people use different terms, expressions and syntax to talk about the same thing)
● Content is unpredictable and cannot be processed with fixed rules
● Limited/no metadata
Unstructured Content II
● The computer does not know what the document is about and there is no source to
get this information from
● Information is “locked” within documents
● Information may be valuable, confidential, business-critical or defensibly deletable, yet it is difficult to find and manage
→ There is no business value in content that can’t be analysed or found
Natural language requires dedicated processing technology
ABBYY Compreno
Confidential
What is it? Natural Language Processing (NLP) technology
What does it do? Advanced automated text analysis
● Gathers information about a document from the document
● Understands meaning of words within context
● Reveals relationships between words
● Builds stories across documents
● Extracts insights and intelligence from unstructured text
How Compreno works
Key Components
Semantics
Semantic analysis is used
to interpret syntactic structures
in terms of universal,
language-independent
concepts and their relations.
Syntax
Identifies formal relations among
words in a sentence or across
several sentences.
The system analyzes a text
and builds a tree of syntactic
relations.
Statistics
Data gleaned from parallel and monolingual corpora are
used for training the analysis algorithms and verifying and
expanding the formal descriptions available to the system.
ABBYY Compreno
Platform for document understanding
Core uses of Compreno technology
● Classify unstructured documents
● Identify and extract entities, facts and
events from texts
What is classification?
To go from …
… to
Mammals Birds Reptiles Fish
Categorisation based on particular shared features
How document classification works
Three main steps
Training
Set up model, define categories,
select/collect training documents, train
model, choose best algorithm
Test and tune
Analyse test results, eliminate
mistakes, adjust training set,
retrain model
Classification
Deploy model to
production, classify
documents
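The three steps can be sketched end to end with a toy classifier. The example below swaps in a plain multinomial Naive Bayes model written from scratch, which is an assumption made purely for illustration; the slides do not say which algorithms Smart Classifier uses, and the sample texts are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def train(self, labelled_docs):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of documents
        for text, category in labelled_docs:
            self.doc_counts[category] += 1
            self.word_counts[category].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocabulary:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Invented sample documents.
training_set = [
    ("invoice total amount due payment", "invoice"),
    ("payment due invoice number amount", "invoice"),
    ("agreement between parties contract term", "contract"),
    ("contract parties obligations term termination", "contract"),
]
control_set = [
    ("amount due on this invoice", "invoice"),
    ("termination of the contract by the parties", "contract"),
]

# 1. Training
model = TinyNaiveBayes()
model.train(training_set)

# 2. Test and tune: check results on the control set before deploying
correct = sum(model.classify(text) == category for text, category in control_set)
accuracy = correct / len(control_set)

# 3. Classification of a new, unseen document
print(accuracy, model.classify("please find the invoice for the payment"))
```

If the control-set accuracy were too low, the tuning step would mean adjusting the training set and retraining before moving on to step 3.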
Document classification – Why?
Essential step in information management
Enable advanced analysis and decision-making
Generate business value
Why is classification not as easy as it seems?
Building up a reliable classification workflow is difficult…
Big Content
Technical challenges
- Big training sets
- Complex algorithms
- Difficult to integrate
Business challenges
- Traditional classification methods don’t do the job
- High investments for building and maintaining the rule sets and classification schemes required (classification expert knowledge)
New, dedicated processing methods required
Unstructured documents
ABBYY Smart Classifier
● Text classification module for organising unstructured documents
● Assign unseen documents to predefined categories based on statistical,
morphological and semantic analysis
● Uses supervised machine learning to produce a classification model from sample
inputs
● Classification creates metadata derived from the document context
Next generation document classification
Unstructured information processing
● Unlock information
● Make content searchable, accessible and retrievable
Automated classification
● High speed
● Constant quality
● No manual work
Semantic-based classification
● Deep text analysis techniques employed for even more accurate classification
Smart Classifier features and values
Machine learning
● System learns automatically based on the training documents
● No particular knowledge required to set up classification
● No specification of rules necessary
● Small training sets
Automatic algorithm optimisation
● Selection of the best-performing algorithm for each document set
Smart Classifier features and values
Simple UI
● No specific knowledge required to create a model, train the system and launch a
classification workflow
Input document formats and languages
● Process content regardless of original format
● OCR for processing of images
● 39 classification languages
IT Integration of Smart Classifier
Leverage existing systems and infrastructure
Smart Classifier Workflows
Create and deploy classification model
01 | Category definition and selection of sample documents
02 | Setup of classification model
03 | Model training
04 | Model testing, quality evaluation and tuning
05 | Deployment to production
Document classification workflow
01| Category definition and selection of sample documents
● Category = a group of documents that have particular shared features
● Category definition is a management decision, no special IT skills required
● Content and process experts select representative documents for each category
● Minimum: 10 documents per category
● For reliable statistics: approx. 100 documents per category
● Representative sample of documents
● Documents must be typical for category: The more representative of the respective
category a document is, the better the model will perform (garbage in, garbage out).
● Proportion of docs assigned to each category should be the same as in the collection of
documents to be classified
● Smart Classifier accepts many formats (plain text, Office, HTML, XML, PDF); image formats are passed to OCR first
● Folder structure: Each (sub-)category = dedicated (sub-) folder
● Create training set and control set and save them as ZIP files
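The folder-per-category layout and the two ZIP files can be prepared with the standard library. This is a minimal sketch under assumptions: the file names `training_set.zip` and `control_set.zip`, the 20% control share and the helper name `split_and_zip` are all hypothetical, and the slides do not specify how the split should be made. Splitting inside each category keeps the category proportions the same in both sets.

```python
import random
import zipfile
from pathlib import Path


def split_and_zip(source_dir, out_dir, control_share=0.2, seed=0):
    """Split each category folder into training and control files and zip both sets.

    Expects source_dir/<category>/<documents...>; sub-folders are kept as they are.
    """
    source_dir, out_dir = Path(source_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_zip = out_dir / "training_set.zip"
    control_zip = out_dir / "control_set.zip"
    with zipfile.ZipFile(train_zip, "w") as train, \
            zipfile.ZipFile(control_zip, "w") as control:
        for category_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
            files = sorted(f for f in category_dir.rglob("*") if f.is_file())
            rng.shuffle(files)
            n_control = max(1, round(len(files) * control_share))
            for index, path in enumerate(files):
                target = control if index < n_control else train
                # Store the path relative to source_dir so the category folder survives.
                target.write(path, arcname=path.relative_to(source_dir).as_posix())
    return train_zip, control_zip
```

With ten documents in each of two category folders, this would put two documents per category into the control archive and eight into the training archive.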
02| Setup of classification model
● The Classification Model defines how and by which categories document
classification will be performed.
● Model creation via Model Editor web UI or REST API (code samples included in
documentation)
● Set parameters
● Document language (39 languages supported)
● Category assignment (what category will be assigned to the document if more than one was
returned as candidate category)
● Quality criteria (trade-off between precision and recall)
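One way to picture the category assignment and quality parameters is a confidence threshold applied to the candidate categories. The function below is a hypothetical sketch, not the product's actual setting: the name `assign_category`, the `min_confidence` parameter and the "unclassified" fallback are assumptions.

```python
def assign_category(candidates, min_confidence=0.5, fallback="unclassified"):
    """Pick one category from (category, confidence) candidate pairs.

    The highest-scoring candidate wins, but only if it reaches min_confidence.
    """
    if not candidates:
        return fallback
    category, confidence = max(candidates, key=lambda pair: pair[1])
    return category if confidence >= min_confidence else fallback


candidates = [("invoice", 0.7), ("contract", 0.6)]
print(assign_category(candidates))                      # lenient threshold
print(assign_category(candidates, min_confidence=0.8))  # strict threshold
```

Raising the threshold tends to favour precision, because fewer doubtful documents receive a category; lowering it tends to favour recall.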
02| Setup of classification model
Model Editor web interface
03| Model training
● Load training documents
● Train classification model
● Machine learning
● The system automatically identifies and uses the most relevant features from the training documents for creating the classification model
04| Model testing, quality evaluation and tuning
● Load and test control set to determine whether training process was successful
● Classification results in control set must meet expectations before model can be deployed
● Model Editor provides instant visibility of each document within a classification
project
● Source text and key words picked by the algorithms can be analysed and checked
● Terms that should be ignored during classification can be added to a stop word list
● Analyse: F-measure, precision, recall
● Debug: Confidence level, selected keywords
● Adjust: Inclusiveness, stop words, documents in classes (re-assign category)
● Upload further training/control documents
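The three quality figures named above follow standard definitions and can be computed per category from the control set. The sketch below uses the balanced F-measure (F1); whether Model Editor reports exactly this variant is an assumption, and the labels are invented.

```python
def precision_recall_f1(expected, predicted, category):
    """Precision, recall and F1 for one category, from parallel label lists."""
    pairs = list(zip(expected, predicted))
    true_pos = sum(e == category and p == category for e, p in pairs)
    false_pos = sum(e != category and p == category for e, p in pairs)
    false_neg = sum(e == category and p != category for e, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented control-set labels: what the experts assigned vs. what the model returned.
expected = ["invoice", "invoice", "invoice", "contract", "contract"]
predicted = ["invoice", "invoice", "contract", "contract", "invoice"]
print(precision_recall_f1(expected, predicted, "invoice"))
```

Precision is the share of documents assigned to a category that really belong there; recall is the share of documents belonging to a category that the model actually found.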
05| Deployment to production
● When the model is deployed, it becomes available via the Compreno REST API
● If you make changes to the model, it needs to be retrained for the changes to take effect
Document classification workflow
Once the system is set up and a classification model is published for operation, incoming
classification tasks will be accepted
01| A new document classification task is created
02| The document is converted into an internal format
03| The document is classified
04| The document classification results are saved
05| The task is completed
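The five steps above amount to a small task lifecycle. The sketch below models it with hypothetical names (`Step`, `ClassificationTask`, `run_task`); the real conversion and classification are internal to the product, so decoding bytes to text and a caller-supplied `classify` function stand in for them here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Step(Enum):
    CREATED = 1
    CONVERTED = 2
    CLASSIFIED = 3
    SAVED = 4
    COMPLETED = 5


@dataclass
class ClassificationTask:
    raw_document: bytes
    text: str = ""
    category: str = ""
    steps: list = field(default_factory=lambda: [Step.CREATED])  # 01: task created


def run_task(task, classify, result_store):
    """Walk one task through steps 02 to 05.

    `classify` is any callable mapping text to a category and `result_store`
    is any dict-like object; both are placeholders supplied by the caller.
    """
    # 02: convert to an internal format (here simply decoded text)
    task.text = task.raw_document.decode("utf-8", errors="replace")
    task.steps.append(Step.CONVERTED)
    # 03: classify the document
    task.category = classify(task.text)
    task.steps.append(Step.CLASSIFIED)
    # 04: save the classification result
    result_store[id(task)] = task.category
    task.steps.append(Step.SAVED)
    # 05: mark the task as completed
    task.steps.append(Step.COMPLETED)
    return task
```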
Document classification workflow
● Classification results in Model Editor
Smart Classifier application scenarios
Enterprise content management and its subdomains
Archiving, records management (Information Governance), document management,
enterprise search
● Classification of incoming and stored documents
● Definition of category-based access rights and retention policies
● Search enhancement
Smart Classifier application scenarios
Information lifecycle
Create → Capture → Manage → Store → Archive → Dispose
Classification of incoming documents
Add documents to the system that have a value, i.e. are enhanced with metadata
Classification for aid in risk mitigation
Category-based document access rights
Category-based disposal policy
Classification for aid in compliance
Category-based retention policy
Classification to improve enterprise search systems
Add class to search index
Category-based routing and distribution
Post-process
• Classification for metadata correction
• Classification of legacy content for data improvement
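Once a document carries a category, access rights, retention and routing can all be looked up from that one value. The sketch below illustrates the idea; the categories, retention periods, group names and queue names are invented for illustration and are not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    retention_years: int
    access_groups: frozenset
    route_to: str


# Hypothetical categories and values, for illustration only.
POLICIES = {
    "contract": Policy(10, frozenset({"legal", "management"}), "legal-inbox"),
    "invoice": Policy(7, frozenset({"finance"}), "accounts-payable"),
}
DEFAULT_POLICY = Policy(1, frozenset({"records"}), "manual-review")


def policy_for(category):
    """Return the policy attached to a category, or the default for unknown ones."""
    return POLICIES.get(category, DEFAULT_POLICY)


def may_read(user_groups, category):
    """A user may read a document if they share at least one group with its policy."""
    return bool(policy_for(category).access_groups & set(user_groups))
```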
Smart Classifier application scenarios
Data migration
● Organise content before, during or after migration
Client support
● Category-based prioritisation and routing of client issues shorten response times
eDiscovery
● Quickly gather and prepare documents
Mailroom
● Automatically select the most suitable processing workflow
E-mail management
● Additional metadata facilitates and accelerates routing
Smart Classifier benefits
For all enterprises
Create access to information
Efficient information management
Aid compliance & risk mitigation
Cost efficiency
Smart Classifier benefits
For ISVs
Create better customer applications
Quick ROI
Smart Classifier benefits
For BPOs
Accelerate business processes
Easier cost calculation
ABBYY Compreno
Platform for document understanding
Core uses of Compreno technology
● Classify unstructured documents
● Identify and extract entities, facts and
events from texts
ABBYY InfoExtractor SDK
● Information extraction module for processing natural language texts
● Natively processes unstructured documents and accesses the embedded textual
information
● Identifies different facts, entities and the relationships between them
● Automatically extracts critical data
● Combines related data into facts
How InfoExtractor works I
From text to semantics
Lexical analysis: Convert a sequence of characters into a sequence of words
Morphological analysis: Analyse the structure of words and parts of words
Syntactic parsing: Determine the structure of the input text; understand how concepts relate to one another within one or more sentences
Semantic parsing: Contextual analysis = obtaining and representing the meaning of a sentence
Derive the meaning of a sentence by understanding the context and the “speaker's” intent.
Universal Semantic Hierarchy: Language-independent hierarchy of concepts to reflect the meaning and relations of words and sentences
An ontology is a formal representation of concepts and the relationships between those concepts.
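The two lowest layers, lexical and morphological analysis, are simple enough to sketch in a few lines. The example below is a deliberately crude toy: a regular-expression tokenizer and a three-entry suffix list. It does not reflect how Compreno works internally, and real morphological analysis relies on dictionaries and language-specific rules.

```python
import re

SUFFIXES = ("ing", "ed", "s")  # tiny hand-picked list, English only


def lexical_analysis(text):
    """Convert a sequence of characters into a sequence of word tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def morphological_analysis(word):
    """Split a word into (stem, suffix) by stripping one known suffix, if any."""
    lower = word.lower()
    for suffix in SUFFIXES:
        # Require at least three remaining letters so short words stay intact.
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[: -len(suffix)], suffix
    return lower, ""


tokens = lexical_analysis("It is preparing reports.")
print(tokens)
print([morphological_analysis(token) for token in tokens])
```

The stems produced this way are often not real words ("prepar"), which is one reason the higher layers need proper linguistic descriptions rather than string rules.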
How InfoExtractor works II
Identify relationships between words
Connect entities with other entities and facts, even if the words that define them are replaced with pronouns or omitted in the text
Example: The company has denied reports it is preparing to default on its loans if it cannot reach agreement on its bailout terms with international creditors
Get the complete story
How InfoExtractor works III
Define the contextual meaning of a word
Gather only relevant facts
How InfoExtractor works IV
Detect omitted words
Example: Some people work with PDF documents but not all employees do.
Don’t miss any valuable facts
InfoExtractor features and values
Natural Language Processing
● Understand the meaning of words and relations between them
Extraction of entities and events
● Extract the facts and story lines embedded in unstructured information
● Persons, organisations, dates
● Deals, purchases, employment details
Identify relationships between entities and events
● Contracting parties, subject of the contract, financial figures
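Extracted entities and the facts that connect them can be pictured as a small data model. The classes and sample values below are hypothetical and only illustrate the shape of such output; they are not the InfoExtractor SDK's actual types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str  # e.g. "Organisation", "Person", "Date", "Money"
    text: str


@dataclass
class Fact:
    kind: str  # e.g. "Contract", "Employment"
    roles: dict = field(default_factory=dict)  # role name -> Entity
    source: str = ""  # document the fact was found in


def facts_involving(facts, entity):
    """Return every fact in which the given entity fills any role."""
    return [fact for fact in facts if entity in fact.roles.values()]


# Invented sample data.
supplier = Entity("Organisation", "Example Supplies Ltd")
buyer = Entity("Organisation", "Sample Retail GmbH")
facts = [
    Fact("Contract",
         {"party_1": supplier, "party_2": buyer,
          "subject": Entity("Goods", "office furniture"),
          "amount": Entity("Money", "EUR 25,000")},
         source="contract.pdf"),
    Fact("Employment",
         {"employer": buyer, "employee": Entity("Person", "J. Doe")},
         source="cv.docx"),
]
print([fact.kind for fact in facts_involving(facts, buyer)])
```

Looking up all facts that share an entity is what allows a story to be followed across several documents.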
InfoExtractor features and values
Basic and custom ontologies
● Basic ontologies including widely used words
● Custom ontologies for industry solutions
Customized entities for specific cases
● Custom ontology dictionaries to extract complicated examples of entities (e.g. Asian
names or companies)
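An ontology with a custom dictionary can be approximated by two lookup tables: one linking each concept to its parent, one linking surface terms to concepts. The concept names and terms below are invented for this sketch and are not taken from Compreno's ontologies.

```python
# Hypothetical concept names; child concept -> parent concept.
PARENT = {
    "Bank": "Organisation",
    "Company": "Organisation",
    "Organisation": "Entity",
    "Person": "Entity",
}

# Custom dictionary: surface form (lower-cased) -> concept.
CUSTOM_TERMS = {
    "international creditors": "Organisation",
    "example bank": "Bank",
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or descends from it in PARENT."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def find_entities(text, wanted="Entity"):
    """Return (term, concept) pairs found in the text whose concept falls under `wanted`."""
    lowered = text.lower()
    return [(term, concept) for term, concept in CUSTOM_TERMS.items()
            if term in lowered and is_a(concept, wanted)]
```

Adding an industry-specific term then only means adding a dictionary entry that points at an existing concept.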
Input document formats and languages
● Work with text regardless of source
● English, Russian, German
● OCR embedded for image processing
IT Integration of InfoExtractor
The information extraction process
InfoExtractor application scenarios
Contract Management
● Use Case: Mass contract ingestion
● Document Type: Contract
● Customer: ISVs, Service Providers
● Benefit: Extend service offering & increase revenues
Customer On-Boarding
● Use Case: Capture & upload customer information at point of entry into the system
● Document Type: Statutory documents, contracts
● Customer: Banks, insurance companies
● Benefit: Accelerate document processing
InfoExtractor application scenarios
Applicant Tracking
● Use Case: Tag and upload CVs to improve search
● Document Type: CV
● Customer: HR departments
● Benefit: Minimise resources required to process all the necessary CVs
Credit Risk Mitigation
● Use Case: Decide on providing loans; check various sources of information on potential loan customers.
● Document Type: Contracts, statutory documents, court decisions
● Customer: Banks
● Benefit: Accelerate document processing
InfoExtractor benefits
Get decision-critical information at lower cost and with less effort
Intelligence and insights
Aid predictive decision making
Uncover hidden risks
Cost efficiency
Use analytics to create new value out of existing and new data
Get the big picture by connecting entities, facts and events across documents
Accelerate and automate content upload and analysis to optimise manual processes
Take critical decisions faster based on relevant information
Summary
Good classification and information extraction let organisations solve tasks they are not capable of solving at the moment
Smart Classifier and InfoExtractor make document classification and information extraction simple
Licensing
● Smart Classifier and InfoExtractor are available for testing via a time- and volume-limited trial license
● Different license models
● Perpetual with software maintenance
● Subscription (yearly)
● OEM licensing
● Standard license model based on renewable peak volume
● Backend can be scaled up
<Name>
<Name>@abbyy.com
ABBYY Europe GmbH
Elsenheimerstraße 49
80687 Munich
Germany
www.abbyy.com
Thank You

 

Ähnlich wie Introducing Compreno - Natural Language Processing Technology (20)

DU Series - Day 4.pptx
DU Series - Day 4.pptxDU Series - Day 4.pptx
DU Series - Day 4.pptx
 
Introduction to Microsoft Syntex
Introduction to Microsoft SyntexIntroduction to Microsoft Syntex
Introduction to Microsoft Syntex
 
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.
 
UiPath Document Understanding_Day 3.pptx
UiPath Document Understanding_Day 3.pptxUiPath Document Understanding_Day 3.pptx
UiPath Document Understanding_Day 3.pptx
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
DU_SERIES_Session1.pdf
DU_SERIES_Session1.pdfDU_SERIES_Session1.pdf
DU_SERIES_Session1.pdf
 
Resume
ResumeResume
Resume
 
AdityaSharma_Analyst.doc
AdityaSharma_Analyst.docAdityaSharma_Analyst.doc
AdityaSharma_Analyst.doc
 
OpenKM commercial
OpenKM commercialOpenKM commercial
OpenKM commercial
 
Data modelling tool in CASE
Data modelling tool in CASEData modelling tool in CASE
Data modelling tool in CASE
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
 
Getting started with with SharePoint Syntex
Getting started with with SharePoint SyntexGetting started with with SharePoint Syntex
Getting started with with SharePoint Syntex
 
Transition to a modern data platform
Transition to a modern data platform Transition to a modern data platform
Transition to a modern data platform
 
System design
System designSystem design
System design
 
Subhasis Mukherjee
Subhasis Mukherjee Subhasis Mukherjee
Subhasis Mukherjee
 
Scaling Knowledge Graph Architectures with AI
Scaling Knowledge Graph Architectures with AIScaling Knowledge Graph Architectures with AI
Scaling Knowledge Graph Architectures with AI
 
Microsoft SharePoint Syntex
Microsoft SharePoint SyntexMicrosoft SharePoint Syntex
Microsoft SharePoint Syntex
 
A Busy Lawyer’s Guide to Managing Documents and Court Forms
A Busy Lawyer’s Guide to Managing Documents and Court FormsA Busy Lawyer’s Guide to Managing Documents and Court Forms
A Busy Lawyer’s Guide to Managing Documents and Court Forms
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
 
Data Ops at TripActions
Data Ops at TripActionsData Ops at TripActions
Data Ops at TripActions
 

Kürzlich hochgeladen

Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 

Kürzlich hochgeladen (20)

Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 

Introducing Compreno - Natural Language Processing Technology

  • 1. ABBYY Compreno: Driving Impact from Unstructured Information Analytics. <NAME> <DATE>
  • 2. ABBYY Worldwide. Global: 16 offices with more than 1,250 employees in Europe, the USA, Asia, Australia and Russia. Innovative: 27% of revenue invested in R&D; more than 400 developers and scientists. Reliable: strong and independent core technology that evolves with the needs of the digital revolution. Connected: trusted partner to over 1,000 companies in more than 150 countries around the world. Successful: more than 40 million software users process more than 9 billion pages per year with ABBYY products. Enabling: recognise, capture, (translate), analyse; we transform information into action.
  • 3. Digital Universe. 2.5 exabytes of data generated every day = 2.5 million terabytes = 2.5 × 10^18 bytes (source: Northwestern University, 2016). The majority (ca. 80%) is unstructured. 1.4 × 10^14 Word pages; 3.5 × 10^13 PPT slides; 2 × 10^13 PDF pages (image & text); 2 × 10^14 emails; 4 × 10^13 scanned pages; 3 × 10^13 images (.tiff); 1.4 × 10^16 .txt files (source for average file sizes: netdocuments.com, 2016). Examples: reports, brochures, datasheets, presentations, research documents, service documents, pricelists, process descriptions, project descriptions, product feature specifications, customer communication, accident/security reports, contracts, email, web texts, articles in magazines, complete intranets …
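The unit conversion on slide 3 can be checked with a few lines of integer arithmetic. This is only a sketch: it assumes decimal (SI) units and treats "ca. 80%" as exactly 80%, which the slide does not state.

```python
# Decimal (SI) units: 1 TB = 10**12 bytes, 1 EB = 10**18 bytes.
TERABYTE = 10**12
EXABYTE = 10**18

# 2.5 EB per day, kept as an exact integer to avoid float rounding.
daily_bytes = 25 * EXABYTE // 10
daily_terabytes = daily_bytes // TERABYTE

# Assumption: "ca. 80%" is taken as exactly 80% for illustration.
unstructured_bytes = daily_bytes * 80 // 100

print(f"{daily_bytes:.1e} bytes per day")
print(f"{daily_terabytes:,} TB per day")
print(f"{unstructured_bytes:.1e} bytes of unstructured data per day")
```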
  • 4. Unstructured Content I. What do unstructured documents have in common? ● They are composed in natural language. What is the problem with natural language? ● It is complex to analyse and summarise ● It does have a structure, but the structure is not standardized (different people use different terms, expressions and syntax to talk about the same thing) ● Content is unexpected and cannot be processed with rules ● Limited or no metadata
  • 5. Unstructured Content II ● The computer does not know what the document is about, and there is no source to get this information from ● Information is “locked” within documents ● Information may be valuable, confidential, business-critical or defensibly deletable, but it is difficult to find and manage. There is no business value in content that can’t be analysed or found. Natural language requires dedicated processing technology.
  • 6. ABBYY Compreno. What is it? Natural Language Processing (NLP) technology. What does it do? Advanced automated text analysis ● Gathers information about a document from the document ● Understands meaning of words within context ● Reveals relationships between words ● Builds stories across documents ● Extracts insights and intelligence from unstructured text
  • 7. How Compreno works: key components. Semantics: semantic analysis is used to interpret syntactic structures in terms of universal, language-independent concepts and their relations. Syntax: identifies formal relations among words in a sentence or across several sentences. The system analyzes a text and builds a tree of syntactic relations. Statistics: data gleaned from parallel and monolingual corpora are used for training the analysis algorithms and verifying and expanding the formal descriptions available to the system.
  • 8. ABBYY Compreno: platform for document understanding. Core uses of Compreno technology ● Classify unstructured documents ● Identify and extract entities, facts and events from texts
  • 10. What is classification? Categorisation based on particular shared features (example categories: mammals, birds, reptiles, fish)
  • 11. How document classification works: three main steps. Training: set up model, define categories, select/collect training documents, train model, choose best algorithm. Test and tune: analyse test results, eliminate mistakes, adjust training set, retrain model. Classification: deploy model to production, classify documents.
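The three steps on slide 11 (train, test, classify) can be illustrated with a toy classifier. The sketch below uses a small bag-of-words Naive Bayes model written with the Python standard library; this is a stand-in chosen for brevity, not the algorithm Smart Classifier uses, and the sample documents and category names are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Step 1, training: count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in labelled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Step 3, classification: pick the category with the highest log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for token in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def accuracy(model, control_docs):
    """Step 2, test and tune: share of control documents classified correctly."""
    hits = sum(classify(model, text) == category for text, category in control_docs)
    return hits / len(control_docs)


# Invented sample data, for illustration only.
training_set = [
    ("invoice total amount due payment", "invoice"),
    ("payment due invoice number amount", "invoice"),
    ("agreement between parties contract term", "contract"),
    ("contract parties obligations term agreement", "contract"),
]
control_set = [
    ("amount due on this invoice", "invoice"),
    ("the parties sign the agreement", "contract"),
]

model = train(training_set)
control_accuracy = accuracy(model, control_set)
new_document_category = classify(model, "please settle the payment amount")
print(control_accuracy, new_document_category)
```

If the accuracy on the control set were too low, the "test and tune" step would mean adjusting the training set and calling `train` again before classifying new documents.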
  • 12. Document classification: why? Essential step in information management. Enables advanced analysis and decision-making. Generates business value.
  • 13. Why is classification not as easy as it seems? Building up a reliable classification workflow is difficult. Big content and unstructured documents bring technical challenges (big training sets, complex algorithms, difficult integration) and business challenges (traditional classification methods don’t do the job; high investment is needed to build and maintain the required rule sets and classification schemes, which takes expert classification knowledge). New, dedicated processing methods are required.
  • 14. ABBYY Smart Classifier: next-generation document classification ● Text classification module for organising unstructured documents ● Assigns unseen documents to predefined categories based on statistical, morphological and semantic analysis ● Uses supervised machine learning to produce a classification model from sample inputs ● Classification creates metadata derived from the document context
  • 15. Smart Classifier features and values. Unstructured information processing ● Unlock information ● Make content searchable, accessible and retrievable. Automated classification ● High speed ● Constant quality ● No manual work. Semantic-based classification ● Deep text analysis techniques employed for even more accurate classification
  • 16. Smart Classifier features and values. Machine learning ● System learns automatically based on the training documents ● No particular knowledge required to set up classification ● No specification of rules necessary ● Small training sets. Automatic algorithm optimisation ● Selection of the best-performing algorithm for each document set
  • 17. Smart Classifier features and values. Simple UI ● No specific knowledge required to create a model, train the system and launch a classification workflow. Input document formats and languages ● Process content regardless of original format ● OCR for processing of images ● 39 classification languages
  • 18. IT integration of Smart Classifier: leverage existing systems and infrastructure
  • 19. Smart Classifier workflows. Create and deploy classification model: 01 | Category definition and selection of sample documents 02 | Setup of classification model 03 | Model training 04 | Model testing, quality evaluation and tuning 05 | Deployment to production. Document classification workflow (see slide 27)
  • 20. 01| Category definition and selection of sample documents ● Category = a group of documents that have particular shared features ● Category definition is a management decision; no special IT skills are required ● Content and process experts select representative documents for each category ● Minimum: 10 documents per category ● For reliable statistics: about 100 documents per category ● Representative sample of documents: documents must be typical for their category. The more representative of its category a document is, the better the model will perform (garbage in, garbage out) ● The proportion of documents assigned to each category should be the same as in the collection of documents to be classified ● Smart Classifier accepts many formats: plain text, Office, HTML, XML and PDF (image formats are submitted to OCR) ● Folder structure: each (sub-)category = dedicated (sub-)folder ● Create training set and control set and save them as ZIP files
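The folder convention on slide 20 (one folder per category, at least 10 documents each, saved as a ZIP file) can be prepared with a short script. The sketch below is an assumption about how one might do this with the Python standard library; the function names, folder names and file names are hypothetical, and the exact archive layout Smart Classifier expects should be taken from its documentation.

```python
import tempfile
import zipfile
from pathlib import Path

MIN_DOCS_PER_CATEGORY = 10  # minimum stated on slide 20


def category_counts(root):
    """Count the files directly inside each (sub-)category folder under root."""
    root = Path(root)
    counts = {}
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        n_files = sum(1 for f in folder.iterdir() if f.is_file())
        if n_files:  # folders that only hold sub-folders are skipped
            counts[folder.relative_to(root).as_posix()] = n_files
    return counts


def undersized(counts, minimum=MIN_DOCS_PER_CATEGORY):
    """Return the categories that have fewer documents than the minimum."""
    return sorted(c for c, n in counts.items() if n < minimum)


def zip_folder(root, zip_path):
    """Write every file under root into a ZIP, keeping the folder structure."""
    root = Path(root)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(root.rglob("*")):
            if file.is_file():
                archive.write(file, file.relative_to(root).as_posix())
    return zip_path


# Demonstration on a throwaway folder with invented categories.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "training_set"
    for category, n_docs in [("invoices", 12), ("contracts", 3)]:
        (root / category).mkdir(parents=True)
        for i in range(n_docs):
            (root / category / f"doc_{i}.txt").write_text("sample text")

    counts = category_counts(root)
    too_small = undersized(counts)
    # The archive is written outside root so it does not include itself.
    archive_path = zip_folder(root, Path(tmp) / "training_set.zip")
    with zipfile.ZipFile(archive_path) as archive:
        n_archived = len(archive.namelist())

print(counts, too_small, n_archived)
```

The same two functions can be run on the control set; a category flagged by `undersized` needs more sample documents before training.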
  • 21. 02| Setup of classification model ● The classification model defines how, and by which categories, document classification will be performed ● Model creation via Model Editor web UI or REST API (code samples included in documentation) ● Set parameters: document language (39 languages supported); category assignment (which category is assigned to the document if more than one is returned as a candidate); quality criteria (trade-off between precision and recall)
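The category-assignment and quality parameters on slide 21 can be pictured as a rule over candidate categories and their confidence values. The sketch below is hypothetical: the function, the `threshold` parameter and the confidence scores are illustrative assumptions, not Smart Classifier settings.

```python
def assign_category(candidates, threshold=0.5, fallback=None):
    """Pick one category from {category: confidence} candidates.

    Hypothetical rule: take the most confident candidate, but only if its
    confidence reaches the threshold; otherwise return the fallback.
    """
    if not candidates:
        return fallback
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= threshold else fallback


# Invented confidence values for a document with two candidate categories.
candidates = {"contract": 0.62, "invoice": 0.58}

lenient = assign_category(candidates, threshold=0.5)
strict = assign_category(candidates, threshold=0.7, fallback="unclassified")
print(lenient, strict)
```

A higher threshold leaves more documents unassigned but makes the assigned ones more reliable (higher precision, lower recall); a lower threshold does the opposite.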
  • 22. 02| Setup of classification model: Model Editor web interface
  • 23. 03| Model training ● Load training documents ● Train classification model ● Machine learning: the system automatically identifies and uses the most relevant features from the training documents for creating the classification model
  • 24. 04| Model testing, quality evaluation and tuning ● Load and test control set to determine whether the training process was successful ● Classification results in the control set must meet expectations before the model can be deployed ● Model Editor provides instant visibility of each document within a classification project ● Source text and key words picked by the algorithms can be analysed and checked ● Terms that should be ignored during classification can be added to a stop word list ● Analyse: F-measure, precision, recall ● Debug: confidence level, selected keywords ● Adjust: inclusiveness, stop words, documents in classes (re-assign category) ● Upload further training/control documents
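The quality measures named on slide 24 have standard definitions: precision is the share of documents assigned to a category that really belong to it, recall is the share of documents belonging to a category that were assigned to it, and the F-measure (here F1) is their harmonic mean. A minimal sketch with invented control-set results:

```python
def precision_recall_f1(expected, predicted, category):
    """Compute precision, recall and F1 for one category."""
    pairs = list(zip(expected, predicted))
    true_pos = sum(e == category and p == category for e, p in pairs)
    false_pos = sum(e != category and p == category for e, p in pairs)
    false_neg = sum(e == category and p != category for e, p in pairs)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Invented control set: four invoices and two contracts.
expected = ["invoice", "invoice", "invoice", "invoice", "contract", "contract"]
predicted = ["invoice", "invoice", "contract", "contract", "contract", "contract"]

invoice_scores = precision_recall_f1(expected, predicted, "invoice")
contract_scores = precision_recall_f1(expected, predicted, "contract")
print(invoice_scores, contract_scores)
```

In this example every document labelled "invoice" is correct (precision 1.0) but half of the invoices are missed (recall 0.5), while "contract" shows the mirror image.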
  • 25. 04| Model testing, quality evaluation and tuning
  • 26. 05| Deployment to production ● When the model is deployed, it becomes available via the Compreno REST API ● If you make changes to the model, it needs to be retrained for the changes to become effective
  • 27. Document classification workflow. Once the system is set up and a classification model is published for operation, incoming classification tasks will be accepted: 01| A new document classification task is created 02| The document is converted into an internal format 03| The document is classified 04| The document classification results are saved 05| The task is completed
  • 28. Document classification workflow ● Classification results in Model Editor
  • 29. Smart Classifier application scenarios: enterprise content management and its subdomains. Archiving, records management (information governance), document management, enterprise search ● Classification of incoming and stored documents ● Definition of category-based access rights and retention policies ● Search enhancement
  • 30. Smart Classifier application scenarios: information lifecycle (Create, Capture, Manage, Store, Archive, Dispose) ● Classification of incoming documents: add documents to the system that have a value, i.e. are enhanced with metadata ● Category-based routing and distribution ● Classification to improve enterprise search systems: add class to search index ● Classification for aid in risk mitigation: category-based document access rights, category-based disposal policy ● Classification for aid in compliance: category-based retention policy ● Post-process: classification for metadata correction; classification of legacy content for data improvement
  • 31. Smart Classifier application scenarios. Data migration ● Organise content before, during or after migration. Client support ● Category-based prioritisation and routing of client issues shorten response times. eDiscovery ● Quickly gather and prepare documents. Mailroom ● Automatically select the most suitable processing workflow. E-mail management ● Additional metadata facilitates and accelerates routing
  • 32. Smart Classifier benefits for all enterprises: create access to information; efficient information management; aid compliance & risk mitigation; cost efficiency
  • 33. Smart Classifier benefits for ISVs: create better customer applications; quick ROI
  • 34. Smart Classifier benefits for BPOs: accelerate business processes; easier cost calculation
  • 35. ABBYY Compreno: platform for document understanding. Core uses of Compreno technology ● Classify unstructured documents ● Identify and extract entities, facts and events from texts
  • 36. ABBYY InfoExtractor SDK ● Information extraction module for processing natural language texts ● Natively processes unstructured documents and accesses the embedded textual information ● Identifies different facts, entities and the relationships between them ● Automatically extracts critical data ● Combines related data into facts
  • 37. How InfoExtractor works I: from text to semantics. Lexical analysis: convert a sequence of characters into a sequence of words. Morphological analysis: analyse the structure of words and parts of words. Syntactic parsing: determine the structure of the input text; understand how concepts relate to one another within one or more sentences. Semantic parsing: contextual analysis, i.e. obtaining and representing the meaning of a sentence; derive the meaning of a sentence by understanding the context and the “speaker's” intent. Universal Semantic Hierarchy: language-independent hierarchy of concepts that reflects the meaning and relations of words and sentences. An ontology is a formal representation of concepts and the relationships between those concepts.
  • 38. How InfoExtractor works II: identify relationships between words. Connect entities with other entities and facts, even if the words that define them are replaced with pronouns or omitted in the text. Example: The company has denied reports it is preparing to default on its loans if it cannot reach agreement on its bailout terms with international creditors. Get the complete story.
  • 39. How InfoExtractor works III: define the contextual meaning of a word. Gather only relevant facts.
  • 40. How InfoExtractor works IV: detect omitted words. Example: Some people work with PDF documents but not all employees do. Don’t miss any valuable facts.
  • 41. InfoExtractor features and values. Natural Language Processing ● Understand the meaning of words and relations between them. Extraction of entities and events ● Extract the facts and story lines embedded in unstructured information ● Persons, organisations, dates ● Deals, purchases, employment details. Identify relationships between entities and events ● Contracting parties, subject of the contract, financial figures
  • 42. InfoExtractor features and values. Basic and custom ontologies ● Basic ontologies including widely used words ● Custom ontologies for industry solutions. Customized entities for specific cases ● Custom ontology dictionaries to extract complicated examples of entities (e.g. Asian names or companies). Input document formats and languages ● Work with text regardless of source ● English, Russian, German ● OCR embedded for image processing
  • 43. IT Integration of InfoExtractor
  • 45. InfoExtractor application scenarios. Contract Management ● Use case: mass contract ingestion ● Document type: contract ● Customer: ISVs, service providers ● Benefit: extend service offering and increase revenues. Customer On-Boarding ● Use case: capture and upload customer information at point of entry into the system ● Document type: statutory documents, contracts ● Customer: banks, insurance companies ● Benefit: accelerate document processing
  • 46. InfoExtractor application scenarios. Applicant Tracking ● Use case: tag and upload CVs to improve search ● Document type: CV ● Customer: HR departments ● Benefit: minimise resources required to process all the necessary CVs. Credit Risk Mitigation ● Use case: decide on providing loans; check various sources of information on potential loan customers ● Document type: contracts, statutory documents, court decisions ● Customer: banks ● Benefit: accelerate document processing
  • 47. InfoExtractor benefits: get decision-critical information at lower cost and with less effort. Benefit areas: intelligence and insights; aid predictive decision making; uncover hidden risks; cost efficiency ● Use analytics to create new value out of existing and new data ● Get the big picture by connecting entities, facts and events across documents ● Accelerate and automate content upload and analysis to optimise manual processes ● Take critical decisions faster based on relevant information
  • 48. Summary. Good classification and information extraction let organisations solve tasks they are not capable of solving at the moment. Smart Classifier and InfoExtractor make document classification and information extraction simple.
  • 49. Licensing ● Smart Classifier and InfoExtractor are available for testing via a time- and volume-limited trial license ● Different license models: perpetual with software maintenance; subscription (yearly); OEM licensing ● Standard license model based on renewable peak volume ● Backend can be scaled up

Editor's notes

  1. ABBYY is a leading provider of text recognition and document conversion technologies and services. Operating globally, ABBYY is headquartered in Moscow, Russia, with offices in Germany, the UK, the United States, Canada, Ukraine, Cyprus, Australia, Japan and Taiwan. ABBYY offers a broad range of solutions designed for specific business and industry needs, ideally suited to meet their individual requirements while seamlessly integrating in internal workflows. Organisations all over the world use ABBYY solutions to optimise their paper-intensive business processes.
  2. Key components
     ABBYY Compreno uses three major components: semantics (in the form of a language-independent hierarchy of concepts), syntax (i.e. the ability to understand how concepts relate to one another within one or more sentences) and statistical data, which is used for combining words into natural-sounding sequences and as an aid in sense disambiguation.
     Language-independent hierarchy of concepts = Universal Semantic Hierarchy (USH)
     Key to ABBYY’s Compreno technology is the idea that people speak in different languages but think using similar concepts. For example, all people live in houses, have furniture, use phones or drive cars. These concepts are common to all people and are language-independent. Therefore, we can build a semantic hierarchy of concepts that will work for all languages. The ABBYY Compreno semantic hierarchy is a tree-like structure, with the thick branches representing more general concepts (e.g. “furniture”) and the thin branches representing more specific concepts (e.g. “bed”, “cupboard”, “chair”). This tree-like structure contains information about the combinability of its items and allows them to inherit properties from their parents. This approach helps resolve ambiguities during translation and provides more relevant search results. For example, there are different branches for the verb “to possess” in the hierarchy, one describing the idea of owning material things, and the other the ability of ideas, emotions and the like to dominate somebody’s mind.
     Syntax
     The syntax component detects how concepts are related to one another within one or more sentences. The system analyzes texts and builds a tree of syntactic relations. To make syntactic parsing more accurate, ABBYY Compreno also relies on semantic analysis that makes use of the hierarchy of concepts described above. Joint use of the above components enables the system to “understand” sentences and either extract knowledge from them or express this understanding in another language.
     Statistics
     The third major component is statistics. ABBYY Compreno uses statistical data to generate natural-sounding word combinations and to better resolve ambiguities, which is necessary for correct parsing. Statistics are also used to distinguish homonyms in cases when even the semantic component does not provide a reliable answer. The statistical component uses texts of different genres and registers to reduce the likelihood of error and misinterpretation.
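The tree-like hierarchy with inherited properties described in this note can be pictured with a small Python sketch. It is an illustration only: the `Concept` class, its fields, the property names and the sense labels are assumptions made for this note and do not reflect the actual data model of the Universal Semantic Hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """A node in a toy concept tree (hypothetical, not ABBYY's data model)."""
    name: str
    parent: Optional["Concept"] = None
    properties: dict = field(default_factory=dict)

    def lookup(self, key: str):
        """Return a property value, inheriting from ancestors when unset."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return None

    def ancestors(self) -> list:
        """Names from this concept up to the root, most specific first."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


# General concepts sit near the root; specific ones are further out.
thing = Concept("physical_object", properties={"kind": "material"})
furniture = Concept("furniture", parent=thing, properties={"found_in": "house"})
bed = Concept("bed", parent=furniture)
idea = Concept("idea", properties={"kind": "mental"})

# Two senses of "to possess", each on its own branch, keyed by the kind
# of argument it combines with (invented labels for illustration).
POSSESS_SENSES = {"material": "to_own", "mental": "to_dominate_mind"}


def possess_sense(argument: Concept) -> str:
    """Pick the sense of 'to possess' that combines with the argument."""
    return POSSESS_SENSES[argument.lookup("kind")]


print(bed.ancestors())         # ['bed', 'furniture', 'physical_object']
print(bed.lookup("found_in"))  # house (inherited from furniture)
print(possess_sense(bed))      # to_own
print(possess_sense(idea))     # to_dominate_mind
```

The point of the sketch is that "bed" carries no properties of its own, yet the lookup still resolves them through its parents, which is what lets a specific concept reuse the combinability information of a general one.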
  3. ABBYY Compreno is a natural language processing (NLP) technology that enables you to extract insights and intelligence from unstructured text. ABBYY Compreno technology “understands” the meaning of words, reveals the relationships between them within content and uses this understanding to provide comprehensive text analysis that accurately identifies entities, facts, events and the relationships between them to discover the stories within textual documents.
  4. Why do we need content classification at all?
     Classification is an essential step in almost any kind of information or content management process:
     ● Content can be routed through a process or assigned to a specific workflow according to class.
     ● Category-tagged content enhances enterprise search systems and allows knowledge workers to navigate through and retrieve information from huge repositories of data.
     ● Categories can be used in archiving content.
     Classification enables enterprises to leverage content; it creates access to information. In the classification process, incoming or stored content is recognised, differentiated and categorised for the purpose of further processing. Classification provides the basis for advanced text analysis, information extraction and information-based decision making. Classification not only helps businesses manage the tidal wave of data but also generates business value.
  5. If classification is such an important step in information management, why do so few organisations actually practice it? Why is classification evidently not as easy as it seems? We can best answer this question by looking at the challenges enterprises face when it comes to content classification.
     Big Content
     Today, the volume, velocity and variety of content generation are constantly increasing. Enterprises have to deal with huge data volumes that they need to process and store. The more data there is, the harder it gets to search for and locate critical data.
     Unstructured format
     The vast majority of information today is unstructured and composed in natural language. The problem with this type of content is that it is difficult to analyze and summarize because information is not standardized but unexpected and cannot be processed with extraction rules. As there is no or only limited metadata, the computer does not know what a document is about. The information is literally locked within the format and therefore unsearchable: information that may be valuable, confidential, business-critical or defensibly deletable, but is difficult to find and manage. As a consequence, there is no business value in content that can’t be analyzed or found.
     These challenges come along with a variety of technical challenges:
     ● Training a classification system requires many documents.
     ● Classification algorithms are hard to understand and parameter tuning is complex (if you do not know how certain algorithms behave, how do you know whether you can trust and depend on the results?).
     ● Integration with existing enterprise systems and platforms is complicated or not possible at all (scientific classification libraries often work with plain text only, with no support for office formats, PDFs or images).
     These in turn entail business challenges:
     ● Traditional classification (manual, rule based) cannot meet these requirements any more. Manual classification is expensive, slow and inconsistent (accuracy differs between individuals), and its quality deteriorates with increasing volumes and time pressure. Rule based systems are basically unworkable for Big Content.
     ● High investments are required because classification is a complex domain and typically requires a skilled expert for setting up the classification workflow and developing, training and tuning the classification algorithm(s).
     All this causes most classification projects to go unfinished. To successfully manage these challenges and build up a reliable classification workflow, new, dedicated processing technologies are required.
  6. How does Smart Classifier solve the problem? Smart Classifier is a new, high-quality text classification module that has been designed for processing unstructured documents. Smart Classifier assigns unseen documents to predefined categories based on morphological, statistical and semantic analysis of the extracted text. It uses supervised machine learning to automatically identify and use the most relevant features from a set of training documents, i.e. sample inputs, to build the classification model. Smart Classifier derives information about a document from its content and adds this information to the document as metadata. The classification result is a probability score for one or multiple categories.
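Since the classification result is a probability score per category, a consuming system has to decide how to turn scores into category assignments. The Python sketch below shows two common options, a single best category and every category above a threshold. The function names, the score values and the 0.5 threshold are made up for illustration and are not Smart Classifier output or defaults.

```python
from typing import Dict, List


def top_category(scores: Dict[str, float]) -> str:
    """Single-label decision: the category with the highest probability."""
    return max(scores, key=scores.get)


def categories_above(scores: Dict[str, float], threshold: float) -> List[str]:
    """Multi-label decision: every category at or above the threshold,
    ordered from most to least probable."""
    chosen = [name for name, p in scores.items() if p >= threshold]
    return sorted(chosen, key=scores.get, reverse=True)


# Made-up scores for one document; not real Smart Classifier output.
scores = {"contract": 0.81, "invoice": 0.12, "correspondence": 0.55}

print(top_category(scores))           # contract
print(categories_above(scores, 0.5))  # ['contract', 'correspondence']
```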
  7. Unstructured information processing
     Smart Classifier enables enterprises to unlock information from unstructured documents, turn it into an asset and use it to their advantage. In the classification step, content is converted to a searchable format and tagged with contextual metadata.
     Automated classification
     Automated classification overcomes most problems associated with manual classification:
     ● High speed: quickly classify incoming documents and huge backlogs/repositories.
     ● Constant quality: manual classification quality deteriorates significantly under tight timelines and varies between people.
     ● No manual work: knowledge workers can focus on problem solving.
     Semantic based classification
     Smart Classifier combines linguistics and statistics with semantic analysis for even more accurate classification. This functionality is currently available for Russian and English (German to come).
  8. Machine learning
     Smart Classifier applies machine learning algorithms to automatically train on small sets of sample documents and select the most appropriate classification features, i.e. it determines which features within the sample documents characterise each category. The setup, training and deployment of classification in Smart Classifier do not require any specific knowledge. It is not necessary, as with traditional rule based systems, to specify rule sets or to manually train and tune models with huge quantities of training documents. The documents used for model training do not need to be pre-processed in any way.
     Automatic algorithm optimisation
     During the machine learning phase, Smart Classifier automatically tests multiple algorithms and selects the best-performing model and classification parameters for each document set. This makes the time-intensive process of manual model tuning obsolete.
  9. Simple UI
     The Model Editor web interface lets any business user create and tune classification models easily and quickly. Via Model Editor you can:
     ● Create classification projects
     ● Set up classification models
     ● Load training documents
     ● Train models
     ● Evaluate classification performance/quality check
     ● Refine models
     Code samples for the Model Editor UI are included in the documentation. The admin console provides an interface for IT staff to administer Smart Classifier.
     Variety of input document languages and formats
     Smart Classifier natively processes a large variety of document formats including plain text, Microsoft® Office formats, HTML, PDFs, images, XML, and more. Image formats are pre-processed with OCR to extract text. Smart Classifier extracts the plain text from documents and uses it for classification. The extracted text can be saved for further processing or re-classification. Smart Classifier offers automatic language detection and document classification for all major European and Asian languages.
  10. Smart Classifier comprises multiple components for setup, training and administration of classification models and processing of classification tasks.
      Processing components:
      ● Control Server/Service - System service that distributes tasks among the Processing Services.
      ● Processing Station/Service - System service that processes documents in tasks assigned by the Control Service.
      ● Admin Console - Administrative tool for managing ABBYY Smart Classifier (user accounts, licenses, tasks, event log).
      ● Classification Model Server/Compreno Technology Module - Software component that contains classification algorithms and information extraction rules.
      ● (Smart Classifier Data Service - System service that enables working with classification models.)
      Setup and training:
      ● Model Editor - Web-based user interface for creating and managing classification projects and models.
      Smart Classifier exists as a stand-alone entity, an external brain so to speak. It works as a service, is not domain specific and does not require a hard-coded classification workflow. Smart Classifier can process content from multiple sources like internal file share, email server, document repository, DMS, RMS.
      Through its simple REST API, Smart Classifier can easily be integrated into an existing IT environment. Classification tasks and results are exchanged via the REST API:
      ● Communication is carried out via HTTP calls that produce responses in JSON or RDF/XML format.
      ● Classification tasks can be submitted in synchronous, asynchronous or batch (.zip file) mode, depending on their amount and complexity.
      ● The REST API can also be used for classification model setup, training and quality check (license parameter).
      Smart Classifier provides two output formats for classification results, JSON or RDF/XML. Results include information such as the name of the classification model, categories with their probabilities, confidentiality flags, feature/word lists, access to the raw text (add-on license parameter) or error messages. This information needs to be further processed in existing systems, workflows and solutions in order to derive value from it.
      Scalable, server based architecture
      Smart Classifier is based on a scalable backend, capable of processing large amounts of files. For a high throughput, it can be scaled both horizontally and vertically with additional processing resources. The maximum horizontal scalability is 20 processing services.
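As a rough idea of what "further processing" of a JSON result could look like on the consuming side, here is a minimal Python sketch. The payload layout and every field name in it (`model`, `categories`, `name`, `probability`, `error`) are assumptions invented for this example. The notes only state which kinds of information a result contains, not its schema, so the real format has to be taken from the product documentation.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ClassificationResult:
    model: str
    categories: Dict[str, float]
    error: Optional[str] = None


def parse_result(payload: str) -> ClassificationResult:
    """Parse a JSON result string using hypothetical field names."""
    data = json.loads(payload)
    return ClassificationResult(
        model=data.get("model", ""),
        categories={c["name"]: float(c["probability"])
                    for c in data.get("categories", [])},
        error=data.get("error"),
    )


# Hypothetical payload, not a real Smart Classifier response.
SAMPLE = '''
{"model": "document-types",
 "categories": [{"name": "contract", "probability": 0.81},
                {"name": "invoice", "probability": 0.12}]}
'''

result = parse_result(SAMPLE)
if result.error:
    print("classification failed:", result.error)
else:
    best = max(result.categories, key=result.categories.get)
    print(result.model, best, result.categories[best])
    # document-types contract 0.81
```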
  11. 1. A new document classification task is created. Tasks are created using the REST API. The Control Service chooses one of the available Processing Services and allocates the task to it. The task is then sent to the Processing Service.
      2. The document is converted into an internal format. The Processing Service converts the document into an internal format. If any text in the document requires optical character recognition (OCR), the station uses a built-in component to recognize the text. The availability of the OCR feature is determined by your current license.
      3. The document is classified. An executor requests the binary representation of the trained model from the Smart Classifier Data Service and classifies the document using the model.
      4. The document classification results are saved. The classification results are saved to an RDF/XML or a JSON file.
      5. The task is completed. The Control Service receives the RDF/XML or JSON file from the Processing Service and flags the task as completed. The task results may be obtained by means of the REST API.
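The five steps above can be mimicked with a toy in-memory simulation in Python. Everything in it is a stand-in: the class names, the round-robin allocation (the notes only say that the Control Service "chooses one of the available Processing Services") and the keyword-based "classification" are assumptions for illustration and say nothing about how the product is implemented.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List, Optional


@dataclass
class Task:
    task_id: int
    document: str
    status: str = "created"
    result: Optional[dict] = None


class ProcessingService:
    """Stand-in for a processing service; classification is faked."""

    def __init__(self, name: str):
        self.name = name

    def process(self, task: Task) -> dict:
        text = task.document.lower()  # step 2: "convert" to plain text
        category = "invoice" if "invoice" in text else "other"  # step 3
        return {"category": category, "processed_by": self.name}  # step 4


class ControlService:
    """Allocates tasks to processing services in round-robin order
    (the allocation strategy is an assumption of this sketch)."""

    def __init__(self, services: List[ProcessingService]):
        self._next_service = cycle(services)
        self.tasks: Dict[int, Task] = {}

    def submit(self, document: str) -> int:
        task = Task(task_id=len(self.tasks) + 1, document=document)  # step 1
        self.tasks[task.task_id] = task
        service = next(self._next_service)
        task.result = service.process(task)
        task.status = "completed"  # step 5
        return task.task_id


control = ControlService([ProcessingService("ps-1"), ProcessingService("ps-2")])
first = control.submit("Invoice no. 17 for consulting services")
second = control.submit("Meeting notes")
print(control.tasks[first].result)   # {'category': 'invoice', 'processed_by': 'ps-1'}
print(control.tasks[second].result)  # {'category': 'other', 'processed_by': 'ps-2'}
```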
  12. Smart Classifier can be deployed in a variety of scenarios across processes, workflows and projects.
      Enterprise Content Management
      Probably every enterprise practices some form of enterprise content management, be it a file share, a simple workflow or a fully fledged ECM solution. Enterprise content management is an umbrella term that encompasses, amongst others, archiving, records management (today called Information Governance), document management and enterprise search. High-performance classification of unstructured content allows us to quickly organise large repositories and enables knowledge workers to efficiently search and locate information critical to their work. In this context, Smart Classifier can be applied to the following tasks:
      ● Classify incoming documents so that content added to the system has value, i.e. is tagged with metadata.
      ● Once classified, incoming documents can be routed to their respective recipients based on category.
      ● Organise legacy content in projects: identify and remove redundant, obsolete and trivial (ROT) content.
      ● Ensure compliance with regulatory and audit requirements by defining category-based document access rights to guarantee data security and category-based retention policies, i.e. ensure that every important document is stored as long as it should be in accordance with the records management policies (defensible disposal).
      ● Search enhancement: generate additional metadata from incoming and archived content and let knowledge professionals easily search and retrieve critical content via new facets.
  13. Besides enterprise content management, there are other potential application scenarios for Smart Classifier:
      ● Data migration: organise content before, during or after migration (what to take and what to leave behind), identify and remove duplicate and unnecessary content, and reduce the volume of content to be migrated. Enterprises go through events such as M&As, corporate restructuring, system migration, system/storage consolidation, digitisation projects and more that trigger the need for content migration.
      ● Client support: faced daily with tons of client issues, customer support employees need to classify, prioritise and route these. Automatic semantic-based classification helps by shortening response times and improving customer satisfaction and retention.
      ● eDiscovery: quickly gather and prepare documents for eDiscovery, audits and litigation.
      ● Mailroom: automatically select the most suitable processing workflow, e.g. data extraction, direct archiving, …
      ● E-mail management: organising e-mails manually is painful; missing business critical messages from customers or suppliers is even more painful. Metadata (such as "to", "from") is rarely good enough. Using both metadata and content, semantic-based classification automatically separates the "wheat from the chaff".
  14. We can derive the following benefits from Smart Classifier features and values.
      Create access to information
      Smart Classifier supports enterprises in accessing unstructured information, turning it into an asset and using it to their advantage. Content and process experts can set up and maintain the classification; no special IT skills are required. In unlocking information from the unstructured format, Smart Classifier makes content usable for downstream processes and routines. Classification provides the basis for advanced text analysis, information extraction and decision making.
      Efficient information management
      High-performance classification of unstructured content allows us to quickly organise large repositories and enables knowledge workers to efficiently search and locate information critical to their work. Automated classification with Smart Classifier greatly simplifies the entire classification process: it becomes easier, faster, more reliable and less costly. The quality of classification is always the same irrespective of workload. Smart Classifier enables enterprises to quickly organize and prioritize unstructured content with category-based document routing, archiving and filtering so that knowledge professionals can efficiently search and locate information critical for a variety of business tasks. Automatic routing of incoming documents accelerates processing by automatically selecting the most suitable category, workflow or responsible person.
      Aid compliance & risk mitigation
      Granular text- and semantic-based classification enables organisations to keep up with security, compliance and records management requirements. This is especially important given the impending EU GDPR regulation. Automatic content classification enables you to identify data that should be discarded or archived at a targeted, granular level. Keep only the data that has value and needs to be kept, and get rid of data silos that only add storage costs. Minimize the risk of data leakage or loss: set up data leakage protection and make sure your confidential data is under control, does not flow outside and cannot be accessed by outsiders, by applying content-aware, classification-based access rights to documents.
      Cost efficiency
      By implementing Smart Classifier, enterprises increase the automation of organizational processes while reducing processing costs. Less investment in manual work is required since most of the manual work associated with model training and tuning has been eliminated. Knowledge workers can now focus on problem solving. As a result, cost calculation becomes more reliable. Identify and delete content that is redundant, obsolete or trivial (ROT) to reduce the space needed for storage. Smart Classifier can be easily integrated into information management routines to leverage existing infrastructure and investments.
  15. Create better customer applications
      ● Extend the capabilities of the existing product portfolio with easy-to-use classification.
      ● Enhance the value proposition to your customers: be innovative, offer a new differentiator/USP.
      ● High usability: no special skills are required on the customer side to set up and maintain classification; content/process experts can do it.
      Quick ROI
      ● Fast and cost-effective tool deployment with detailed documentation and code samples.
      ● Leverage and build upon existing investments in classification.
  16. Accelerate business processes
      Enhance the efficiency of business processes to serve your customers better and faster.
      Easier cost calculation
      Automated classification makes cost calculation easier because no manual work has to be planned and paid for. Automated classification delivers constant quality of results regardless of volume fluctuations. Save your customers costs by reducing staff resources.
  17. Classification is the first step to advanced text analysis and understanding. Once classified and tagged with contextual metadata, information is ready for further processing like search and retrieval, automated routing, intelligent data extraction and decision-making. That brings us to the second ABBYY product powered by Compreno technology: InfoExtractor. ABBYY InfoExtractor is an information extraction module that “understands” the meaning of words and identifies and extracts critical information from unstructured texts.
  18. InfoExtractor takes up where Smart Classifier stops. It powers business tasks that require granular content analysis and understanding. InfoExtractor provides comprehensive text analytics by automatically identifying and extracting business-relevant information from your content. It delivers insights and intelligence from unstructured information like contracts and reports. InfoExtractor applies deep linguistic analysis to natural-language text to identify entities, persons, facts and relationships between them. However, not everything that is extracted from a sentence or document is wanted or needed. That’s why InfoExtractor “distills” the relevant information, facts and relationships. InfoExtractor is an SDK: the extraction logic is very customer, project and domain specific, and different purposes require different ontologies.
  19. ABBYY's new approach
      ABBYY InfoExtractor (based on Compreno technology) analyses the text with different linguistic and statistical approaches. This creates a large amount of metadata out of simple text. These “raw” linguistic hypotheses are then weighted and cross-checked against the embedded language and grammar rules. The best hypotheses are then matched against ABBYY's Universal Semantic Hierarchy to obtain the real (semantic) meaning of each word and the context in which it is used in the sentence.
  20. Natural Language Processing
      Powered by Compreno technology, InfoExtractor understands the meaning of words and relations between them.
      Extraction of entities and events
      InfoExtractor accurately extracts information like entities (e.g. persons, organisations or dates) and facts (e.g. deals, purchases, employment, family relationships) from unstructured texts.
      Identify relationships between entities and events
      InfoExtractor identifies relationships between entities and facts, like the subject of a contract (what is the contract about), who the involved parties are (related personal information) and what their roles (seller/buyer, employer/employee) are. Analyse the deal that links a buyer and a seller and identify the related personal info, contacts or financial figures.
  21. Basic and custom ontologies
      ● InfoExtractor SDK comes with basic ontologies that include widely used words.
      ● Industry ontologies for specific domains or tasks can be efficiently customized or created with the help of ABBYY professional linguistic services.
      Customized entities for specific cases
      Custom ontology dictionaries can be used to handle particularly tough cases such as rare Asian names of people and companies. New entities will automatically inherit existing extraction rules and require no additional descriptions.
      Input document formats and languages
      InfoExtractor natively processes a large variety of document formats including plain text, Microsoft® Office formats, HTML, PDFs, images, XML, and more. It extracts the plain text out of documents and uses it for analysis. InfoExtractor can process texts in English, Russian and German. Image formats are pre-processed with OCR to extract text.
  22. InfoExtractor is a server-based module that works as a standalone entity within existing IT systems or can be integrated into solutions. It works as a service, is not domain specific and does not require a hard-coded workflow. InfoExtractor can process content from multiple sources like internal file share, email server, document repository, DMS, RMS.
      InfoExtractor comprises multiple components for setup, training and administration of classification models and processing of classification tasks:
      ● Control Server/Service - System service that distributes tasks among the Processing Services.
      ● Processing Station/Service - System service that processes documents in tasks assigned by the Control Service.
      ● Technology Module - Software component that contains classification algorithms and information extraction rules.
      ● Admin Console - Administrative tool for managing ABBYY Smart Classifier (user accounts, licenses, tasks, event log).
      ● Custom Data Server - System service that enables working with semantic and ontology user dictionaries and optimizes the algorithm that calculates confidence scores for extracted data.
      Through its simple REST API, InfoExtractor can easily be integrated into an existing IT environment. Info extraction tasks and results are exchanged via the REST API:
      ● Communication is carried out via HTTP calls that produce responses in JSON or RDF/XML format.
      ● Tasks can be submitted in synchronous or asynchronous mode, depending on their amount and complexity.
      InfoExtractor provides two output formats for results, JSON or RDF/XML. The results contain information about entities, facts and events, confidentiality flags, access to the raw text (add-on license parameter) or error messages. This information needs to be further processed in existing systems, workflows and solutions in order to derive value from it.
      Scalable, server based architecture
      InfoExtractor is based on a scalable backend, capable of processing large amounts of files. For a high throughput, it can be scaled both horizontally and vertically with additional processing resources. The maximum horizontal scalability is 20 processing services.
  23. 1. A new information extraction task is created. The user creates an information extraction task using the ABBYY Compreno REST API. The Control Server chooses one of the available Processing Stations and allocates the task to it. The task is then sent to the Processing Station.
      2. The document is converted into the SDK’s internal format. The Processing Station converts the document into an internal format. If any text in the document requires optical character recognition (OCR), the station uses a built-in component to recognize the text. Your license determines whether or not the OCR function is available.
      3. The Processing Station performs a semantic analysis of the document. The analysis is performed by one of the executors. To increase performance, the document may be split into parts that can be processed by other executors and Processing Stations.
      4. Data is extracted from the document. When the semantic analysis completes, information extraction rules are applied to its results. The installed Information Extraction Module determines which data extraction algorithms are applied and which entities and facts are extracted.
      5. The information extraction results are saved. The extracted entities and facts are saved to an RDF/XML file and this file is sent to the Control Server.
      6. The task is completed. The Control Server receives the RDF/XML file from the Processing Station and flags the task as completed. The user can now access the extracted entities and facts via the REST API.
  24. Intelligence and insights
      ABBYY InfoExtractor SDK takes data analysis to an entirely new level, allowing companies to take advantage of the critical facts and story lines that are, literally, right in front of their eyes. They can now harvest the true value of their information while reducing manual efforts, streamlining processes and making more informed decisions based on a deeper, context-based understanding of the data. Knowledge workers navigate directly to the relevant facts, easily retrieve the exact information they need and spend less time on searching or manual content upload.
      Aid predictive decision-making
      The intelligence and insights InfoExtractor provides enable business professionals to take critical decisions faster. Intelligent text analysis algorithms deliver predictable results, free from potential human-related mistakes. However, when it comes to taking critical decisions, it is crucial to ensure the consistency and legitimacy of information extraction. Configurable confidence scores let you define which results should go through human validation to ensure that no piece of business-critical information is lost.
      Uncover hidden risks
      Connect entities, facts and events across documents to get the big picture of relationships between persons or organizations mentioned in various pieces of content. Manage obligations across numerous contracts, leading to more control over the possible risks.
      Cost efficiency
      InfoExtractor allows companies to accelerate and automate content upload and analysis to optimize manual processes and therefore stay competitive by serving and onboarding customers faster. Accelerate analysis of unstructured documents, including initial documents required for verifying new customers or transaction-related documents required for transaction legitimacy checks. Customers are enrolled and receive their services faster, bringing businesses higher revenues and building reputation.
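The idea of using confidence scores to decide which results go through human validation can be sketched in a few lines of Python. The `ExtractedFact` structure, the 0.0 to 1.0 score range, the sample values and the 0.85 threshold are all assumptions for illustration and are not taken from the InfoExtractor SDK.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExtractedFact:
    kind: str
    value: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def route(facts: List[ExtractedFact], threshold: float
          ) -> Tuple[List[ExtractedFact], List[ExtractedFact]]:
    """Split facts into (auto-accepted, needs human validation)."""
    accepted = [f for f in facts if f.confidence >= threshold]
    review = [f for f in facts if f.confidence < threshold]
    return accepted, review


# Made-up extraction results for one contract.
facts = [
    ExtractedFact("organisation", "Example Ltd", 0.97),
    ExtractedFact("contract_date", "2016-03-01", 0.62),
    ExtractedFact("role", "seller", 0.88),
]
accepted, review = route(facts, threshold=0.85)
print([f.kind for f in accepted])  # ['organisation', 'role']
print([f.kind for f in review])    # ['contract_date']
```

Every fact ends up in exactly one of the two lists, so low-confidence results are held for review rather than discarded.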
  25. Smart Classifier supports enterprises in accessing unstructured information, turning it into an asset and using it to their advantage. No special skills required - content and process experts can set up and maintain classification. InfoExtractor extracts critical information from unstructured data powering business tasks that require granular content analysis and understanding. Good classification and information extraction let organisations solve tasks they are not capable of solving at the moment. Based upon Compreno, Smart Classifier and InfoExtractor both have an innovative approach and are not domain specific but can be applied in a variety of information and content management scenarios within the entire scope of an enterprise.