Automated categorization of document content: the ...

GammaWare Technology

June 2002

Yiftach Ravid, VP R&D
GammaSite Inc.

yiftach@GammaSite.com


Overview

- The challenge

- Taxonomies

- Classification

- Categorization

- Focused Crawler

- Q&A


The challenge: Generate Structured Taxonomies of text repositories

[Diagram: information sources, labeled Internal DB, Information, Word, Application, Web Forms, XML, Services, Catalogues, Mail, Domino]

Generate a structured taxonomy of huge text repositories

What is a Taxonomy

 Taxonomy
 Taxis = arrangement or division
 Nomos = law

 The science of classification according to a predetermined system

 Best-known use of taxonomy is in Biology
 taxonomies of animals and plants


Web Taxonomy

 Best-known use of taxonomies:
 Web portals or Directories

 Internet sites classified into hierarchical topics

General:
• Yahoo! http://www.yahoo.com/

• Open Directory http://www.dmoz.org/

• LookSmart http://www.looksmart.com/r?country=uk

 Topical:
• Business.Com http://www.business.com/

• HealthWeb http://www.healthweb.org/

• Education Planet http://www.educationplanet.com/


Taxonomy vs. Thesaurus

Criteria    Taxonomy                              Thesaurus
Focus       Documents and their organization      Terms used in the organization
Usage       Classification of documents           Indexing documents
            (classified into categories/terms)    (terms are attached to documents)
Retrieval   Mainly browsing                       Keyword queries
Size        Restricted to the necessary terms     Very large (terms may be added freely)


What is a Classifier

Concept (Topic, Subject):
 An abstract or generic idea generalized from particular
instances [Merriam Webster]

Classifier:
 A function on a concept (category) and on an object
(document)
 Returns a number between 0 and 1 called confidence
rate
 Confidence rate: measures the confidence that the object (document) belongs to, i.e. should be classified under, the concept (category)
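
The definition above can be sketched in Python as a function type. This is a minimal illustration, not GammaWare code: the `Classifier` alias, the `KEYWORDS` table and `keyword_classifier` are hypothetical names, and keyword overlap is only a stand-in for a real scoring method. The point is the shape of the function: it takes a category and a document and returns a confidence rate between 0 and 1.

```python
from typing import Callable

# A classifier maps (category, document) to a confidence rate in [0, 1].
Classifier = Callable[[str, str], float]

# Hypothetical keyword sets, for illustration only.
KEYWORDS = {
    "tea": {"tea", "herbal", "brew"},
    "finance": {"bank", "loan", "stock"},
}

def keyword_classifier(category: str, document: str) -> float:
    """Fraction of the category's keywords that occur in the document."""
    keywords = KEYWORDS.get(category, set())
    if not keywords:
        return 0.0
    words = set(document.lower().split())
    return len(keywords & words) / len(keywords)

confidence = keyword_classifier("tea", "Herbal tea is easy to brew")
```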


Methods for Automatic Classification

 Rule based
 Pre-defined set of rules
 Advantage
• incorporating prior knowledge
 Disadvantages:
• extreme reliance on man-made rules
• costly in terms of man-hours

 Linguistics
 Use of morphology, syntax and semantics
 Not multilingual; demands many training examples

 Machine Learning


What is Machine Learning

Machine Learning is the study of computer algorithms that automatically improve performance through “experience”


Sample for Machine Learning

DOGS CATS


Discriminating Features

Q1: Who is this person?
Q2: What are the most
discriminating features?



Answer:
 Lips

 Eyes



The “Margaret Thatcher effect”


Supervised Inductive Learning

 A process where:

 A learning algorithm is provided with a set of labeled
instances, positive and negative examples (a training
set)

 Using the training set, the learning algorithm generates a classifier

 The quality of the classifier is measured via its ability to
perform well on novel instances (a test set)
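
The process above can be sketched with a toy learner. This is an illustrative assumption, not the statistical method used by GammaWare: `train` simply counts how often each word appears in positive versus negative examples, and the example texts are made up. The sketch shows the three stages: a labeled training set, a generated classifier, and a quality check on a separate test set.

```python
from collections import Counter

def train(training_set):
    """Learn per-word weights from (text, label) pairs; label is True/False."""
    pos, neg = Counter(), Counter()
    for text, label in training_set:
        (pos if label else neg).update(text.lower().split())
    # Weight > 0 means the word was seen more often in positive examples.
    weights = {w: pos[w] - neg[w] for w in set(pos) | set(neg)}

    def classifier(text):
        score = sum(weights.get(w, 0) for w in text.lower().split())
        return score > 0

    return classifier

# Hypothetical labeled instances for the category "tea".
training_set = [
    ("green tea and herbal tea", True),
    ("brewing black tea", True),
    ("stock market report", False),
    ("bank loan rates", False),
]
# Novel instances, not seen during training.
test_set = [
    ("herbal tea shop", True),
    ("market rates today", False),
]

classifier = train(training_set)
correct = sum(classifier(text) == label for text, label in test_set)
print(f"{correct} of {len(test_set)} test instances classified correctly")
```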


Supervised Inductive Learning Example

[Figure: a training set and a test set; test instances are marked as errors or correct]

Evaluating a Classifier

Category Classifier


Recall and Precision

Use a confusion matrix to count:

                     True Label
                     Yes    No     Total
Classified  Good     70     50     120
            Bad      30     150    180
            Total    100    200    300

GY = classified Good, true label Yes; GN = classified Good, true label No;
BY = classified Bad, true label Yes; BN = classified Bad, true label No.

Precision (P) = GY / (GY + GN) = 70 / (70+50) = 0.58

Recall (R) = GY / (GY + BY) = 70 / (70+30) = 0.70

Accuracy (A) = (GY+BN)/(GY+GN+BY+BN) = 220 / 300 = 0.73

F-measure (F) = 2/(1/P + 1/R) = 2*GY/(GY+GN+GY+BY) = 2*70/(120+100) = 0.64
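
As a quick check of the arithmetic, the four measures can be computed directly from the confusion-matrix counts. The function name `evaluate` and its argument names are illustrative; the counts are the ones from the table on this slide.

```python
def evaluate(gy, gn, by, bn):
    """Compute metrics from confusion-matrix counts.

    gy: classified Good, truly Yes   gn: classified Good, truly No
    by: classified Bad, truly Yes    bn: classified Bad, truly No
    """
    precision = gy / (gy + gn)
    recall = gy / (gy + by)
    accuracy = (gy + bn) / (gy + gn + by + bn)
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_measure

p, r, a, f = evaluate(gy=70, gn=50, by=30, bn=150)
print(f"P={p:.2f} R={r:.2f} A={a:.2f} F={f:.2f}")
```

The harmonic-mean form used here is algebraically the same as 2*GY/(GY+GN+GY+BY), which is 140/220 for these counts.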


Supervised Statistical Machine Learning

 A Supervised Inductive Learning method that is based
on statistics obtained from the training set

 Benefits
 Generality and flexibility

• Successfully applied across a broad spectrum of
problems

 Multilingual

 Low labor costs


How to Classify documents

 Predefined fields (structured data)
 Author

 Title

 Date

 Content (unstructured data)
 From title, main text, emphasized text

 All words

 All 2-word sequences, all 3-word sequences, etc.

 Phrases, synonyms, etc.
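
The content features listed above (all words, all 2-word and 3-word sequences) can be sketched as word n-gram extraction. The function names `word_ngrams` and `extract_features` are hypothetical, and plain whitespace tokenization is an assumption made for brevity; the slides do not say how GammaWare tokenizes text.

```python
def word_ngrams(text, n):
    """Return all sequences of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def extract_features(text, max_n=3):
    """Collect all 1-word, 2-word, ... max_n-word sequences as features."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(word_ngrams(text, n))
    return features

features = extract_features("herbal tea specialist")
```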


GammaWare Work Flow

[Workflow diagram: Requirements, Design the Taxonomy, Seeding Process, Check Seed, Train Classifiers, Catalogue Documents, Improve Classifiers, Ready]

Requirements

 Initial parameters and decisions:
 Level of percolation - affects:
• Recall
• Precision
 Multi-label
• Maximum number of categories into which a document can be classified
 Types of training documents
• Full text, Keywords
• Different types per category
 List of stop words
• Common words in the language used, and also common words in the topic
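
Two of the parameters above, the stop-word list and the multi-label maximum, can be sketched as follows. All names and values here are assumptions for illustration: `STOP_WORDS` is a made-up list, and the `threshold` argument is a generic confidence cut-off, not a description of GammaWare's "level of percolation" setting.

```python
# Hypothetical stop-word list; a real one would also hold topic-specific words.
STOP_WORDS = {"the", "a", "of", "and", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words before the text is used for training or classification."""
    return [w for w in text.lower().split() if w not in stop_words]

def select_categories(confidences, max_labels=2, threshold=0.5):
    """Keep at most max_labels categories whose confidence meets the threshold."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return [cat for cat, conf in ranked if conf >= threshold][:max_labels]

tokens = remove_stop_words("The taste of herbal tea")
labels = select_categories(
    {"tea": 0.9, "health": 0.7, "food": 0.6, "finance": 0.1}
)
```

In a scheme like this, raising the cut-off generally trades recall for precision, and lowering it does the opposite.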


Taxonomy

 A Taxonomy is constructed according to:
 User/Business needs
• who will be using the taxonomy

 Data
• content of documents for classification

 Good taxonomy:
 requires critical attention to both the definition and
application of categories and their labels
 simple and intuitive

 How: Using the Expert Tool


Seeding process

 Seeding process: each category within the taxonomy
needs to be given a few examples of relevant
documents of the same type that the user seeks to
catalog
 An average of 3-6 relevant documents per category

 Seeds can either be “positive seeds” or “negative
seeds” for each category

 For better results, training documents should have a structure similar to that of the documents to be classified

 How: Using the Expert Tool


Check Seed

 Check seed: Classify the seeds into the taxonomy
 Output: An HTML page (browsed by the Expert Tool)
 For each category, shows the cataloging results for all the relevant seeds.
 Why: Helps in locating seeding problems:
 Seeds that are multi-labeled
 Problems in taxonomy structure
 How: Using the GammaWare Manager


Train Classifiers

 Train: Train classifiers for all categories

 Output: A classifier file (gcl extension) for
each category

 Why: The classifiers are used for
categorization.

 How: Using the GammaWare Manager


Classify Documents

 Categorization: Catalogue documents into a
Taxonomy

 Output: A table in a database

 Why: This is why we are here.

 How: Using the GammaWare Manager


Improve Classifiers

 Methods to improve classification results using the
Expert Tool.

 Re-design the taxonomy
 Seed problems
• More examples

• Add new seeds

• drag and drop documents from
classification view
• Negative “seeds”

 Modify Categorization and Train parameters


Hierarchical Categorization

 Goal: Classify a document into the
appropriate sub-topic(s) in the taxonomy

 Difficulties:
 Many sub-topics

 A document may fall into several sub-topics
 Classifiers are not perfect

 Must control “Recall” and “Precision”
according to the client’s needs


Hierarchical Categorization

 Divide and Conquer solution:
 Solve the problem Level by Level

 At each level decompose the problem into several smaller classification sub-problems

 Note: ignoring interactions between sub-problems can yield poor results

Patent Pending on Categorization
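
The level-by-level idea can be sketched as a recursive descent through the taxonomy. This is a generic illustration and not the patent-pending GammaWare method: the `TAXONOMY` and `KEYWORDS` tables, the `confidence` helper and the `threshold` value are all hypothetical. The sketch also treats each sub-problem independently, which is exactly the simplification the note above warns about.

```python
# Hypothetical two-level taxonomy: each node maps to its child topics.
TAXONOMY = {
    "root": ["health", "business"],
    "health": ["nutrition", "fitness"],
    "business": ["finance", "marketing"],
}

# Hypothetical keywords standing in for trained per-category classifiers.
KEYWORDS = {
    "health": {"tea", "diet", "exercise"},
    "business": {"market", "bank"},
    "nutrition": {"tea", "diet"},
    "fitness": {"exercise", "gym"},
    "finance": {"bank", "loan"},
    "marketing": {"market", "brand"},
}

def confidence(category, words):
    """Fraction of the category's keywords found among the document words."""
    keywords = KEYWORDS[category]
    return len(keywords & words) / len(keywords)

def categorize(document, node="root", threshold=0.3):
    """Descend the taxonomy level by level, following confident children."""
    words = set(document.lower().split())
    results = []
    for child in TAXONOMY.get(node, []):
        if confidence(child, words) >= threshold:
            deeper = categorize(document, child, threshold)
            # Keep the child itself if no sub-topic is confident enough.
            results.extend(deeper or [child])
    return results

topics = categorize("a diet based on herbal tea")
```

Because every confident branch is followed, a document can land in several sub-topics; for example, `categorize("bank exercise")` follows both the health and the business branch.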

Topic Specific Crawling

 Retrieve all documents that are relevant to a specific topic of interest

 Hyper-linked networks (Intranet, Internet)
 Two options:
• Crawl the network, then apply classification schemes to filter relevant documents.
• Use classification schemes during the crawl, teaching the crawler to imitate (intelligent) human surfing strategies.


Simple Crawling

 The Network is huge
 Storage
 Network
 Time
 Good for general-purpose search engines

[Diagram: a crawl beginning at a starting document]

Crawling: The process of retrieving documents from the net

Focused Crawling via Link Classifiers

 Analyze the context of the link

[Diagram: link "Herbal tea specialist" -> Link Classifier -> Retrieve the URL; link "My brother new born child" -> Link Classifier -> Link is irrelevant]

Link classifier: Decision according to the context of the link
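
A focused crawl driven by a link classifier can be sketched over a small in-memory link graph. Everything here is an illustrative assumption rather than GammaWare's crawler: the `LINKS` graph, the `TOPIC_WORDS` set standing in for a trained link classifier, and the use of anchor text alone as the link's "context".

```python
from collections import deque

# Hypothetical in-memory "network": page -> list of (anchor text, target page).
LINKS = {
    "start": [("Herbal tea specialist", "tea-shop"),
              ("My brother new born child", "family-photos")],
    "tea-shop": [("Green tea brewing guide", "brewing-guide")],
    "family-photos": [("Herbal tea recipes", "hidden-recipes")],
    "brewing-guide": [],
    "hidden-recipes": [],
}

# Stand-in for a trained link classifier.
TOPIC_WORDS = {"tea", "herbal", "brewing"}

def link_is_relevant(anchor_text):
    """Decide from the link's context (here: only its anchor text)."""
    return bool(TOPIC_WORDS & set(anchor_text.lower().split()))

def focused_crawl(start):
    """Breadth-first crawl that only follows links judged relevant."""
    visited, queue = [start], deque([start])
    while queue:
        page = queue.popleft()
        for anchor_text, target in LINKS.get(page, []):
            if target not in visited and link_is_relevant(anchor_text):
                visited.append(target)
                queue.append(target)
    return visited

pages = focused_crawl("start")
```

The crawl never fetches "family-photos", so it saves that retrieval but also never discovers "hidden-recipes" behind it. That trade-off between cost and coverage is inherent in pruning links before following them.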

Focused Crawler – The Learning Process

[Diagram: link "Herbal tea specialist" -> Link Classifier -> retrieve the content of the link -> Crawler Classifier -> send acknowledgment to the “link classifier” (learning process)]

Crawler Classifier: Checks if the document is good for crawling

Architecture - Basic

[Architecture diagram; labels include: GammaWare, GammaWare Proxy, Proxy Client, Client, Customer Software, API, CORBA, ODBC, GW File System, Relational Database, Web, File System, Outlook, Notes, Domino, Document Management System]

Multiple Servers

[Diagram: a client connected to several GammaWare Servers (Server, Server 2, Server 3, Server 4), with GammaWare Proxies and GammaWare Databases]

Scalability and Availability
