3. The challenge: Generate Structured Taxonomies of text repositories
[Figure labels: Internal DB, Information, Word, Application, Web Forms, XML, Services, Catalogues, Mail, Domino]
Generate a structured taxonomy of huge text repositories
5. What is a Taxonomy
Taxonomy
Taxis = arrangement or division
Nomos = law
The science of classification according to a predetermined system
Best-known use of taxonomy is in Biology
taxonomies of animals and plants
6. Web Taxonomy
Best-known use of taxonomies:
Web portals or Directories
Internet sites classified into hierarchical topics
General:
• Yahoo! http://www.yahoo.com/
• Open Directory http://www.dmoz.org/
• LookSmart http://www.looksmart.com/r?country=uk
Topical:
• Business.Com http://www.business.com/
• HealthWeb http://www.healthweb.org/
• Education Planet http://www.educationplanet.com/
8. Taxonomy vs. Thesaurus
Criteria   | Taxonomy                                                              | Thesaurus
Focus      | Documents and their organization                                      | Terms used in the organization
Usage      | Classification of documents (classified into categories/terms)       | Indexing documents (terms are attached to documents)
Retrieval  | Mainly browsing                                                       | Keyword queries
Size       | Restricted to the necessary terms                                     | Very large (terms may be added freely)
10. What is a Classifier
Concept (Topic, Subject):
An abstract or generic idea generalized from particular
instances [Merriam Webster]
Classifier:
A function on a concept (category) and on an object
(document)
Returns a number between 0 and 1 called confidence
rate
Confidence rate: measuring the confidence that the
object (document) belongs (should be classified) to the
concept (category)
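A minimal sketch of the definition above: a classifier as a function of a category and a document that returns a confidence rate between 0 and 1. The keyword lists and the scoring rule (fraction of a category's keywords found in the document) are assumptions made up for illustration, not the method used in these slides.

```python
from typing import Callable

# A classifier maps (category, document) to a confidence rate in [0, 1].
Classifier = Callable[[str, str], float]

# Hypothetical keyword lists, invented for this illustration only.
KEYWORDS = {
    "tea": {"tea", "herbal", "infusion"},
    "finance": {"bank", "loan", "stock"},
}


def keyword_classifier(category: str, document: str) -> float:
    """Return the fraction of the category's keywords found in the document."""
    keywords = KEYWORDS[category]
    words = set(document.lower().split())
    return len(keywords & words) / len(keywords)


tea_confidence = keyword_classifier("tea", "Herbal tea shop")
finance_confidence = keyword_classifier("finance", "Herbal tea shop")
print(tea_confidence, finance_confidence)
```

Any function with this signature can be plugged in wherever a classifier is needed; only the range of the returned value is fixed by the definition.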
11. Methods for Automatic Classification
Rule based
Pre-defined set of rules
Advantage
• incorporating prior knowledge
Disadvantages:
• extreme reliance on man-made rules
• costly in terms of man-hours
Linguistics
Use of morphology, syntax and semantics
Not multilingual; demands many training examples
Machine Learning
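To make the rule-based approach concrete, here is a small sketch in which each category is defined by one hand-written regular expression. The rules and category names are hypothetical; they show how prior knowledge is encoded directly, and why every new category costs manual work.

```python
import re
from typing import List

# Hypothetical hand-written rules: (category, regular expression).
RULES = [
    ("tea", re.compile(r"\b(herbal|green|black)\s+tea\b", re.IGNORECASE)),
    ("finance", re.compile(r"\b(loan|mortgage|interest rate)s?\b", re.IGNORECASE)),
]


def classify_by_rules(document: str) -> List[str]:
    """Return every category whose rule matches the document."""
    return [category for category, pattern in RULES if pattern.search(document)]


print(classify_by_rules("We sell herbal tea"))
print(classify_by_rules("Low mortgage rates"))
```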
12. What is Machine Learning
Machine Learning is the study of computer algorithms that automatically improve performance through “experience”
17. Supervised Inductive Learning
A process where:
A learning algorithm is provided with a set of labeled
instances, positive and negative examples (a training
set)
Using the training set the learning algorithm generates a
classifier
The quality of the classifier is measured via its ability to
perform well on novel instances (a test set)
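The process above can be sketched end to end with a toy learner. The learning rule here (counting how often each word appears in positive versus negative examples) is an assumption chosen for brevity, and the training and test documents are invented.

```python
from collections import Counter


def train(training_set):
    """Learn a classifier from (text, is_positive) pairs."""
    pos, neg = Counter(), Counter()
    for text, is_positive in training_set:
        (pos if is_positive else neg).update(set(text.lower().split()))

    def classifier(text):
        """Return a confidence in [0, 1] that the text is a positive instance."""
        words = set(text.lower().split())
        pos_hits = sum(pos[w] for w in words)
        neg_hits = sum(neg[w] for w in words)
        total = pos_hits + neg_hits
        return 0.5 if total == 0 else pos_hits / total

    return classifier


# Labeled instances: positive and negative examples (the training set).
training_set = [
    ("herbal tea for sale", True),
    ("green tea leaves", True),
    ("car loan rates", False),
    ("bank loan offer", False),
]
# Novel instances (the test set), never seen during training.
test_set = [("fresh herbal tea", True), ("cheap bank loan", False)]

classifier = train(training_set)
correct = sum((classifier(text) >= 0.5) == label for text, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)
```

The test set is kept separate from the training set so that the measured quality reflects performance on novel instances rather than memorization.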
20. Recall and Precision
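Use a confusion matrix to count (rows: classifier decision, columns: true label)

                    True Yes   True No   Total
Classified Good         70        50      120
Classified Bad          30       150      180
Total                  100       200      300

GY = classified Good, truly Yes (70); GN = classified Good, truly No (50)
BY = classified Bad, truly Yes (30); BN = classified Bad, truly No (150)
Precision (P) = GY / (GY + GN) = 70 / (70+50) = 0.58
Recall (R) = GY / (GY + BY) = 70 / (70+30) = 0.70
Accuracy (A) = (GY+BN)/(GY+GN+BY+BN) = 220 / 300 = 0.73
F-measure (F) = 2/(1/P + 1/R) = 2*GY/(GY+GN+GY+BY) = 2*70/(120+100) = 0.64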
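The four measures can be checked with a few lines of code. The function and parameter names are illustrative; the counts are the ones from the matrix on this slide, where Good/Bad is the classifier's decision and Yes/No is the true label.

```python
def metrics(gy, gn, by, bn):
    """Compute precision, recall, accuracy and F-measure from the four counts."""
    precision = gy / (gy + gn)             # share of "Good" decisions that are right
    recall = gy / (gy + by)                # share of true "Yes" items that were found
    accuracy = (gy + bn) / (gy + gn + by + bn)
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, accuracy, f_measure


p, r, a, f = metrics(gy=70, gn=50, by=30, bn=150)
print(f"P={p:.4f} R={r:.4f} A={a:.4f} F={f:.4f}")
```

The F-measure is exactly 140/220, about 0.636; because it is a harmonic mean, it always lies between precision and recall and is pulled toward the lower of the two.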
21. Supervised Statistical Machine Learning
A Supervised Inductive Learning method that is based
on statistics obtained from the training set
Benefits
Generality and flexibility
• Successfully applied across a broad spectrum of
problems
Multilingual
Low labor costs
22. How to Classify documents
Predefined fields (structured data)
Author
Title
Date
Content (unstructured data)
From title, main text, emphasized text
All words
All 2-word sequences, all 3-word sequences, etc.
Phrases, Synonyms, etc.
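As an illustration of turning unstructured content into features, the sketch below extracts all single words and all 2-word and 3-word sequences from a title and a main text, keeping track of which field each feature came from. The tokenization rule and the `field:term` naming are assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(title, body, max_n=3):
    """Collect 1- to max_n-word sequences from each field, tagged by field name."""
    features = []
    for field, text in (("title", title), ("body", body)):
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            features += [f"{field}:{gram}" for gram in ngrams(tokens, n)]
    return features


features = extract_features("Herbal tea", "Herbal tea calms nerves")
print(features)
```

Tagging each feature with its field lets a learner weigh a word in the title differently from the same word in the main text.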
24. GammaWare Work Flow
[Workflow diagram. Stages shown: Requirements, Design the Taxonomy, Seeding Process, Check Seed, Train Classifiers, Catalogue Documents, Improve Classifiers, Ready]
25. Requirements
Initial parameters and decisions:
Level of percolation - affects:
• Recall
• Precision
Multi-label
• Maximum number of categories into which a
document can be classified
Types of training documents
• Full text, Keywords
• Different types per category
List of Stop Words
• Common words in the language used and also
in the topic
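Two of these decisions can be illustrated in code: a stop-word list and a cap on the number of categories per document. The slides do not define "level of percolation", so the `threshold` parameter below is only a stand-in assumption for a setting that trades recall against precision; all names and values are hypothetical.

```python
# Hypothetical stop-word list: common words of the language plus, if needed,
# words so common in the topic that they do not discriminate between categories.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


def select_categories(confidences, threshold, max_labels):
    """Keep at most max_labels categories whose confidence reaches threshold."""
    accepted = [(c, s) for c, s in confidences.items() if s >= threshold]
    accepted.sort(key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in accepted[:max_labels]]


scores = {"tea": 0.9, "health": 0.6, "finance": 0.1}
single = select_categories(scores, threshold=0.5, max_labels=1)
multi = select_categories(scores, threshold=0.5, max_labels=3)
print(single, multi)
```

Raising the threshold generally accepts fewer documents per category, which tends to raise precision and lower recall; lowering it does the opposite.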
26. Taxonomy
A Taxonomy is constructed according to:
User/business needs
• who will be using the taxonomy
Data
• content of documents for classification
Good taxonomy:
requires critical attention to both the definition and
application of categories and their labels
simple and intuitive
How: Using the Expert Tool
27. Seeding process
Seeding process: each category within the taxonomy
needs to be given a few examples of relevant
documents of the same type that the user seeks to
catalog
An average of 3-6 relevant documents per category
Seeds can either be “positive seeds” or “negative
seeds” for each category
For better results, training documents should have a
structure similar to that of the documents to be classified
How: Using the Expert Tool
28. Check Seed
Check seed: Classify the seeds into the taxonomy
Output: An HTML page (browsed by the Expert Tool)
For each category, it shows the cataloging results for all the relevant seeds.
Why: Helps in locating seeding problems:
Seeds that are multi-labeled
Problems in taxonomy structure
How: Using the GammaWare Manager
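The idea of the seed check can be sketched as follows: classify every seed against every category and report seeds that do not land in exactly the category they were seeded for. This is not the GammaWare Manager's implementation; the keyword classifier, the threshold and the seed texts are hypothetical.

```python
# Hypothetical keyword lists standing in for trained classifiers.
KEYWORDS = {"tea": {"tea", "herbal"}, "health": {"health", "herbal"}}


def toy_classify(category, text):
    """Confidence = fraction of the category's keywords present in the text."""
    words = set(text.lower().split())
    return len(KEYWORDS[category] & words) / len(KEYWORDS[category])


def check_seeds(seeds, classify, threshold):
    """Return (seed, seeded_category, assigned_categories) for problem seeds.

    A seed is a problem if it is assigned to several categories, to none,
    or to a category other than the one it was seeded for.
    """
    problems = []
    categories = list(seeds)
    for seeded_category, documents in seeds.items():
        for doc in documents:
            assigned = [c for c in categories if classify(c, doc) >= threshold]
            if assigned != [seeded_category]:
                problems.append((doc, seeded_category, assigned))
    return problems


seeds = {
    "tea": ["green herbal tea", "herbal tea for health"],
    "health": ["herbal health care"],
}
problems = check_seeds(seeds, toy_classify, threshold=0.6)
print(problems)
```

Here the second tea seed is reported because it is multi-labeled; such a report can point either to a poorly chosen seed or to two categories that overlap in the taxonomy.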
29. Train Classifiers
Train: Train classifiers for all categories
Output: A classifier file (gcl extension) for
each category
Why: The classifiers are used for
categorization.
How: Using the GammaWare Manager
30. Classify Documents
Categorization: Catalogue documents into a
Taxonomy
Output: A table in a database
Why: This is why we are here.
How: Using the GammaWare Manager
31. Improve Classifiers
Methods to improve classification results using the
Expert Tool.
Re-design the taxonomy
Seed problems
• More examples
• Add new seeds
• drag and drop documents from
classification view
• Negative “seeds”
Modify Categorization and Train parameters
33. Hierarchical Categorization
Goal: Classify a document into the
appropriate sub-topic(s) in the taxonomy
Difficulties:
Many sub-topics
A document may fall into several sub-topics
Classifiers are not perfect
Must control “Recall” and “Precision”
according to the client’s needs
34. Hierarchical Categorization
Divide and Conquer solution:
Solve the problem Level by Level
At each level decompose the problem into
several, smaller sized classification sub-problems
Note: ignoring interactions between sub-problems
can yield poor results
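A minimal sketch of the level-by-level idea: at each node, run one small classification problem per child and descend only into the children that accept the document. This naive version treats every sub-problem independently, which is exactly the simplification the note above warns about; it is not the patent-pending method, and the taxonomy, keywords and threshold are invented.

```python
# Hypothetical taxonomy as nested dicts: category -> sub-categories.
TAXONOMY = {
    "beverages": {"tea": {}, "coffee": {}},
    "finance": {"loans": {}},
}

# Hypothetical keyword lists standing in for trained per-category classifiers.
KEYWORDS = {
    "beverages": {"tea", "coffee", "drink"},
    "tea": {"tea"},
    "coffee": {"coffee"},
    "finance": {"loan", "bank"},
    "loans": {"loan"},
}


def toy_classify(category, document):
    """Confidence 1.0 if any keyword of the category occurs, else 0.0."""
    words = set(document.lower().split())
    return 1.0 if KEYWORDS[category] & words else 0.0


def classify_top_down(document, node, classify, threshold=0.5, path=()):
    """Descend level by level; return paths of the deepest accepted categories."""
    results = []
    for child, subtree in node.items():
        if classify(child, document) >= threshold:
            child_path = path + (child,)
            deeper = classify_top_down(document, subtree, classify, threshold, child_path)
            # If no sub-topic accepts the document, it stays at this level.
            results.extend(deeper or [child_path])
    return results


paths = classify_top_down("herbal tea and coffee drink", TAXONOMY, toy_classify)
print(paths)
```

Because a rejection at an upper level prunes the whole subtree, an error near the root cannot be repaired further down, which is one way independent sub-problems can yield poor results.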
Patent Pending on Categorization
36. Topic Specific Crawling
Retrieve all documents that
are relevant to a specific
topic of interest
Hyper-linked networks (Intranet, Internet)
Two options:
• Crawl the network. Then apply classification
schemes to filter relevant documents.
• Using classification schemes crawl the
network while teaching the crawler to
imitate (intelligent) human surfing strategies
37. Simple Crawling
The Network is huge
[Figure labels: Storage, Network, Time, Starting Document]
Good for general-purpose
search engines
Crawling: The process of retrieving documents from the net
38. Focused Crawling via Link Classifiers
Analyze the context of the link
Link text "Herbal tea specialist" -> Link Classifier -> Retrieve the URL
Link text "My brother new born child" -> Link Classifier -> Link is irrelevant
Link classifier: Decision according to the context of the link
39. Focused Crawler – The Learning Process
Link text "Herbal tea specialist" -> Link Classifier -> Retrieve the content of the link -> Crawler Classifier
Learning Process: Send acknowledgment to the “link classifier”
Crawler Classifier: Checks if the document is good for crawling
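The loop described on the last two slides can be sketched on a toy in-memory network. Everything here is an assumption made for illustration: the pages, the keyword test that stands in for the crawler classifier, and the simple word-weight update that stands in for the learning process. The sketch only shows the flow: score links by their text, retrieve the most promising one first, then feed the verdict on the retrieved document back to the link classifier.

```python
import heapq
from collections import defaultdict

# Hypothetical network: url -> (page text, [(link text, target url)]).
WEB = {
    "start": ("tea portal", [("herbal tea specialist", "a"),
                             ("my brother new born child", "b")]),
    "a": ("herbal tea shop and tea recipes", [("tea blends", "c")]),
    "b": ("family photos", []),
    "c": ("green tea blends", []),
}


def document_is_relevant(text):
    """Stand-in for the crawler classifier: a simple keyword test."""
    return "tea" in text.split()


def focused_crawl(start, budget):
    """Fetch at most `budget` pages, following the best-scoring links first."""
    weights = defaultdict(float)  # link classifier state: word -> learned weight

    def link_score(link_text):
        words = link_text.split()
        if not words:
            return 0.0
        return sum(weights[w] for w in words) / len(words)

    frontier = [(0.0, start, "")]  # (negated score, url, link text that led here)
    seen = {start}
    relevant = []
    fetched = 0
    while frontier and fetched < budget:
        _, url, link_text = heapq.heappop(frontier)
        text, links = WEB[url]
        fetched += 1
        good = document_is_relevant(text)
        if good:
            relevant.append(url)
        # Acknowledgment: reward or penalize the words of the link that led here.
        for word in link_text.split():
            weights[word] += 1.0 if good else -1.0
        for anchor, target in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-link_score(anchor), target, anchor))
    return relevant


crawled = focused_crawl("start", budget=3)
print(crawled)
```

With a budget of three fetches, the link "tea blends" outranks the unrelated link once the word "tea" has been rewarded, so the irrelevant page is never retrieved. A simple crawler with the same budget would have no basis for preferring one link over the other.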