GammaWare Technology

            June 2002

        Yiftach Ravid, VP R&D
            GammaSite Inc.

       yiftach@GammaSite.com


1
Overview

    - The challenge

        - Taxonomies

         - Classification

          - Categorization

          - Focused Crawler

         - Q&A


2
The challenge: Generate Structured Taxonomies of text repositories

[Figure labels: Internal DB, Word, Web, Forms, XML, Catalogues, Mail, Domino, Information, Application, Services]

Generate a structured taxonomy of huge text repositories

3
Taxonomy




4
What is a Taxonomy

         Taxonomy
            Taxis       = arrangement or division
            Nomos       = law

         The science of classification according to a
          predetermined system

         Best-known use of taxonomy is in Biology
            taxonomies of animals and plants




5
Web Taxonomy

         Best-known use of taxonomies:
            Web portals or Directories

            Internet sites classified into hierarchical topics



               General:
                • Yahoo! http://www.yahoo.com/

                • Open Directory http://www.dmoz.org/

                • LookSmart http://www.looksmart.com/r?country=uk

              Topical:
                • Business.Com http://www.business.com/

                • HealthWeb http://www.healthweb.org/

                • Education Planet http://www.educationplanet.com/




6
Taxonomy - Sample




7
Taxonomy vs. Thesaurus


    Criteria    Taxonomy                             Thesaurus
    Focus       Documents and their organization     Terms used in the organization
    Usage       Classification of documents:         Indexing of documents:
                documents are classified into        terms are attached to documents
                categories/terms
    Retrieval   Mainly browsing                      Keyword queries
    Size        Restricted to the necessary terms    Very large (terms may be added
                                                     freely)




8
Classification




9
What is a Classifier

       Concept (Topic, Subject):
        An abstract or generic idea generalized from particular
          instances [Merriam Webster]

       Classifier:
        A function on a concept (category) and on an object
          (document)
        Returns a number between 0 and 1 called confidence
          rate
        Confidence rate: measuring the confidence that the
          object (document) belongs (should be classified) to the
          concept (category)
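
A minimal Python sketch of the definition above: a function of a category and a document that returns a confidence rate between 0 and 1. The keyword-overlap scoring rule and the names `confidence` and `tea` are illustrative assumptions; the slides do not say how GammaWare computes the confidence rate.

```python
def confidence(category_keywords: set[str], document: str) -> float:
    """Toy classifier: fraction of the category's keywords found in the document.

    The keyword-overlap rule is an assumption made for illustration only.
    """
    if not category_keywords:
        return 0.0
    words = set(document.lower().split())
    hits = len(category_keywords & words)
    return hits / len(category_keywords)  # always between 0 and 1


# Hypothetical category described by four keywords
tea = {"tea", "herbal", "infusion", "brew"}
score = confidence(tea, "Herbal tea is a gentle infusion")
```

Here three of the four keywords appear in the document, so the confidence rate is 3/4.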



10
Methods for Automatic Classification

          Rule based
             Pre-defined set of rules
             Advantage
                 • incorporating prior knowledge
             Disadvantages:
                 • extreme reliance on man-made rules
                 • costly in terms of man-hours


          Linguistics
             Use of morphology, syntax and semantics
             Not multilingual; demands many training examples


          Machine Learning


11
What is Machine Learning


        Machine Learning is the study of
            computer algorithms that
             automatically improve
              performance through
                  “experience”




12
Sample for Machine Learning




         DOGS                      CATS


13
Discriminating Features


       Q1: Who is this person?
       Q2: What are the most
         discriminating features?




14
Discriminating Features


       Answer:
          Lips

          Eyes




15
Discriminating Features



     The “Margaret Thatcher effect”




16
Supervised Inductive Learning

          A process where:

          A learning algorithm is provided with a set of labeled
           instances, positive and negative examples (a training
           set)

          Using the training set, the learning algorithm generates a
           classifier

          The quality of the classifier is measured via its ability to
           perform well on novel instances (a test set)
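
The process above can be sketched in a few lines of Python, reusing the dogs/cats theme from the earlier slide. The word-weight learner, the example sentences and the function names are assumptions chosen for brevity; they stand in for whatever learning algorithm is actually used.

```python
from collections import Counter


def train(examples):
    """Learn per-word weights from (text, label) pairs; label True = positive."""
    weights = Counter()
    for text, label in examples:
        for word in text.lower().split():
            weights[word] += 1 if label else -1
    return weights


def classify(weights, text):
    """Predict positive when the summed word weights are above zero."""
    return sum(weights[w] for w in text.lower().split()) > 0


# Labeled instances: True = "dogs" (positive), False = "cats" (negative)
training_set = [
    ("dogs bark and fetch", True),
    ("the dog chased the ball", True),
    ("cats purr and nap", False),
    ("the cat climbed the tree", False),
]
# Novel instances, never seen during training
test_set = [
    ("my dog likes to fetch", True),
    ("a cat will purr", False),
]

weights = train(training_set)
correct = sum(classify(weights, text) == label for text, label in test_set)
test_accuracy = correct / len(test_set)
```

Quality is measured only on `test_set`; scoring the classifier on its own training examples would overstate how well it generalizes.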




17
Supervised Inductive Learning Example


     [Figure: a Training set and a Test set; test results are marked as errors or correct]




18
Evaluating a Classifier


       Category            Classifier




19
Recall and Precision

     Use a confusion matrix to count outcomes:

                                True Label
                                Yes      No       Total
     Classified   Good (G)       70       50      120
                  Bad (B)        30      150      180
     Total                      100      200      300

     GY = classified Good, true Yes; GN = Good, No; BY = Bad, Yes; BN = Bad, No

     Precision (P) = GY / (GY + GN) = 70 / (70+50) = 0.58

     Recall (R) = GY / (GY + BY) = 70 / (70+30) = 0.70

     Accuracy (A) = (GY+BN)/(GY+GN+BY+BN) = 220 / 300 = 0.73

     F-measure (F) = 2/(1/P + 1/R) = 2*GY/(GY+GN+GY+BY) = 2*70/(120+100) = 0.64
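
As a check on the arithmetic, the four measures can be computed directly from the cell counts. This is a small sketch; the function name `metrics` is an assumption, and the arguments follow the slide's cell names, with BN standing for documents classified Bad whose true label is No.

```python
def metrics(gy: int, gn: int, by: int, bn: int) -> dict[str, float]:
    """Compute the slide's measures from the four confusion-matrix cells."""
    precision = gy / (gy + gn)                  # share of "Good" decisions that are right
    recall = gy / (gy + by)                     # share of true "Yes" documents found
    accuracy = (gy + bn) / (gy + gn + by + bn)  # share of all decisions that are right
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


m = metrics(gy=70, gn=50, by=30, bn=150)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

The harmonic-mean form used in the code is algebraically the same as 2/(1/P + 1/R).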


20
Supervised Statistical Machine Learning

          A Supervised Inductive Learning method that is based
           on statistics obtained from the training set

          Benefits
             Generality and flexibility

                • Successfully applied across a broad spectrum of
                    problems

               Multilingual

               Low labor costs




21
How to Classify documents

          Predefined fields (structured data)
             Author

             Title

             Date



          Content (unstructured data)
             From title, main text, emphasized text

             All words

             All 2-word and 3-word sequences, etc.

             Phrases, synonyms, etc.
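
One way to read "all words, all 2 words, all 3 words" is as word n-grams taken from the text. The sketch below assumes that reading and simple whitespace tokenization; the function names are illustrative, and phrase and synonym handling are left out.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return all runs of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def features(text: str, max_n: int = 3) -> list[str]:
    """Collect single words, 2-word and 3-word sequences as candidate features."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(text, n))
    return feats


feats = features("herbal tea specialist")
# feats == ["herbal", "tea", "specialist",
#           "herbal tea", "tea specialist",
#           "herbal tea specialist"]
```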




22
Getting Started




23
GammaWare Work Flow

         Requirements -> Design the Taxonomy -> Seeding Process -> Check Seed
           -> Train Classifiers -> Catalogue Documents -> Improve Classifiers -> Ready



24
Requirements

          Initial parameters and decisions:
             Level of percolation - affects:
                   • Recall
                   • Precision
             Multi label
                   • Maximum number of categories into which a
                     document can be classified
             Types of training documents
                   • Full text, Keywords
                   • Different types per category
             List of Stop Words
                   • Common words in the language used and in
                     the topic
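
The sketch below shows how such parameters might interact. It assumes the "level of percolation" behaves like a confidence threshold, which the slides do not define, and all names and values are hypothetical. Raising the threshold tends to favor precision; lowering it tends to favor recall.

```python
def assign_labels(scores: dict[str, float], threshold: float, max_labels: int) -> list[str]:
    """Keep categories scoring at or above the threshold, best first, capped at max_labels."""
    passing = [(cat, s) for cat, s in scores.items() if s >= threshold]
    passing.sort(key=lambda pair: pair[1], reverse=True)
    return [cat for cat, _ in passing[:max_labels]]


def remove_stop_words(text: str, stop_words: set[str]) -> list[str]:
    """Drop words that are too common, in the language or the topic, to discriminate."""
    return [w for w in text.lower().split() if w not in stop_words]


# Hypothetical confidence rates for one document
scores = {"Tea": 0.9, "Coffee": 0.6, "Cooking": 0.4, "Sports": 0.1}

strict = assign_labels(scores, threshold=0.8, max_labels=2)  # fewer, safer labels
loose = assign_labels(scores, threshold=0.3, max_labels=2)   # more labels, capped at two
kept = remove_stop_words("The tea is in the pot", {"the", "is", "in"})
```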



25
Taxonomy


         A Taxonomy is constructed according to:
              User and business needs
                • who will be using the taxonomy

              Data
                • content of documents for classification



         Good taxonomy:
            requires critical attention to both the definition and
             application of categories and their labels
            simple and intuitive



         How: Using the Expert Tool


26
Seeding process

          Seeding process: each category within the taxonomy
           needs to be given a few examples of relevant
           documents of the same type that the user seeks to
           catalog
             An average of 3-6 relevant documents per category

             Seeds can either be “positive seeds” or “negative
               seeds” for each category

          For better results, training documents should have a structure
           similar to that of the documents to be classified

          How: Using the Expert Tool



27
Check Seed

    Check seed: Classify the seeds
     into the taxonomy
    Output: An HTML page (browsed
     by the Expert tool)
        For each category shows the
          cataloging results for all the
          relevant seeds.
    Why: Help in locating seeding
     problems:
        Seeds that are multi labeled
        Problems in taxonomy
          structure
    How: Using the GammaWare
     Manager



28
Train Classifiers


          Train: Train classifiers for all categories

          Output: A classifier file (gcl extension) for
           each category

          Why: The classifiers are used for
           categorization.

          How: Using the GammaWare Manager



29
Classify Documents


          Categorization: Catalogue documents into a
           Taxonomy

          Output: A table in a database

          Why: This is why we are here.

          How: Using the GammaWare Manager




30
Improve Classifiers

          Methods to improve classification results using the
           Expert Tool.

               Re-design the taxonomy
               Seed problems
                 • More examples

                 • Add new seeds

                      • drag and drop documents from
                        classification view
                 • Negative “seeds”



               Modify Categorization and Train parameters



31
Categorization




32
Hierarchical Categorization


                 Goal: Classify a document into the
                  appropriate sub-topic(s) in the taxonomy

                 Difficulties:
                    Many sub-topics

                    A document may fall into several sub-topics
                    Classifiers are not perfect

                    Must control “Recall” and “Precision”
                     according to the client’s needs

33
Hierarchical Categorization


                      Divide and Conquer solution:
                         Solve the problem Level by Level

                         At each level decompose the problem into
                           several smaller classification sub-problems

                           Note: ignoring interactions between
                            sub-problems can yield poor results
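
A naive level-by-level sketch in Python follows. It is not the patent-pending GammaWare method: the taxonomy, the keyword-overlap classifier and the threshold are assumptions, and the sketch treats each sub-problem independently, which is exactly the simplification the note above warns about.

```python
def confidence(keywords: set[str], words: set[str]) -> float:
    """Toy per-category classifier: share of the category's keywords present."""
    return len(keywords & words) / len(keywords)


# Hypothetical two-level taxonomy; each node has its own keyword classifier
TAXONOMY = {
    "Drinks": {
        "keywords": {"drink", "cup", "brew"},
        "children": {
            "Tea": {"keywords": {"tea", "herbal"}, "children": {}},
            "Coffee": {"keywords": {"coffee", "espresso"}, "children": {}},
        },
    },
    "Pets": {
        "keywords": {"pet", "animal"},
        "children": {
            "Dogs": {"keywords": {"dog", "bark"}, "children": {}},
            "Cats": {"keywords": {"cat", "purr"}, "children": {}},
        },
    },
}


def categorize(text, level=TAXONOMY, threshold=0.5, path=()):
    """Descend only into branches whose classifier accepts the document."""
    words = set(text.lower().split())
    results = []
    for name, node in level.items():
        if confidence(node["keywords"], words) < threshold:
            continue  # prune this whole subtree
        here = path + (name,)
        deeper = categorize(text, node["children"], threshold, here)
        # Report the deepest accepted sub-topics, or this node if none matched
        results.extend(deeper if deeper else [here])
    return results


paths = categorize("brew a cup of herbal tea")
```

Because a document can pass several classifiers at one level, the function returns a list of paths, which covers the multi-label case. The `threshold` parameter is the knob that trades recall against precision.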




Patent Pending on Categorization

34
Focused Crawler




35
Topic Specific Crawling

              Retrieve all documents that
               are relevant to a specific
               topic of interest

                  Hyper-linked networks (Intranet, Internet)
                  Two options:
                    • Crawl the network. Then apply classification
                      schemes to filter relevant documents.
                    • Using classification schemes crawl the
                      network while teaching the crawler to
                      imitate (intelligent) human surfing strategies


36
Simple Crawling




       [Figure: crawl spreading out from a Starting Document]

       The Network is huge
          Storage
          Network
          Time
       Good for general-purpose search engines

       Crawling: The process of retrieving documents from the net

37
Focused Crawling via Link Classifiers

     Analyze the context of the link

     "Herbal tea specialist"      -> Link Classifier -> Retrieve the URL
     "My brother's newborn child" -> Link Classifier -> Link is irrelevant

     Link classifier: Decision according to the context of the link

38
Focused Crawler – The Learning Process


     "Herbal tea specialist" -> Link Classifier -> Retrieve the content of the link
     Crawler Classifier -> Send acknowledgment to the "link classifier" (Learning Process)

     Crawler Classifier: Checks if the document is good for crawling

39
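
The interplay of the two classifiers can be sketched over a small in-memory network. Everything here is an assumption for illustration: the `WEB` dictionary, the single topic word, and the add-or-subtract weight update are hypothetical stand-ins for real page fetching and for whatever learning scheme the product uses.

```python
from collections import Counter, deque

# Hypothetical network: url -> (page text, [(link text, target url)])
WEB = {
    "start": ("tea links page", [
        ("herbal tea specialist", "tea-shop"),
        ("my brother's newborn child", "family"),
        ("green tea guide", "tea-guide"),
    ]),
    "tea-shop": ("we sell herbal tea and tea pots", [("tea brewing tips", "tea-tips")]),
    "tea-guide": ("a guide to green tea", []),
    "tea-tips": ("how to brew tea", []),
    "family": ("photos of the baby", [("baby photos", "photos")]),
    "photos": ("more baby photos", []),
}
TOPIC_WORDS = {"tea"}


def is_relevant(text: str) -> bool:
    """Crawler classifier (toy): on topic if the document mentions a topic word."""
    return bool(TOPIC_WORDS & set(text.lower().split()))


def focused_crawl(start: str):
    link_weights = Counter({word: 1 for word in TOPIC_WORDS})  # link classifier state
    queue = deque([(start, [])])
    fetched, relevant = {start}, []
    while queue:
        url, link_words = queue.popleft()
        text, links = WEB[url]                  # "retrieve" the document
        good = is_relevant(text)
        for word in link_words:                 # acknowledgment to the link classifier
            link_weights[word] += 1 if good else -1
        if good:
            relevant.append(url)
        for link_text, target in links:
            words = link_text.lower().split()
            score = sum(link_weights[w] for w in words)
            if score > 0 and target not in fetched:  # link classifier decision
                fetched.add(target)
                queue.append((target, words))
    return relevant, fetched


relevant, fetched = focused_crawl("start")
```

In this toy run the "newborn child" link scores zero, so the `family` page and everything behind it is never retrieved. That saving in storage, network and time is what distinguishes focused crawling from crawling everything and filtering afterwards.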
GammaWare API




40
Architecture - Basic

 [Figure: architecture diagram. Labels: Customer Client, GammaWare API, Proxy Client,
  GammaWare Proxy, CORBA, GammaWare Software, GW File System, ODBC, Relational Database.
  Document sources: File System, Relational Database, Outlook, Notes Domino,
  Document Management, Web]



41
Multiple Servers

 [Figure: multiple-server diagram. Labels: Client, GammaWare Proxy (two),
  GammaWare Server, GammaWare Server 2, GammaWare Server 3, GammaWare Server 4,
  Database (two)]

 Scalability and Availability

42
Q&A




43
