Fuzzy Combinations of Criteria: An Application to
     Web Page Representation for Clustering

  Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez

     NLP & IR Group, Distance Learning University (UNED)

                CICLing 2012, New Delhi, India

                        March 15, 2012
Motivation             Understanding the system                  Improving the Combination                   Summary




      Table of Contents

       1     Motivation
              Web Page Representation
              Linear Combination of Criteria
              Nonlinear Combination of Criteria
       2     Understanding the system
              Experimental Settings
              Dimension Reduction Analysis
              Study of Individual Criteria
       3     Improving the Combination
       4     Summary

                       Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering
                                                                                                                  2 / 30
      Motivation




      Main goal
      To understand how to represent web pages for clustering.

      Question
      How to combine different page features to represent web pages?




      Web Page Representation




      Hypothesis
      A good document representation should be based on how humans
      read documents.




      Different Criteria for Web Page Representation

      Criteria: Title, Emphasis
      Word positions: Preferential, Standard




      Linear Combination of Criteria

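      For example: Analytical Combination of Criteria (acc)¹.
      Importance of a term k in a document, where t_k, e_k, f_k and p_k are its
      title, emphasis, frequency and position values and i_t, i_e, i_f and i_p
      are the corresponding weights:

                                       I_k = t_k i_t + e_k i_e + f_k i_f + p_k i_p                                  (1)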

                      I_k = 1 \cdot 0.4 + 0.6 \cdot 0.3 + 0 \cdot 0.2 + 0 \cdot 0.1 = 0.58                          (2)


      Drawback
      The importance of a term in one component is calculated regardless of
      the other components.

      ¹ V. Fresno and A. Ribeiro. An analytical approach to concept extraction in HTML environments. J. Intell. Inf.
      Syst., 22(3):215–235, 2004.
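
      The linear combination above can be sketched in a few lines of Python. This is
      only an illustration: the function and dictionary names are hypothetical, and
      the weights are assumed to be the ones in equation (2), read in the order
      title, emphasis, frequency, position.

```python
# Hypothetical names; weights taken from equation (2), assumed to be listed
# in the order title, emphasis, frequency, position.
ACC_WEIGHTS = {"title": 0.4, "emphasis": 0.3, "frequency": 0.2, "position": 0.1}


def acc_importance(criteria, weights=ACC_WEIGHTS):
    """Weighted sum of the per-criterion values of one term (equation 1)."""
    return sum(weights[name] * criteria[name] for name in weights)


# The term used in equation (2): in the title, partly emphasized.
term = {"title": 1.0, "emphasis": 0.6, "frequency": 0.0, "position": 0.0}
importance = acc_importance(term)

# A term that appears only in the title still receives the full title weight,
# because each criterion contributes independently of the others.
title_only = acc_importance(
    {"title": 1.0, "emphasis": 0.0, "frequency": 0.0, "position": 0.0}
)

print(round(importance, 2), round(title_only, 2))
```

      The second call shows the drawback in miniature: the title contribution is
      added no matter what the other criteria say about the term.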
      Example: acc
      Example of a rhetorical title
      “Call to Arms” is the title of a page that contains an article
      about the new trades made by the New York Yankees baseball team
      and how these trades affect the Boston Red Sox, their main rival
      in Major League Baseball.

      Drawback
      Title terms are not related to the document topic.



      Nonlinear Combination of Criteria




                  Fuzzy Combination of Criteria (fcc)² allows nonlinear
                  combinations of criteria.
                  Conditions on different criteria can be related to each other.
                  It produces vectors within the vector space model (VSM).




      ² A. Ribeiro, V. Fresno, M. C. Garcia-Alegre, and D. Guinea. A fuzzy system for the web page representation.
      2003.
      Example: fcc
      Example of a rhetorical title
      Now we can express that a term should appear in the title and
      be emphasized to be considered important.

      Nonlinearity
      Title terms can be considered unimportant when they do not
      appear in the rest of the text.




      A quick glance at fcc




             Close to natural language.
             Knowledge base: defined by a set of IF-THEN rules.
             Rules are based on how humans read documents.
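
      A single IF-THEN rule of this kind might be evaluated roughly as follows. This
      is a sketch under stated assumptions, not the fcc knowledge base: the linear
      membership function, the use of min as the fuzzy AND, and all names are
      choices made here for illustration only.

```python
def high(value):
    """Degree to which a criterion value in [0, 1] is 'High'.

    Assumption: a simple linear ramp clipped to [0, 1]; the real membership
    functions are not given in the slides.
    """
    return min(1.0, max(0.0, value))


def rule_title_and_emphasis(title, emphasis):
    """IF title is High AND emphasis is High THEN importance is High.

    Assumption: AND is modelled with min, a common fuzzy conjunction.
    Returns the degree to which the rule fires.
    """
    return min(high(title), high(emphasis))


# A rhetorical title term: in the title, never emphasized in the text.
rhetorical = rule_title_and_emphasis(title=1.0, emphasis=0.0)

# A topical title term: in the title and strongly emphasized.
topical = rule_title_and_emphasis(title=1.0, emphasis=0.8)

print(rhetorical, topical)
```

      Unlike a weighted sum, the rule does not fire at all for the rhetorical term,
      because the title condition is tied to the emphasis condition.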




      Basic Clustering Settings



             We remove stopwords and punctuation, and strip suffixes with
             Porter’s algorithm.
             Clustering: Cluto-rbr with default parameters.
             Web page representations: tf-idf and fcc.
             Dimension reduction techniques (100, 500, 1000, 2000 and
             5000 features): mft and lsi.
             Datasets: Banksearch and Webkb.
             Evaluation: F-measure of clustering quality.
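
      The slides do not say which F-measure variant is used. One common definition
      for clustering takes, for each reference class, the best F-score over all
      clusters and averages these by class size. The sketch below assumes that
      variant; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Class-size-weighted average of the best per-class F-score.

    labels:   reference class of each document
    clusters: cluster assigned to each document (same order)
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


perfect = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
mixed = clustering_f_measure(["a", "a", "b", "b"], [0, 1, 0, 1])
print(perfect, mixed)
```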



      Dimension Reduction Analysis




      Hypothesis
      If lsi improves on mft, then the weighting function is not able to find
      the most representative terms.




                                  Representation      Avg.       S.D.
                                  Banksearch
                                  tf-idf mft         0.748       0.028
                                  tf-idf lsi         0.756       0.005
                                  fcc mft            0.756       0.019
                                  fcc lsi            0.769       0.011
                                  Webkb
                                  tf-idf mft         0.460       0.051
                                  tf-idf lsi         0.507       0.006
                                  fcc mft            0.469       0.009
                                  fcc lsi            0.466       0.011



      Conclusion
      The weighting function is not working as well as it could.
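
      To make the comparison behind this conclusion explicit, the averages from the
      table can be subtracted pairwise. The snippet is only a convenience for
      reading the table; the variable names are hypothetical and the numbers are
      copied from the Avg. column.

```python
# Average scores from the table, keyed by (dataset, representation).
AVG = {
    ("Banksearch", "tf-idf"): {"mft": 0.748, "lsi": 0.756},
    ("Banksearch", "fcc"): {"mft": 0.756, "lsi": 0.769},
    ("Webkb", "tf-idf"): {"mft": 0.460, "lsi": 0.507},
    ("Webkb", "fcc"): {"mft": 0.469, "lsi": 0.466},
}

# Positive gain: lsi scores higher than mft for that dataset/representation.
lsi_gain = {key: round(row["lsi"] - row["mft"], 3) for key, row in AVG.items()}

for (dataset, representation), gain in lsi_gain.items():
    print(f"{dataset:10s} {representation:7s} lsi - mft = {gain:+.3f}")
```

      lsi scores higher than mft in three of the four rows; the exception is fcc on
      Webkb, where the two are nearly equal.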




      Conclusion
      Results for fcc on the Webkb dataset are surprisingly bad.




      Results for Criteria Analysis


                   Rep. \ Dim.        100        500        1000       2000       5000
                   Banksearch
                   fcc mft            0.723      0.757      0.768      0.765      0.768
                   title              0.626      0.646      0.632      0.634      0.639
                   emphasis           0.586      0.671      0.674      0.685      0.693
                   frequency          0.689      0.715      0.720      0.724      0.731
                   position           0.310      0.525      0.538      0.599      0.608


             For Banksearch, fcc always gets higher values than the individual
             criteria, so the combination works better in all cases.
             Frequency seems to be the best among the individual criteria.



      Results for Criteria Analysis

                    Rep. \ Dim.      100         500        1000       2000        5000
                    Webkb
                    fcc mft          0.453       0.472      0.475       0.468       0.475
                    title            0.432       0.433      0.404       0.488       0.479
                    emphasis         0.415       0.431      0.433       0.465       0.489
                    frequency        0.441       0.460      0.460       0.468       0.446
                    position         0.301       0.283      0.317       0.281       0.286


             For Webkb, fcc does not always outperform the others.
             Frequency is not always the best among the individual criteria.
             When title and emphasis could lead to a better clustering, the
             combination gets worse.
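
      The observations for both datasets can be checked directly against the two
      criteria tables. The sketch below only restates the reported numbers and
      picks the highest-scoring representation at each dimension; the names
      RESULTS and best_per_dimension are hypothetical.

```python
DIMS = [100, 500, 1000, 2000, 5000]

# Values copied from the two criteria-analysis tables.
RESULTS = {
    "Banksearch": {
        "fcc mft": [0.723, 0.757, 0.768, 0.765, 0.768],
        "title": [0.626, 0.646, 0.632, 0.634, 0.639],
        "emphasis": [0.586, 0.671, 0.674, 0.685, 0.693],
        "frequency": [0.689, 0.715, 0.720, 0.724, 0.731],
        "position": [0.310, 0.525, 0.538, 0.599, 0.608],
    },
    "Webkb": {
        "fcc mft": [0.453, 0.472, 0.475, 0.468, 0.475],
        "title": [0.432, 0.433, 0.404, 0.488, 0.479],
        "emphasis": [0.415, 0.431, 0.433, 0.465, 0.489],
        "frequency": [0.441, 0.460, 0.460, 0.468, 0.446],
        "position": [0.301, 0.283, 0.317, 0.281, 0.286],
    },
}


def best_per_dimension(table):
    """Name of the highest-scoring representation at each dimension."""
    return [max(table, key=lambda rep: table[rep][i]) for i in range(len(DIMS))]


winners = {dataset: best_per_dimension(table) for dataset, table in RESULTS.items()}

for dataset, names in winners.items():
    print(dataset, dict(zip(DIMS, names)))
```

      On Banksearch the combination wins at every dimension; on Webkb it is
      overtaken by title at 2000 features and by emphasis at 5000.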


      Improving the Combination




             Frequency should influence the decision more than position.

        IF   Title   AND    Frequency      AND      Emphasis      AND      Position       THEN       Importance
             Low            Medium                  Low                    Preferential   ⇒          Low
             Low            Medium                  Low                    Standard       ⇒          No








      Extended Fuzzy Combination of Criteria (efcc)



        IF   Title   AND    Frequency      AND      Emphasis      AND      Position       THEN       Importance
             High                                   High                                  ⇒          Very High
             High                                   Medium                 Preferential   ⇒          High
             High                                   Medium                 Standard       ⇒          Medium
             High                                   Low                    Preferential   ⇒          Medium
             High                                   Low                    Standard       ⇒          Low
             Low                                    High                   Preferential   ⇒          High
             Low                                    High                   Standard       ⇒          Medium
             Low                                    Medium                 Preferential   ⇒          Medium
             Low                                    Medium                 Standard       ⇒          Low
             Low                                    Low                    Preferential   ⇒          Low
             Low                                    Low                    Standard       ⇒          No
                            High                                                          ⇒          Very High
                            Medium                                                        ⇒          Medium
                            Low                                                           ⇒          No
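
      The rule base above can be sketched in code. This is a minimal illustration, not the authors' implementation: the slide does not state the AND operator, how rules with the same consequent are aggregated, or the membership functions, so the sketch assumes min for AND, max for aggregation, and takes membership degrees as given inputs. The names RULES, CRITERIA and fire_rules are hypothetical.

```python
# Rule base transcribed from the efcc slide. None marks a blank cell,
# i.e. a criterion the rule does not look at.
RULES = [
    # (title, frequency, emphasis, position) -> importance
    (("High", None, "High", None), "Very High"),
    (("High", None, "Medium", "Preferential"), "High"),
    (("High", None, "Medium", "Standard"), "Medium"),
    (("High", None, "Low", "Preferential"), "Medium"),
    (("High", None, "Low", "Standard"), "Low"),
    (("Low", None, "High", "Preferential"), "High"),
    (("Low", None, "High", "Standard"), "Medium"),
    (("Low", None, "Medium", "Preferential"), "Medium"),
    (("Low", None, "Medium", "Standard"), "Low"),
    (("Low", None, "Low", "Preferential"), "Low"),
    (("Low", None, "Low", "Standard"), "No"),
    ((None, "High", None, None), "Very High"),
    ((None, "Medium", None, None), "Medium"),
    ((None, "Low", None, None), "No"),
]

CRITERIA = ("title", "frequency", "emphasis", "position")


def fire_rules(memberships):
    """Return the activation degree of every importance label.

    memberships maps each criterion to {label: degree in [0, 1]}.
    Assumptions (not stated on the slide): AND is min, and rules that
    share a consequent are aggregated with max.
    """
    activation = {}
    for antecedent, consequent in RULES:
        degrees = [
            memberships[criterion].get(label, 0.0)
            for criterion, label in zip(CRITERIA, antecedent)
            if label is not None
        ]
        strength = min(degrees)  # AND over the criteria the rule uses
        activation[consequent] = max(activation.get(consequent, 0.0), strength)
    return activation


# A term with low title and emphasis scores, medium frequency, standard position.
example = fire_rules({
    "title": {"Low": 1.0},
    "frequency": {"Medium": 1.0},
    "emphasis": {"Low": 1.0},
    "position": {"Standard": 1.0},
})
print(example)
```

      In this example the frequency-only rule activates Medium importance regardless of position, while the title/emphasis/position rule activates No. How the two activations are turned into a single weight (defuzzification) is not shown on the slide and is left out of the sketch.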








      System Comparison


      With efcc, both reduction methods get similar results.

                                 Rep.                Avg.       S.D.
                                 Banksearch
                                 tf-idf lsi         0,756       0,005
                                 fcc lsi            0,769       0,011
                                 efcc mft           0,760       0,014
                                 efcc lsi           0,758       0,013
                                 Webkb
                                 tf-idf lsi         0,507       0,006
                                 fcc mft            0,469       0,009
                                 efcc mft           0,532       0,032
                                 efcc lsi           0,483       0,000






      efcc solves the problems of fcc in Webkb.





      Summary

             We present a term weighting function based on how humans
             read documents.
             The representation is not tied to any specific collection of
             web pages.
             Nonlinear systems help express relations among criteria.
             With a good term weighting function it is possible to use
             lightweight dimension reduction techniques.
             Our system tries to ease the communication between technical
             and linguistic experts.
             Anchor texts were also studied as a way of adding contextual
             information.





      Thank You!




             Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering
                                                                                                        30 / 30

Weitere ähnliche Inhalte

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Kürzlich hochgeladen (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Empfohlen

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering - Cicling12

  • 1. Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering Alberto P´rez Garc´ e ıa-Plaza, V´ ıctor Fresno, Raquel Mart´ ınez NLP & IR Group, Distance Learning University (UNED) CICLing 2012, New Delhi, India March 15, 2012
  • 2. Motivation Understanding the system Improving the Combination Summary Table of Contents 1 Motivation Web Page Representation Linear Combination of Criteria Nonlinear Combination of Criteria 2 Understanding the system Experimental Settings Dimension Reduction Analysis Study of Individual Criteria 3 Improving the Combination 4 Summary Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering 2 / 30
  • 3. Motivation Understanding the system Improving the Combination Summary Motivation Main goal To understand how to represent web pages for clustering. Question How to combine different page features to represent web pages? Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering 3 / 30
  • 4. Motivation Understanding the system Improving the Combination Summary Motivation Main goal To understand how to represent web pages for clustering. Question How to combine different page features to represent web pages? Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering 3 / 30
  • 5. Motivation Understanding the system Improving the Combination Summary Table of Contents 1 Motivation Web Page Representation Linear Combination of Criteria Nonlinear Combination of Criteria 2 Understanding the system Experimental Settings Dimension Reduction Analysis Study of Individual Criteria 3 Improving the Combination 4 Summary Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering 4 / 30
  • 6. Web Page Representation. Hypothesis: a good document representation should be based on how humans read documents.
  • 7. Different Criteria for Web Page Representation. Criteria: Title, Emphasis. Word positions: Preferential, Standard.
  • 14. Linear Combination of Criteria. For example, the Analytical Combination of Criteria (acc) [1] computes the importance of a term in a document as:
      $I_k = t_k i_t + e_k i_e + f_k i_f + p_k i_p$  (1)
      $I_k = 1 \cdot 0.4 + 0.6 \cdot 0.3 + 0 \cdot 0.2 + 0 \cdot 0.1 = 0.58$  (2)
    Drawback: the importance of a term in a component is calculated regardless of the rest of the components.
    [1] V. Fresno and A. Ribeiro. An analytical approach to concept extraction in HTML environments. J. Intell. Inf. Syst., 22(3):215–235, 2004.
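As a rough illustration of the linear combination in Eq. (1), the sketch below computes a term's importance as a weighted sum of its four criterion scores. The function name `acc_importance`, the dictionary layout, and the use of 0.4, 0.3, 0.2 and 0.1 as weights (borrowed from the numeric example on the slide) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical weights for title, emphasis, frequency and position,
# borrowed from the numeric example on the slide.
WEIGHTS = {"title": 0.4, "emphasis": 0.3, "frequency": 0.2, "position": 0.1}


def acc_importance(scores, weights=WEIGHTS):
    """Weighted sum of per-criterion scores (each assumed to lie in [0, 1])."""
    return sum(weights[c] * scores.get(c, 0.0) for c in weights)


# A term that appears only in the title is scored from the title alone:
# the other criteria cannot lower its importance.
title_only = {"title": 1.0}
print(acc_importance(title_only))
```

Because each criterion contributes independently, a title term keeps its title contribution no matter how the term behaves in the rest of the page, which is the drawback the slide points out.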
  • 16. Example: acc. Example of a rhetorical title: “Call to Arms” is the title of a page that contains an article about the new trades made by the New York Yankees baseball team and how these trades affect the Boston Red Sox, their main rival in Major League Baseball. Drawback: title terms are not related to the document topic.
  • 19. Nonlinear Combination of Criteria. The Fuzzy Combination of Criteria (fcc) [2] allows nonlinear combinations of criteria. It is possible to define related conditions. It produces vectors within the vector space model (VSM).
    [2] A. Ribeiro, V. Fresno, M. C. Garcia-Alegre, and D. Guinea. A fuzzy system for the web page representation. 2003.
  • 22. Example: fcc. Example of a rhetorical title: now we can express that a term should appear in the title and be emphasized to be considered important. Nonlinearity: title terms can be considered unimportant because they do not appear in the rest of the text.
  • 24. A quick glance at fcc. Close to natural language. Knowledge base: defined by a set of IF-THEN rules. Rules are based on how humans read documents.
  • 29. Basic Clustering Settings. Preprocessing: we remove stopwords, punctuation and suffixes (Porter’s algorithm). Clustering: Cluto-rbr with default parameters. Web page representations: tf-idf and fcc. Dimension reduction techniques (100, 500, 1000, 2000 and 5000 features): mft and lsi. Datasets: Banksearch and Webkb. Evaluation: F-measure to evaluate clustering quality.
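The slides do not say which variant of the F-measure is used. As a hedged sketch, the code below implements one common definition for clustering: each gold class is matched with the cluster that gives it the best F-score, and the per-class scores are averaged with class-size weights. The function name `clustering_f_measure` and the input layout (two parallel lists of labels) are assumptions made for illustration.

```python
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Class-size-weighted best-match F-measure (one common definition)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(labels, clusters))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each class by its share of the documents.
        total += (n_cls / n) * best
    return total


gold = ["a", "a", "b", "b"]
print(clustering_f_measure(gold, [0, 0, 1, 1]))  # clusters match the classes
print(clustering_f_measure(gold, [0, 0, 0, 0]))  # everything in one cluster
```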
  • 36. Dimension Reduction Analysis. Hypothesis: if lsi improves on mft, then the weighting function is not able to find the most representative terms.
  • 37. Dimension Reduction Analysis.
      Dataset     Rep.        Avg.   S.D.
      Banksearch  tf-idf mft  0.748  0.028
      Banksearch  tf-idf lsi  0.756  0.005
      Banksearch  fcc mft     0.756  0.019
      Banksearch  fcc lsi     0.769  0.011
      Webkb       tf-idf mft  0.460  0.051
      Webkb       tf-idf lsi  0.507  0.006
      Webkb       fcc mft     0.469  0.009
      Webkb       fcc lsi     0.466  0.011
    Conclusions: the weighting function is not working as well as it could. Results for fcc on the Webkb dataset are surprisingly bad.
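To relate the table to the hypothesis about lsi and mft, the following sketch simply subtracts the reported mft average from the reported lsi average for each dataset and representation. The numbers are copied from the table; the dictionary layout and the helper name `lsi_gain` are illustrative assumptions.

```python
# Reported averages from the table (dataset, representation) -> reduction -> Avg.
RESULTS = {
    ("Banksearch", "tf-idf"): {"mft": 0.748, "lsi": 0.756},
    ("Banksearch", "fcc"): {"mft": 0.756, "lsi": 0.769},
    ("Webkb", "tf-idf"): {"mft": 0.460, "lsi": 0.507},
    ("Webkb", "fcc"): {"mft": 0.469, "lsi": 0.466},
}


def lsi_gain(results):
    """Difference between the lsi and mft averages, rounded to 3 decimals."""
    return {key: round(avg["lsi"] - avg["mft"], 3) for key, avg in results.items()}


for (dataset, rep), gain in lsi_gain(RESULTS).items():
    print(f"{dataset:10s} {rep:7s} lsi - mft = {gain:+.3f}")
```

A positive difference means lsi improved on mft for that row, which under the stated hypothesis suggests the weighting function did not rank the most representative terms first.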
  • 40. Results for Criteria Analysis (Banksearch).
      Rep. \ Dim.  100    500    1000   2000   5000
      fcc mft      0.723  0.757  0.768  0.765  0.768
      title        0.626  0.646  0.632  0.634  0.639
      emphasis     0.586  0.671  0.674  0.685  0.693
      frequency    0.689  0.715  0.720  0.724  0.731
      position     0.310  0.525  0.538  0.599  0.608
    For Banksearch, fcc always gets higher values than the individual criteria, so the combination works better in all cases. Frequency seems to be the best among the individual criteria.
  • 41. Results for Criteria Analysis (Webkb).
      Rep. \ Dim.  100    500    1000   2000   5000
      fcc mft      0.453  0.472  0.475  0.468  0.475
      title        0.432  0.433  0.404  0.488  0.479
      emphasis     0.415  0.431  0.433  0.465  0.489
      frequency    0.441  0.460  0.460  0.468  0.446
      position     0.301  0.283  0.317  0.281  0.286
    For Webkb, fcc does not always outperform the others. Frequency is not always the best among the individual criteria. When title and emphasis could lead to a better clustering, the combination gets worse.
  • 43. Improving the Combination. Frequency should influence the decision more than position.
      IF Title  AND Frequency  AND Emphasis  AND Position      THEN Importance
         Low        Medium         Low           Preferential   ⇒   Low
         Low        Medium         Low           Standard       ⇒   No
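The slides do not describe the inference mechanism behind these IF-THEN rules. As a minimal sketch, assuming the common choice of `min` for AND and `max` to aggregate rules that share a consequent, the two rules above could be evaluated as follows. The membership degrees, the variable names and the helper `fire` are hypothetical and only serve the illustration.

```python
# The two rules from the slide: antecedent labels -> importance label.
RULES = [
    ({"title": "low", "frequency": "medium", "emphasis": "low",
      "position": "preferential"}, "low"),
    ({"title": "low", "frequency": "medium", "emphasis": "low",
      "position": "standard"}, "no"),
]


def fire(rules, memberships):
    """Activation per output label: min over the AND-ed antecedents,
    max over rules that share the same consequent (assumed operators)."""
    activations = {}
    for antecedent, label in rules:
        strength = min(memberships[var].get(value, 0.0)
                       for var, value in antecedent.items())
        activations[label] = max(activations.get(label, 0.0), strength)
    return activations


# Hypothetical membership degrees for one term.
memberships = {
    "title": {"low": 1.0, "high": 0.0},
    "frequency": {"low": 0.2, "medium": 0.8, "high": 0.0},
    "emphasis": {"low": 0.9, "medium": 0.1, "high": 0.0},
    "position": {"preferential": 0.7, "standard": 0.3},
}
print(fire(RULES, memberships))  # {'low': 0.7, 'no': 0.3}
```

With these made-up degrees, the only antecedent that differs between the two rules is position, so position alone decides whether the term ends up with Low or No importance.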
  • 44. Extended Fuzzy Combination of Criteria (efcc). Rules of the form IF Title AND Frequency AND Emphasis AND Position THEN Importance:
      High, High ⇒ Very High
      High, Medium, Preferential ⇒ High
      High, Medium, Standard ⇒ Medium
      High, Low, Preferential ⇒ Medium
      High, Low, Standard ⇒ Low
      Low, High, Preferential ⇒ High
      Low, High, Standard ⇒ Medium
      Low, Medium, Preferential ⇒ Medium
      Low, Medium, Standard ⇒ Low
      Low, Low, Preferential ⇒ Low
      Low, Low, Standard ⇒ No

      High ⇒ Very High
      Medium ⇒ Medium
      Low ⇒ No
  • 45. System Comparison. With efcc, both reduction methods get similar results. efcc solves the problems of fcc in Webkb.
      Dataset     Rep.        Avg.   S.D.
      Banksearch  tf-idf lsi  0.756  0.005
      Banksearch  fcc lsi     0.769  0.011
      Banksearch  efcc mft    0.760  0.014
      Banksearch  efcc lsi    0.758  0.013
      Webkb       tf-idf lsi  0.507  0.006
      Webkb       fcc mft     0.469  0.009
      Webkb       efcc mft    0.532  0.032
      Webkb       efcc lsi    0.483  0.000
  • 48. Summary. We present a term weighting function based on how humans read documents. The representation is not oriented to concrete sets of web pages. Nonlinear systems help express relations among criteria. With a good term weighting function, it is possible to use lightweight dimension reduction techniques. Our system tries to ease the communication between technical and linguistic experts. Anchor texts were also studied as a way of adding contextual information.
  • 54. Thank You!