Introduction                      Clustering                 Alignment




               Doctoral Seminar: Multi-document clustering
                             and alignment

                               Wim De Smet


                              March 23, 2007



                                Current goals




       CLASS, WP7
          1. Cluster documents according to topics.
          2. Align text and video



                                     Goal




       Given news stories about different events, drawn from several
       sources, cluster the stories that report the same event.



                                  Clustering



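       Typical clustering algorithms use a bag-of-words approach.
       Document-by-word matrix:
       \[
       A = \begin{pmatrix}
       0.5 & 0.5 & 0.5 & 0   & 0   & 0   \\
       0.4 & 0.6 & 0.5 & 0   & 0   & 0   \\
       0.5 & 0.4 & 0.6 & 0   & 0   & 0   \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0.4 & 0.4 & 0   & 0.4 & 0.4 & 0.4 \\
       0.4 & 0.4 & 0.4 & 0   & 0.4 & 0.4
       \end{pmatrix}
       \]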
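A minimal sketch of how a document-by-word matrix can be built under the bag-of-words assumption. The toy documents, the whitespace tokenizer, and the use of relative term frequencies as entries are assumptions made for illustration; the slides do not say how the entries of the example matrix were obtained.

```python
from collections import Counter


def document_word_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for an empty document
        # Word order is discarded: only relative frequencies are kept.
        matrix.append([counts[word] / total for word in vocabulary])
    return vocabulary, matrix


# Hypothetical toy documents, not taken from the slides.
docs = ["storm hits coast", "coast storm warning", "election results announced"]
vocabulary, A = document_word_matrix(docs)
```

Each row of `A` describes one document and each column one word, which is the layout the rest of the deck assumes.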



                                    Clustering
       Document clustering according to word-similarity:

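       \[
       A = \begin{pmatrix}
       0.5 & 0.5 & 0.5 & 0   & 0   & 0   \\
       0.4 & 0.6 & 0.5 & 0   & 0   & 0   \\
       0.5 & 0.4 & 0.6 & 0   & 0   & 0   \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0.4 & 0.4 & 0   & 0.4 & 0.4 & 0.4 \\
       0.4 & 0.4 & 0.4 & 0   & 0.4 & 0.4
       \end{pmatrix}
       \]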



                                    Clustering
       Word clustering according to document-similarity:

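       \[
       A = \begin{pmatrix}
       0.5 & 0.5 & 0.5 & 0   & 0   & 0   \\
       0.4 & 0.6 & 0.5 & 0   & 0   & 0   \\
       0.5 & 0.4 & 0.6 & 0   & 0   & 0   \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0.4 & 0.4 & 0   & 0.4 & 0.4 & 0.4 \\
       0.4 & 0.4 & 0.4 & 0   & 0.4 & 0.4
       \end{pmatrix}
       \]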
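The two views can be illustrated on the example matrix: documents are compared through their rows, words through their columns. This is a sketch that assumes cosine similarity as the measure; the slides do not name one.

```python
import math

# Example document-by-word matrix from the slides (7 documents, 6 words).
A = [
    [0.5, 0.5, 0.5, 0, 0, 0],
    [0.4, 0.6, 0.5, 0, 0, 0],
    [0.5, 0.4, 0.6, 0, 0, 0],
    [0, 0, 0, 0.5, 0.5, 0.5],
    [0, 0, 0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0, 0.4, 0.4],
]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def column(matrix, j):
    """Column j of a row-major matrix, i.e. the document profile of word j."""
    return [row[j] for row in matrix]


# Document similarity: compare rows (word profiles of documents).
doc_sim_same_block = cosine(A[0], A[1])
doc_sim_other_block = cosine(A[0], A[3])

# Word similarity: compare columns (document profiles of words).
word_sim_same_block = cosine(column(A, 0), column(A, 1))
word_sim_other_block = cosine(column(A, 0), column(A, 3))
```

Documents 1 and 2 share all their words and score close to 1, while documents 1 and 4 share none and score 0; the same pattern holds for words 1 and 2 versus words 1 and 4.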



                              Co-clustering
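       Purpose: cluster words and documents simultaneously,
       preserving the information found in both clusterings.
       \[
       A = \begin{pmatrix}
       0.5 & 0.5 & 0.5 & 0   & 0   & 0   \\
       0.4 & 0.6 & 0.5 & 0   & 0   & 0   \\
       0.5 & 0.4 & 0.6 & 0   & 0   & 0   \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0.4 & 0.4 & 0   & 0.4 & 0.4 & 0.4 \\
       0.4 & 0.4 & 0.4 & 0   & 0.4 & 0.4
       \end{pmatrix}
       \]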



                               Co-clustering
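       Co-clustering summarizes the 7 × 6 matrix A as a 3 × 2 matrix
       of document clusters by word clusters:
       \[
       A = \begin{pmatrix}
       0.5 & 0.5 & 0.5 & 0   & 0   & 0   \\
       0.4 & 0.6 & 0.5 & 0   & 0   & 0   \\
       0.5 & 0.4 & 0.6 & 0   & 0   & 0   \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0   & 0   & 0   & 0.5 & 0.5 & 0.5 \\
       0.4 & 0.4 & 0   & 0.4 & 0.4 & 0.4 \\
       0.4 & 0.4 & 0.4 & 0   & 0.4 & 0.4
       \end{pmatrix}
       \quad\longrightarrow\quad
       \begin{pmatrix}
       0.5 & 0   \\
       0   & 0.5 \\
       0.4 & 0.4
       \end{pmatrix}
       \]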
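One way to arrive at the small matrix shown on this slide is to group rows and columns and take one representative value per block. The sketch below assumes the grouping (documents 1-3, 4-5, 6-7; words 1-3, 4-6) and uses the block median as the representative. Both are assumptions read off the example; the slides do not state how the reduced matrix was derived.

```python
from statistics import median

# Example document-by-word matrix from the slides.
A = [
    [0.5, 0.5, 0.5, 0, 0, 0],
    [0.4, 0.6, 0.5, 0, 0, 0],
    [0.5, 0.4, 0.6, 0, 0, 0],
    [0, 0, 0, 0.5, 0.5, 0.5],
    [0, 0, 0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0, 0.4, 0.4],
]

# Assumed co-clusters (0-based indices), read off the block structure of A.
doc_clusters = [[0, 1, 2], [3, 4], [5, 6]]
word_clusters = [[0, 1, 2], [3, 4, 5]]


def reduce_matrix(matrix, row_groups, col_groups):
    """Summarize each (row group, column group) block by its median entry."""
    return [
        [median([matrix[r][c] for r in rows for c in cols]) for cols in col_groups]
        for rows in row_groups
    ]


reduced = reduce_matrix(A, doc_clusters, word_clusters)
```

With these assumptions `reduced` matches the 3 × 2 matrix on the slide. Using the block mean instead would give about 0.33 rather than 0.4 in the last row.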



                         Hierarchical Co-clustering




       Hierarchical co-clustering:
          1. Co-cluster documents and words.
          2. For each cluster: if it contains too many documents, compute
             the sub-matrix of its documents and words.
          3. Repeat step 1 on the sub-matrix.
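The three steps amount to a recursion, sketched below. The `co_cluster` argument is a hypothetical, caller-supplied function returning `(row_indices, column_indices)` pairs, `max_docs` stands for "too many documents", and `split_in_halves` is only a stand-in used to exercise the recursion; none of these names come from the slides.

```python
def hierarchical_co_cluster(matrix, doc_ids, word_ids, co_cluster, max_docs):
    """Build a tree of co-clusters, splitting any cluster with more than max_docs documents."""
    node = {"docs": list(doc_ids), "words": list(word_ids), "children": []}
    if len(doc_ids) <= max_docs:
        return node  # small enough: leaf cluster
    # Step 1: co-cluster documents and words (empty clusters are ignored).
    clusters = [(rows, cols) for rows, cols in co_cluster(matrix) if rows and cols]
    if len(clusters) < 2:
        return node  # no real split was found, stop here
    for rows, cols in clusters:
        # Step 2: sub-matrix restricted to this cluster's documents and words.
        sub_matrix = [[matrix[r][c] for c in cols] for r in rows]
        # Step 3: repeat on the sub-matrix.
        child = hierarchical_co_cluster(
            sub_matrix,
            [doc_ids[r] for r in rows],
            [word_ids[c] for c in cols],
            co_cluster,
            max_docs,
        )
        node["children"].append(child)
    return node


def split_in_halves(matrix):
    """Stand-in co-clusterer: first half of rows/columns versus second half."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [
        (list(range(n_rows // 2)), list(range(n_cols // 2))),
        (list(range(n_rows // 2, n_rows)), list(range(n_cols // 2, n_cols))),
    ]


toy = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
tree = hierarchical_co_cluster(
    toy, ["d0", "d1", "d2", "d3"], ["w0", "w1", "w2", "w3"], split_in_halves, max_docs=2
)
```

The recursion assumes that `co_cluster` partitions the rows, so every child holds strictly fewer documents than its parent and the procedure terminates.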



          Bipartite Spectral Graph Partitioning: motivation

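       View the document-by-word matrix as a bipartite graph: documents
       and words are the two vertex sets, and each nonzero entry ai,j
       is a weighted edge between document i and word j.

                         word1   word2   word3
           document1      a1,1     0       0
       A = document2       0      a2,2    a2,3
           document3      a3,2    a3,3     0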



          Bipartite Spectral Graph Partitioning: motivation
       How can the graph be divided into document clusters D_m and
       associated word clusters W_m?



          Bipartite Spectral Graph Partitioning: motivation
                                                                          
                                                                          
       \[
       W_m = \Bigl\{ w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij},\ \forall l = 1, \dots, k \Bigr\}
       \]



          Bipartite Spectral Graph Partitioning: motivation
                                                                           
                                                                           
       \[
       D_m = \Bigl\{ d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij},\ \forall l = 1, \dots, k \Bigr\}
       \]
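The definition of W_m can be read as an assignment rule: given document clusters, each word goes to the cluster whose documents give it the most total weight. The sketch below applies that rule to the example matrix from the earlier slides; the three document clusters are an assumption chosen for illustration, and ties are broken in favor of the lowest cluster index.

```python
# Example document-by-word matrix from the earlier slides.
A = [
    [0.5, 0.5, 0.5, 0, 0, 0],
    [0.4, 0.6, 0.5, 0, 0, 0],
    [0.5, 0.4, 0.6, 0, 0, 0],
    [0, 0, 0, 0.5, 0.5, 0.5],
    [0, 0, 0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0, 0.4, 0.4],
]

# Assumed document clusters D_1, D_2, D_3 (0-based document indices).
doc_clusters = [[0, 1, 2], [3, 4], [5, 6]]


def assign_words(matrix, doc_clusters):
    """For each word j, return the index m maximizing sum_{i in D_m} A[i][j]."""
    assignment = []
    for j in range(len(matrix[0])):
        weights = [sum(matrix[i][j] for i in docs) for docs in doc_clusters]
        assignment.append(max(range(len(weights)), key=weights.__getitem__))
    return assignment


word_assignment = assign_words(A, doc_clusters)
```

Here the first three words join the first document cluster and the last three join the second; the third document cluster attracts no words because its weight is spread over both word groups. The definition of D_m is the same rule with the roles of rows and columns exchanged.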



           Bipartite Spectral Graph Partitioning: algorithm

          1. Given the $m \times n$ document-by-word matrix $A$, calculate
             the diagonal helper matrices $D_1$ and $D_2$ such that:

             \[ \forall\, 1 \le i \le m : D_1(i, i) = \sum_j A_{ij} \]
             \[ \forall\, 1 \le j \le n : D_2(j, j) = \sum_i A_{ij} \]

          2. Compute $A_n = D_1^{-1/2} A D_2^{-1/2}$.
          3. Take the SVD of $A_n$: $A_n = U \Lambda V^{*}$.
          4. Determine $k$, the number of clusters, by the eigengap:
             \[ k = \arg\max_{m \ge i > 1} \frac{\lambda_{i-1} - \lambda_i}{\lambda_{i-1}} \]
             where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$ are the
             singular values of $A_n$.
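Steps 1 to 4 can be sketched with NumPy as follows. The function names are hypothetical, and the eigengap rule is implemented literally as written on the slide (it returns the 1-based index i of the lower singular value in the pair with the largest relative gap); other formulations count the values before the gap, which gives i - 1. The sketch assumes every document and every word has a positive row or column sum, and that the singular values passed to `eigengap_k` are positive.

```python
import numpy as np


def normalized_svd(A):
    """Steps 1-3: scale A by its row and column sums, then take the SVD."""
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)  # diagonal of D1: row sums (one per document)
    d2 = A.sum(axis=0)  # diagonal of D2: column sums (one per word)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]  # D1^(-1/2) A D2^(-1/2)
    U, singular_values, Vt = np.linalg.svd(An, full_matrices=False)
    return d1, d2, U, singular_values, Vt


def eigengap_k(singular_values):
    """Step 4: index i (1-based) maximizing (s[i-1] - s[i]) / s[i-1]."""
    s = np.asarray(singular_values, dtype=float)
    gaps = (s[:-1] - s[1:]) / s[:-1]  # gaps[j] compares 1-based values j+1 and j+2
    return int(np.argmax(gaps)) + 2


# Example document-by-word matrix from the earlier slides.
A = [
    [0.5, 0.5, 0.5, 0, 0, 0],
    [0.4, 0.6, 0.5, 0, 0, 0],
    [0.5, 0.4, 0.6, 0, 0, 0],
    [0, 0, 0, 0.5, 0.5, 0.5],
    [0, 0, 0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0, 0.4, 0.4],
]
d1, d2, U, s, Vt = normalized_svd(A)
k = eigengap_k(s)
```

For a nonnegative matrix scaled this way the largest singular value is 1, which is why the later steps skip the first singular vectors.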



   Bipartite Spectral Graph Partitioning: algorithm (cont.)


          5. From $U$ and $V$, form $U_{[2,\dots,l+1]}$ and $V_{[2,\dots,l+1]}$
             respectively, by taking columns 2 to $l + 1$,
             where $l = \lceil \log_2 k \rceil$.
          6. Compute
             \[ Z = \begin{pmatrix} D_1^{-1/2}\, U_{[2,\dots,l+1]} \\ D_2^{-1/2}\, V_{[2,\dots,l+1]} \end{pmatrix} \]
             and normalize the rows of $Z$.
          7. Apply k-means to cluster the rows of $Z$ into $k$ clusters.
          8. For each cluster, check the number of documents. If it is
             higher than a given threshold, construct a new
             document-by-word matrix from the documents and
             words in the cluster, and return to step 1.
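Steps 5 to 7 can be sketched with NumPy as below. The function names, the small Lloyd-style k-means with farthest-point initialization, and the toy matrix are all assumptions made for illustration; any k-means implementation could be substituted. The first m rows of Z correspond to documents and the remaining n rows to words, so one clustering labels both.

```python
import math
import numpy as np


def co_cluster_embedding(A, k):
    """Steps 5-6: stack scaled singular vectors 2..l+1 and normalize the rows."""
    A = np.asarray(A, dtype=float)
    d1, d2 = A.sum(axis=1), A.sum(axis=0)
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    l = max(1, math.ceil(math.log2(k)))
    # Columns 2..l+1 in the slide's 1-based numbering are 1..l here.
    Zu = U[:, 1:l + 1] / np.sqrt(d1)[:, None]
    Zv = Vt.T[:, 1:l + 1] / np.sqrt(d2)[:, None]
    Z = np.vstack([Zu, Zv])
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return Z / norms


def kmeans(X, k, iters=50):
    """Step 7: plain Lloyd iterations with deterministic farthest-point seeding."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels


# Hypothetical toy matrix: two document/word blocks with weak cross links.
toy = [[1, 1, 0.1, 0], [1, 1, 0, 0.1], [0.1, 0, 1, 1], [0, 0.1, 1, 1]]
labels = kmeans(co_cluster_embedding(toy, k=2), k=2)
doc_labels, word_labels = labels[:4], labels[4:]
```

On this toy matrix documents 1-2 and words 1-2 receive one label and documents 3-4 and words 3-4 the other. Step 8 would then recurse into any cluster whose document count exceeds the threshold.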



                   Uses of a hierarchical co-clustering




           • Documents are clustered according to a topic hierarchy
           • The words associated with a cluster describe its topic
           • Words can be used for offline clustering



                  Entries of document-by-word matrix




          1. TF-IDF
          2. WP 2’s Salience
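For the first option, a minimal sketch of one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency). The slides do not say which variant was used, so the formula and the toy documents are assumptions. Salience is not sketched because the slides give no definition of it.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {word: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each word occurs.
    document_frequency = Counter(w for doc in tokenized_docs for w in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / document_frequency[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy documents, not taken from the slides.
docs = [["storm", "hits", "coast"], ["storm", "warning"], ["election", "results"]]
weights = tf_idf(docs)
```

A word that occurs in fewer documents ("hits") gets a larger weight than one that occurs in many ("storm"), and a word present in every document would get weight 0.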



                                   Results

       Precision, recall and F1 of clustering 367 news stories from ABC and CNN.
       k is determined by the eigengap.
       Vocabulary size: Salience 3743 words, TF-IDF 7242 words.

       Co-clustering
        Matrix entries   Precision   Recall     F1
        Salience          74.6 %     41 %       52.9 %
        TF-IDF            50.4 %     40.7 %     45.1 %

       k-means
        Matrix entries   Precision   Recall     F1
        Salience          69.5 %     37.1 %     48.4 %
        TF-IDF            38.3 %     41.8 %     40 %
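As a sanity check, F1 is the harmonic mean of precision and recall, and the reported values can be recomputed from the other two columns. The sketch below does this for the four rows above; small differences are expected because the published precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, matrix entries, precision %, recall %, reported F1 %) from the tables above.
reported = [
    ("co-clustering", "Salience", 74.6, 41.0, 52.9),
    ("co-clustering", "TF-IDF", 50.4, 40.7, 45.1),
    ("k-means", "Salience", 69.5, 37.1, 48.4),
    ("k-means", "TF-IDF", 38.3, 41.8, 40.0),
]

differences = [abs(f1_score(p, r) - f1) for _, _, p, r, f1 in reported]
```

All four recomputed values agree with the table to within 0.1 percentage points.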



                                   Results


       Precision, recall and F1 of clustering 367 news stories from ABC and CNN.
       k is determined by the eigengap.

       Co-clustering
        Matrix entries   Precision   Recall     F1
        Salience          64.3 %     48.3 %     55.2 %

       k-means
        Matrix entries   Precision   Recall     F1
        Salience          58.3 %     41.7 %     48.8 %



                                      Goals
          1. Find aligning segments in
               1.1 text-text pairs
               1.2 text-video pairs
          2. Expand to multiple documents (text and video)



                                      Goals




       Using aligned segments:
           • Create an elaborated story from several sources
           • Create links between video and text
           • Summarize video and text
           • Select the appropriate medium for presenting information



                                  Segments


       Segments can be defined at different resolutions
           • in text:
                • word
                • sentence
                • paragraph
           • in video:
                • image
                • shot
           • Expand to multiple documents (text and video)



                                   Problems




           • Degrees of comparability:
               • Parallel pairs
               • Near-parallel pairs
               • Comparable pairs
           • Representation of segments in different media: how can
               they be compared?



                                         Techniques



    • Micro-macro alignment
        • Top-down
        • Bottom-up
    • Make use of several assumptions:
        • Linearity
        • Low variance of slope
        • Injectivity
    • Annealing and Context
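The three assumptions can be read as constraints on a chain of candidate correspondence points (x = segment position in one document, y = position in the other). The sketch below is one possible interpretation only: the function name, the least-squares fit, and the thresholds are hypothetical and not taken from the slides.

```python
def check_chain(points, max_deviation, expected_slope, max_slope_error):
    """Check a chain of (x, y) correspondence points against three assumptions."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Injectivity: no segment on either side is used by more than one point.
    injective = len(set(xs)) == len(xs) and len(set(ys)) == len(ys)

    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All points share one x position: no line can be fitted.
        return {"injective": injective, "linear": False, "slope_ok": False}

    # Least-squares line through the chain.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    intercept = mean_y - slope * mean_x
    # Linearity: every point lies close to the fitted line.
    deviation = max(abs(y - (slope * x + intercept)) for x, y in points)
    return {
        "injective": injective,
        "linear": deviation <= max_deviation,
        # Low variance of slope: the chain's slope stays near the expected one.
        "slope_ok": abs(slope - expected_slope) <= max_slope_error,
    }


good_chain = check_chain([(0, 0), (1, 1), (2, 2), (3, 3)], 0.5, 1.0, 0.2)
bad_chain = check_chain([(0, 0), (1, 1), (1, 2)], 0.5, 1.0, 0.2)
```

A chain along the main diagonal passes all three checks, while a chain that maps one segment to two different targets fails injectivity.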



                            Multiple documents




       Two possible directions
          1. Dimension reduction
          2. Expand dimensions of search algorithms
