TECHNICAL SEMINAR
ON

GROUPER: A DYNAMIC CLUSTERING
INTERFACE TO WEB SEARCH RESULTS

BY
PREET KANWAL
Dr. AMBEDKAR INSTITUTE OF TECHNOLOGY, BANGALORE-56
OUTLINE
Problem Definition
Proposed Solution & Goals
How Grouper Works
Empirical Evaluation
Conclusion
PROBLEM DEFINITION

Search engine results are not easy to browse.
Problem with search engines
• Search engines return a long ordered list of document “snippets”.
Disadvantages
• Ranked list presentation.
• Users are forced to sift through the list to find relevant documents.
• Wasted time.
• Low precision.
Document clustering
• Alternative method for organizing retrieval results.
• Algorithms group the documents based on their similarities.
Advantages:
• Relevant documents are easy to locate.
• Gives an overview of the retrieved document set.
Document Clustering
• Pre-retrieval method
• Post-retrieval method
Post-retrieval Document Clustering
• Superior results.
• Clusters are computed from the returned document set.
• Cluster boundaries appropriately partition the set of documents at hand.
Pre-retrieval Document Clustering
• Offline clustering of documents.
• Clustering is performed in advance on the collection as a whole.
• Clusters might be based on features that are infrequent in the retrieved set.
Problem with search engines
• Severe resource constraints.
• Cannot dedicate enough CPU time to each query: clustering at query time is NOT FEASIBLE.
• Hence clusters have to be PRE-COMPUTED.
PROPOSED SOLUTION
GROUPER:
• Document clustering interface to the HuskySearch meta-search service.
HuskySearch meta-search engine:
• Based on MetaCrawler.
• Retrieves results from several popular web search engines.
• Clusters results using the STC (Suffix Tree Clustering) algorithm.
Advantages
• Easily browsable.
• Addresses the scalability issue.
• No additional resource demands on the search engines.
• Fast.
• Runs on the client machine.
• Suitable for distributed IR systems.
Goals
1) Coherent Clusters:
• Group similar documents together.
• Generate overlapping clusters when appropriate.
2) Efficiently Browsable:
• Cluster descriptions must be concise and accurate.
3) Speed:
• Algorithmic speed.
• Snippet tolerance.
Clustering can be done in 2 ways:
a) Clustering snippets.
b) Download and cluster.
Overview of STC Algorithm
• Linear-time clustering algorithm.
• Based on identifying phrases common to a group of documents.
PHRASE: Ordered sequence of one or more words.
BASE CLUSTER: Set of documents that share a common phrase.
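The two definitions above can be illustrated with a small sketch. It enumerates word n-grams by brute force instead of building a suffix tree, so it is not linear time and not STC itself. The function names, the `max_len` and `min_docs` parameters and the three sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def phrases(words, max_len=3):
    """Yield every ordered run of 1..max_len consecutive words as a tuple."""
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            yield tuple(words[start:end])


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each phrase to the set of document ids that contain it.

    Only phrases shared by at least min_docs documents are kept, so each
    entry is a base cluster in the sense defined above.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for phrase in phrases(text.lower().split(), max_len):
            index[phrase].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}


# Hypothetical snippets, chosen only to show shared phrases.
docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
```

Here the phrase `("ate", "cheese")` yields a base cluster containing documents 0 and 1, while `("cat", "ate", "cheese")` occurs in only one document and is dropped.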
STC has 3 logical steps
1) Document “cleaning”:
• Transformation using a light stemming algorithm.
• Sentence boundaries are marked; non-word tokens are stripped.
Eg: Hello..!! ("..!!" is a non-word token and a sentence boundary)
2) Identification of Base Clusters:
• Inverted index of phrases, using a data structure called a SUFFIX TREE.
• Each base cluster is assigned a SCORE.
• SCORE(No. of docs, No. of words in phrase).
• Stoplist is maintained.
3) Merging Base Clusters into clusters:
• Base clusters with a high degree of overlap are merged.
• Clusters are semantically coherent (shared phrases).
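The three steps can be sketched roughly as follows. The slides give no formulas, so several details here are assumptions: the `clean` function skips stemming, the `score` formula is one hypothetical way to combine document count and phrase length, and the 0.5 overlap `threshold` is an illustrative reading of "high degree of overlap". The base clusters are supplied by hand and not derived from a suffix tree.

```python
import re


def clean(text):
    """Step 1 (simplified): split into sentences of lowercase word tokens.

    Runs of '.', '!' or '?' mark sentence boundaries; other non-word
    characters are dropped. Stemming is omitted in this sketch.
    """
    sentences = []
    for chunk in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z0-9]+", chunk.lower())
        if words:
            sentences.append(words)
    return sentences


def score(n_docs, phrase_len, max_len=6):
    """Step 2 (hypothetical formula): more documents and longer phrases score higher."""
    return n_docs * min(phrase_len, max_len)


def merge_base_clusters(base, threshold=0.5):
    """Step 3: union base clusters whose document sets overlap in both directions."""
    items = list(base.items())
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i][1], items[j][1]
            common = len(a & b)
            if common / len(a) > threshold and common / len(b) > threshold:
                parent[find(i)] = find(j)

    merged = {}
    for i, (phrase, doc_ids) in enumerate(items):
        group = merged.setdefault(find(i), {"phrases": [], "docs": set()})
        group["phrases"].append(phrase)
        group["docs"] |= doc_ids
    return list(merged.values())


# Hand-written base clusters (phrase -> document ids), for illustration only.
base = {
    ("cat", "ate"): {0, 2},
    ("ate",): {0, 1, 2},
    ("cheese",): {0, 1},
    ("mouse",): {1, 2},
    ("too",): {1, 2},
    ("ate", "cheese"): {0, 1},
}
clusters = merge_base_clusters(base)
```

In this toy input every base cluster overlaps heavily with the one for "ate", so all six merge into a single cluster that keeps all of their phrases as its description.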
STC Characteristics
• Overlapping clusters; shared phrases.
• Fast and incremental.
• Does not coerce the documents into a predefined number of clusters.
User Interface
Grouper’s Query Interface
A Query Result
Summary of cluster
Refine Query Based On This Cluster
DESIGN FOR SPEED
3 characteristics that make Grouper fast:
1) Incrementality of the clustering algorithm.
• STC is incremental.
• Grouper can use free CPU time.
• Produces the result immediately after the last document arrives.
2) Efficient implementation.
• STC performs a large no. of string comparisons.
• Each word is transformed into a unique integer: faster comparisons.
• Documents of each base cluster are encoded as a bit vector for efficient calculation of document overlap.
• Additional speedup:
a) Remove leading and ending stopped words. Eg: "the vice president of" becomes "vice president".
b) Strip off words that do not appear in a minimal no. of documents.
3) Ability to form coherent clusters based on snippets.
• 2 modes of clustering results:
a) Cluster the snippets (fast).
b) Download and cluster (high clustering quality).
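The implementation tricks listed above can be sketched in a few lines. This is an illustration under assumptions, not Grouper's actual code: the stoplist contents, the `WordIds` class and the helper names are hypothetical, and a Python integer stands in for the bit vector.

```python
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to"}  # illustrative stoplist


def trim_stop_words(phrase):
    """Drop leading and trailing stop words; interior ones are kept."""
    start, end = 0, len(phrase)
    while start < end and phrase[start] in STOP_WORDS:
        start += 1
    while end > start and phrase[end - 1] in STOP_WORDS:
        end -= 1
    return phrase[start:end]


class WordIds:
    """Assign each distinct word a unique integer so comparisons are on ints."""

    def __init__(self):
        self._ids = {}

    def encode(self, words):
        # A new word receives the next unused integer; a known word reuses its id.
        return tuple(self._ids.setdefault(w, len(self._ids)) for w in words)


def to_bits(doc_ids):
    """Encode a set of document ids as an integer used as a bit vector."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def overlap(bits_a, bits_b):
    """Count the documents common to two bit-vector-encoded base clusters."""
    return bin(bits_a & bits_b).count("1")


trimmed = trim_stop_words(["the", "vice", "president", "of"])
encoded = WordIds().encode(["vice", "president", "vice"])
shared = overlap(to_bits({0, 2}), to_bits({0, 1, 2}))
```

With the bit-vector encoding, the overlap of two base clusters reduces to one bitwise AND followed by a bit count, which is the point of the encoding.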
EMPIRICAL EVALUATION OF GROUPER
• Difficult.
• Heterogeneous user population.
• Users search for a wide variety of tasks.
Documents retrieved in HuskySearch sessions were clustered using:
a) STC algorithm
b) K-means clustering algorithm
• Same no. of clusters.
• Calculate no. of clusters followed and docs followed.
• STC produces coherent clusters: STC > K-means.
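One possible reading of this comparison is to count, for each clustering of the same session, how many clusters contain at least one document the user followed; if the followed documents fall into fewer clusters, the clustering is treated as more coherent. That interpretation, the function name and the sample data below are assumptions for illustration and are not taken from the study.

```python
def clusters_followed(clusters, followed_docs):
    """Count clusters that contain at least one document the user followed."""
    followed = set(followed_docs)
    return sum(1 for docs in clusters if followed & set(docs))


# Hypothetical session: the same 8 documents clustered two ways, 3 clusters each.
stc_like = [{1, 2, 3}, {4, 5}, {6, 7, 8}]
kmeans_like = [{1, 4, 6}, {2, 5, 7}, {3, 8}]
followed = [1, 2, 3]

stc_count = clusters_followed(stc_like, followed)
kmeans_count = clusters_followed(kmeans_like, followed)
```

In this made-up session the followed documents sit in one cluster of the first clustering but are spread over all three clusters of the second.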
Comparison to a Ranked List Display
Compared with HuskySearch based on:
1. Number of documents followed
2. Time spent
3. Click distance
No. of docs followed by users
3 hypotheses made:
1) Easier to find interesting docs.
2) Helps find additional interesting docs.
3) Helps in tasks where several docs are required.
The percentage of sessions in which users followed multiple documents is higher in Grouper.
Time spent on each doc followed
Time spent = time to download + time to view selected doc + time to find next doc of interest
or
it's the time between a user's request for a doc and the user's previous request.
Time spent = time spent in network delays + time in reading docs + time in traversing the results presentation.
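The second definition above (time between a request and the previous request) can be computed directly from logged request timestamps. The sketch below assumes a hypothetical log reduced to a list of request times in seconds for one session; the function name and the sample values are illustrative.

```python
def time_spent(request_times):
    """Return the gaps between consecutive document requests.

    Each gap lumps together network delay, reading time and the time spent
    traversing the results, since a log of requests cannot separate them.
    """
    ordered = sorted(request_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical request timestamps (seconds since the start of the session).
gaps = time_spent([0, 40, 95, 100])
```

A session with n requests produces n - 1 gaps, so the time spent after the last request is not measured by this definition.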
Click distance
Distance between successive user clicks on the document set.
In ranked list interface:
• Click distance = no. of snippets between 2 clicks.
In clustering interface:
• Additional cost of skipping snippets.
• For any cluster visited, all its snippets are assumed to be scanned.
[Figure: numbered snippet lists (1 to 20) shown with Cluster 1, Cluster 2 and Cluster 3; labels: "18", "4", "22 snippets scanned"]
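A minimal sketch of the two click-distance measures follows. The ranked-list version counts the snippets strictly between two clicked ranks. The clustering version is a simplified assumption that sums the sizes of the clusters visited between two clicks and ignores any extra cost for skipped snippets. Function names and sample values are hypothetical.

```python
def ranked_click_distance(rank_a, rank_b):
    """Number of snippets lying strictly between two clicked ranks."""
    return max(abs(rank_b - rank_a) - 1, 0)


def cluster_click_distance(visited_cluster_sizes):
    """Snippets scanned if every snippet of each visited cluster is read.

    Simplified model: the cost of skipping snippets is not included.
    """
    return sum(visited_cluster_sizes)


ranked = ranked_click_distance(1, 20)
clustered = cluster_click_distance([5, 7])
```

Under this model a clustering interface pays off when the documents of interest sit in a few small clusters, because the user scans only those clusters and not the whole ranked list.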
CONCLUSION
• Grouper: empirical assessment of user behavior given a clustering interface to web search results.
• Comparison to the logs of HuskySearch.
• Problems:
1) May fail to capture the semantic distinctions that users expect while merging base clusters into clusters.
2) Difficult to navigate if the number of clusters is large.
• Solution: Grouper II
1) Allows users to view non-merged base clusters.
2) Supports a hierarchical and interactive interface.