Grouper
1. TECHNICAL SEMINAR
ON
GROUPER: A DYNAMIC CLUSTERING
INTERFACE TO WEB SEARCH RESULTS
BY
PREET KANWAL
Dr. AMBEDKAR INSTITUTE OF TECHNOLOGY, BANGALORE-56
4. Problem of search engines
• Search engines return a long ordered list of document “snippets”.
5. Disadvantage
Ranked list presentation.
Users are forced to sift through the list to find relevant documents.
Wastage of time.
Low precision.
6. Document clustering
Alternative method for organizing retrieval results.
Algorithms group the documents based on their similarities.
Advantages:
Relevant documents are easy to locate.
Overview of the retrieved document set.
8. Post-retrieval Document Clustering
Superior results.
Clusters are computed based on the returned document set.
Cluster boundaries appropriately partition the set of documents at hand.
9. Pre-Retrieval document clustering
Offline clustering of documents.
Document clustering is performed in advance on the collection as a whole.
Clusters might be based on features that are infrequent in the retrieved set.
10. Problem with search engines
Severe resource constraints.
Cannot dedicate enough CPU time to each query, so clustering per query is NOT FEASIBLE.
Hence clusters have to be PRE-COMPUTED.
11. PROPOSED SOLUTION
GROUPER:
Document clustering interface to the HuskySearch meta-search service.
HuskySearch meta-search engine:
Based on MetaCrawler.
Retrieves results from several popular web search engines.
Clusters results using STC algorithm.
14. Goals
1) Coherent Clusters:
Group similar documents together.
Generate overlapping clusters when appropriate.
2) Efficiently Browsable:
Cluster description must be concise and accurate.
3) Speed:
Algorithmic speed.
Snippet tolerance.
Clustering can be done in 2 ways:
a) Clustering snippets.
b) Download and cluster.
15. Overview of STC Algorithm
Linear-time clustering algorithm.
Based on identifying phrases common to a group of documents.
PHRASE: Ordered sequence of one or more words.
BASE CLUSTER: Set of documents that share a common phrase.
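The two definitions above can be illustrated with a minimal sketch. It is not the suffix-tree construction that STC uses: it simply enumerates word n-grams, so it is not linear time. The function name `base_clusters` and the `max_len` limit are assumptions made for illustration.

```python
from collections import defaultdict


def base_clusters(docs, max_len=3):
    """Map each phrase (a tuple of words) to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            # Enumerate every phrase of 1..max_len words starting at this position.
            for end in range(start + 1, min(start + max_len, len(words)) + 1):
                index[tuple(words[start:end])].add(doc_id)
    # A base cluster needs a phrase shared by at least two documents.
    return {phrase: ids for phrase, ids in index.items() if len(ids) >= 2}


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
```

Here the phrase ("ate", "cheese") yields the base cluster of documents 0 and 1, while ("mouse", "too") occurs in only one document and forms no base cluster.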
16. STC has 3 logical steps
1) Document “cleaning”:
Transformation using a light stemming algorithm.
Sentence boundaries are marked; non-word tokens are stripped.
Eg: in “Hello..!!”, “Hello” is a word and “..!!” is a non-word token marking a sentence boundary.
2) Identification of Base Clusters:
Inverted index of phrases, built using a data structure called a SUFFIX TREE.
Each base cluster is assigned a SCORE.
SCORE(No. of docs, No. of words in phrase).
Stoplist is maintained.
3) Merging Base Clusters into clusters:
Base clusters with a high degree of overlap are merged.
Clusters are semantically coherent (shared phrases).
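A rough sketch of the cleaning and scoring steps follows. The slides give neither the stemming rules, the stoplist, nor the exact score formula, so the plural-stripping rule, the `STOPLIST` contents and the shape of `score` (document count times a factor that grows with the number of non-stoplist words, capped at six) are all assumptions chosen for illustration.

```python
import re

# Assumed, illustrative stoplist; the real list is not given in the slides.
STOPLIST = {"the", "of", "a", "an", "and", "to", "in"}


def light_stem(word):
    # Assumed rule: strip a plural "s" only.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(text):
    """Split text into sentences of stemmed words; non-word tokens are dropped."""
    sentences = []
    # Runs of punctuation mark sentence boundaries.
    for chunk in re.split(r"[.!?;:]+", text.lower()):
        words = [light_stem(w) for w in re.findall(r"[a-z0-9]+", chunk)]
        if words:
            sentences.append(words)
    return sentences


def score(num_docs, phrase_words):
    """Assumed form: number of documents times a factor of the phrase length."""
    effective = sum(1 for w in phrase_words if w not in STOPLIST)
    if effective == 0:
        factor = 0.0
    elif effective == 1:
        factor = 0.5  # single-word phrases are penalized
    else:
        factor = float(min(effective, 6))  # constant beyond six words
    return num_docs * factor


sentences = clean("Hello..!! Web search engines.")
```

With these assumptions, `sentences` holds one list of words per sentence, and a two-word phrase shared by three documents scores higher than a single word shared by four.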
17. STC Characteristics
Overlapping clusters; shared phrases.
Fast and incremental.
Does not coerce the documents into a predefined number of clusters.
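The sketch below shows how merging base clusters by overlap can lead to these characteristics. It assumes that two base clusters are linked when their common documents exceed a fraction of each one's size, and that final clusters are the connected groups of linked base clusters. The threshold value of 0.5 and the function name are assumptions, not taken from the slides.

```python
def merge_base_clusters(base, threshold=0.5):
    """base: list of sets of document ids. Returns the merged clusters as sets."""
    parent = list(range(len(base)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(base)):
        for j in range(i + 1, len(base)):
            common = len(base[i] & base[j])
            # Link two base clusters only if the overlap is high relative to both.
            if common > threshold * len(base[i]) and common > threshold * len(base[j]):
                parent[find(i)] = find(j)

    merged = {}
    for i, doc_ids in enumerate(base):
        merged.setdefault(find(i), set()).update(doc_ids)
    return list(merged.values())


base = [{0, 1, 2}, {1, 2, 3}, {7, 8}, {2, 7}]
clusters = merge_base_clusters(base)
```

In this example the first two base clusters merge, the other two stay separate, and documents 2 and 7 each end up in two clusters. The number of clusters falls out of the data rather than being fixed in advance.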
21. DESIGN FOR SPEED
3 characteristics that make Grouper fast:
1) Incrementality of the Clustering Algorithm.
STC is incremental.
Grouper can use free CPU time.
Produces results immediately after the last document arrives.
2) Efficient Implementation.
STC performs a large no. of string comparisons.
Each word is transformed into a unique integer.
Faster comparisons.
Documents of each base cluster are encoded as a bit vector for efficient calculation of document overlap.
3) Ability to form coherent clusters based on snippets.
2 modes of clustering results:
a) Cluster the snippets (fast).
b) Download and cluster (high clustering quality).
Additional speedup:
a) Remove leading and ending stopped words. Eg: “the vice president of” becomes “vice president”.
b) Strip off words that do not appear in a minimal no. of documents.
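The implementation ideas on this slide can be sketched as follows. The class and function names are hypothetical, and Python integers stand in for the bit vectors; the slides do not describe Grouper's actual data layout.

```python
class Vocabulary:
    """Assigns each distinct word a unique integer so phrases compare as integer tuples."""

    def __init__(self):
        self._ids = {}

    def encode(self, words):
        return tuple(self._ids.setdefault(w, len(self._ids)) for w in words)


def to_bitvector(doc_ids):
    """Encode a set of document ids as an integer with one bit per document."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits


def overlap(a, b):
    """Number of documents shared by two bit-vector encoded base clusters."""
    return bin(a & b).count("1")


def trim_stopwords(words, stoplist):
    """Remove leading and trailing stoplist words from a phrase."""
    start, end = 0, len(words)
    while start < end and words[start] in stoplist:
        start += 1
    while end > start and words[end - 1] in stoplist:
        end -= 1
    return words[start:end]


vocab = Vocabulary()
encoded = vocab.encode(["vice", "president", "vice"])
shared = overlap(to_bitvector({0, 2, 5}), to_bitvector({2, 5, 6}))
trimmed = trim_stopwords(["the", "vice", "president", "of"], {"the", "of"})
```

The overlap of two base clusters then reduces to a single bitwise AND followed by a bit count, which is the kind of calculation the merge step repeats for every pair of base clusters.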
22. EMPIRICAL EVALUATION OF GROUPER
Difficult:
Heterogeneous user population.
Search for a wide variety of tasks.
Documents retrieved in HuskySearch sessions clustered using:
STC algorithm
K-means clustering algorithm.
Same no. of docs followed.
Calculate no. of clusters followed.
STC produces coherent clusters.
STC > K-means
23. Comparison to a Ranked List Display
Compared with HuskySearch based on:
1. Number of documents followed
2. Time spent
3. Click distance
24. No. of docs followed by users
3 hypotheses made:
1) Easier to find interesting docs.
2) Helps find additional interesting docs.
3) Helps in tasks where several docs are required.
Percentage of sessions in which users followed multiple documents is higher in Grouper.
25. Time spent on each doc followed
Time spent = time in network delays (downloading the selected doc)
+ time reading (viewing) the doc
+ time traversing the results presentation
+ time to find the next doc of interest
or
it’s the time between a user’s request for a doc and the user’s previous request.
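The second definition lends itself to a direct calculation from a session log. The sketch below assumes each session is available as an ordered list of request timestamps in seconds; that log format and the function name are assumptions made for illustration.

```python
def time_per_document(request_times):
    """Time between each request and the previous one, for one session's ordered timestamps."""
    return [later - earlier for earlier, later in zip(request_times, request_times[1:])]


times = time_per_document([0, 12, 45, 50])
```

Each value bundles download, reading, traversal and search time together, which is why the measure cannot separate those components.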
26. Click distance
Distance between successive user clicks on the document set.
In ranked list interface:
Click distance = no. of snippets between 2 clicks.
In clustering interface:
Additional cost of skipping snippets.
In any cluster visited, all snippets are scanned.
[Figure: snippet lists numbered 1 to 20 grouped as Cluster 1, Cluster 2 and Cluster 3, with the labels “22 snippets scanned”, “18” and “4”.]
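A minimal sketch of the two click-distance measures, under stated assumptions: in the ranked list, only snippets lying strictly between the two clicked ranks are counted, and in the clustering interface every snippet of every visited cluster counts as scanned. Both function names and both counting conventions are assumptions; the slides do not spell out the exact formula.

```python
def ranked_click_distance(prev_rank, next_rank):
    """Snippets lying strictly between two clicked ranks (assumed convention)."""
    return max(abs(next_rank - prev_rank) - 1, 0)


def cluster_click_distance(visited_cluster_sizes):
    """Assumes every snippet in every cluster visited between two clicks is scanned."""
    return sum(visited_cluster_sizes)


ranked = ranked_click_distance(3, 10)
clustered = cluster_click_distance([5, 4])
```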
27. CONCLUSION
• Grouper
• Empirical assessment of user behavior given a clustering interface to web search results.
• Comparison to the logs of HuskySearch.
• Problems:
1) May fail to capture semantic distinctions that users expect while merging base clusters into clusters.
2) Difficult to navigate if the number of clusters is large.
• Solution: Grouper II
1) Allows users to view non-merged base clusters.
2) Supports a hierarchical and interactive interface.