The document presents the results of a study analyzing social bookmarking data from Delicious to determine if it can improve web search. It finds that Delicious often has recently updated pages, contains 12.5-30% new URLs not in search engines, and covers 9-19% of search results. However, the total number of posts on Delicious is relatively small compared to the size of the web.
Paper Presentation for INF 384H (http://courses.ischool.utexas.edu/Lease_Matt/2011/Fall/INF384H/)
1. Can Social Bookmarking Improve Web Search?
Ashish Jain
Information Retrieval
Paper Presentation
2. Outline
1 Introduction
2 Terminology
3 Collection of Data
4 Related Work
5 URLs
Result 1 (Positive)
Result 2 (Positive)
Result 3 (Positive)
Result 4 (Positive)
Result 5 (Positive)
Result 8 (Negative)
Result 9 (Negative)
6 Tags
Result 6 (Positive)
Result 7 (Positive)
Result 10 (Negative)
Result 11 (Negative)
7 Discussion
3. Introduction
What is social bookmarking?
Show video (http://www.commoncraft.com/video/social-bookmarking).
Ashish Jain (INF384H) Social Bookmarking Paper Presentation 3 / 51
4. Introduction
Figure: Major types of data used by search engines
5. Introduction
What information does del.icio.us have?
Lots of <url, tag, user> tuples.
How can del.icio.us information help a search engine?
If the URLs are unknown to a search engine, they can be added to the
list of URLs to be crawled.
Vocabulary problem: Users use different words to refer to the same
information. For example, a user searching for pain killers might enter
the query “analgesic”.
6. Introduction
Possibilities
Suppose K represents known to a search engine and U represents unknown to a search engine.
            Tags (K)        Tags (U)
URLs (K)    Both known      Tags unknown
URLs (U)    URLs unknown    Both tags and URLs unknown
When will del.icio.us information be useful to a search engine?
When the URLs in del.icio.us are not a subset of the URLs crawled by a search engine.
When the tags given to a web page are not present in the URL, title, or content of that page.
7. Introduction
Authors are trying to find answers to the following questions:
How often do we find “non-obvious” tags?
Is del.icio.us really more up-to-date than a search engine?
What coverage does del.icio.us have of the web?
8. Terminology
Definitions
Triple: a <user_i, tag_j, url_k> tuple, signifying that user i has tagged URL k with tag j.
Post: a URL bookmarked by a user, together with the associated metadata. A post is made up of many triples, though it may also contain information such as a user comment.
Label: a <tag_i, url_k> pair, signifying that at least one triple containing tag i and URL k exists in the system.
Host: the full host part of a URL. For example, in http://i.stanford.edu/index.html, i.stanford.edu is the host.
Domain: the institutional-level part of the host. For example, in http://i.stanford.edu/index.html, stanford.edu is the domain.
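The definitions above can be made concrete with a small sketch. It is illustrative only: the sample triples are invented, the function names are hypothetical, and the domain rule (last two labels of the host) is a simplifying assumption that fails for hosts such as example.co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# A triple is (user, tag, url). These sample values are made up for illustration.
triples = [
    ("alice", "databases", "http://i.stanford.edu/index.html"),
    ("alice", "research", "http://i.stanford.edu/index.html"),
    ("bob", "databases", "http://i.stanford.edu/index.html"),
]

def posts_from_triples(triples):
    """Group triples into posts: one post per (user, url), holding its set of tags."""
    posts = defaultdict(set)
    for user, tag, url in triples:
        posts[(user, url)].add(tag)
    return dict(posts)

def labels_from_triples(triples):
    """A label is a (tag, url) pair that occurs in at least one triple."""
    return {(tag, url) for _user, tag, url in triples}

def host_of(url):
    """Full host part of a URL."""
    return urlsplit(url).hostname

def domain_of(url):
    """Naive domain: last two labels of the host (assumption, see lead-in)."""
    return ".".join(host_of(url).split(".")[-2:])
```

Here the three triples collapse into two posts (one per user) and two labels (one per distinct tag on the URL).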
9. Collection of Data
Possible Sources
Del.icio.us Interfaces
“Recent” feed provides the most recent bookmarks posted to
del.icio.us in real time
All posts for a given URL
All posts by a given user
Most recent posts with a given tag
Crawl
Alternatively, one can crawl del.icio.us treating it as a tripartite graph of
users, URLs and tags.
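A crawl over the tripartite graph can be sketched as a breadth-first traversal. This is a minimal sketch, not the authors' crawler: the `neighbors` callback is hypothetical (a real crawl would fetch del.icio.us pages), and the toy graph is invented. Because such a traversal expands from well-connected nodes, it tends to reach popular URLs, tags, and users first.

```python
from collections import deque

def crawl(seed, neighbors, limit=1000):
    """Breadth-first traversal over (kind, value) nodes, kind in {'user', 'url', 'tag'}.

    `neighbors(node)` is a hypothetical callback returning the adjacent nodes.
    Stops once `limit` distinct nodes have been seen.
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(node):
            if nxt in seen:
                continue
            if len(seen) >= limit:
                return seen
            seen.add(nxt)
            queue.append(nxt)
    return seen

# Toy in-memory graph standing in for the real site (invented data).
toy_graph = {
    ("user", "alice"): [("url", "http://a.example/"), ("tag", "java")],
    ("url", "http://a.example/"): [("user", "alice"), ("user", "bob"), ("tag", "java")],
    ("tag", "java"): [("user", "alice"), ("url", "http://a.example/")],
    ("user", "bob"): [("url", "http://a.example/")],
}
found = crawl(("user", "alice"), lambda n: toy_graph.get(n, []))
```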
10. Collection of Data
Datasets
(C)rawl: Large-scale crawl of del.icio.us in September 2006.
(R)ecent: Data gathered using the del.icio.us recent feed interface for nearly 8 months beginning September 28, 2006.
(M)onth: Data gathered from the del.icio.us recent feed interface for one complete month starting May 25, 2007. The gathering process was enhanced, so it is more accurate than the R dataset.
11. Collection of Data
Comparison
               (C)rawl                   (R)ecent        (M)onth
Posts          ≈ 22M                     ≈ 11M           ≈ 3.6M
Unique URLs    ≈ 1.3M                    ≈ 3M            ≈ 2.5M
Disadvantage   Biased towards popular    Missing data    Missing data
               URLs, tags, users
12. Collection of Data
Query Dataset
AOL Query Dataset
About 20 million search queries by roughly 650,000 users
Used to simulate distribution of queries that a search engine might receive.
13. URLs
Figure: Overview
14. URLs Result 1 (Positive)
Result 1
Aim
Are pages posted to del.icio.us often recently modified?
15. URLs Result 1 (Positive)
Methodology
Modification Date of a Web page
As we studied in previous papers, determining the exact modification
date of a web page is hard.
The search engines have to estimate the modification date of a web
page in order to crawl the web efficiently.
Yahoo! Search API gives the modification date of a web page.
The authors use this API to determine the modification date of each page.
16. URLs Result 1 (Positive)
Methodology
Compare
del.icio.us: pages sampled from the del.icio.us recent feed as they were posted.
Yahoo! 1, 10, and 100: the top 1, 10, and 100 results (respectively) of Yahoo! searches for queries sampled from the AOL query dataset.
ODP: pages sampled from the Open Directory Project (dmoz.org).
17. URLs Result 1 (Positive)
Results
Pages from del.icio.us are often more recently modified than pages from ODP.
Found a correlation between a search result being ranked higher and a result having been modified more recently.
Top 10 results from Yahoo! Search were about the same age as the pages found bookmarked in del.icio.us.
Conclusion
del.icio.us users post interesting pages that are actively updated or have
been recently created.
18. URLs Result 2 (Positive)
Result 2
Aim
How many pages belonging to del.icio.us are not known to a search engine?
19. URLs Result 2 (Positive)
Methodology
Sample pages from the del.icio.us feed as they were posted, and run searches for those pages immediately afterwards.
Of those pages, about 42.5% were not found. This could be due to several reasons:
The page is indexed under another canonicalized URL
The page could be spam
The page could have an odd MIME type, for example an image
The page may not have been found by the search engine yet
Keep searching for the missing pages over the next four weeks. If a page is later found, assume it was new and had not yet been indexed when it was posted.
20. URLs Result 2 (Positive)
Result
Out of 5,724 sampled URLs that were missing, 1,750 were later found.
Implies roughly 30% of the missing URLs were new URLs.
Implies roughly 12.5% of URLs posted to del.icio.us were new, i.e. 42.5% × 30%.
Conclusion
del.icio.us can serve as a (small) data source for new web pages and to
help crawl ordering.
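The arithmetic behind Result 2 can be checked directly from the figures on the slide. With unrounded inputs the product comes out nearer 13%, so the 12.5% on the slide is best read as a rounded estimate. The variable names are illustrative.

```python
sampled_missing = 5724   # missing URLs that were tracked
later_found = 1750       # of those, found by the search engine within four weeks
missing_rate = 0.425     # share of sampled del.icio.us pages not found at posting time

# Share of the missing URLs that turned out to be new pages.
new_share_of_missing = later_found / sampled_missing

# Share of all sampled del.icio.us pages that were new, unindexed pages.
new_share_of_posts = missing_rate * new_share_of_missing

print(f"{new_share_of_missing:.1%} of missing URLs were later found")
print(f"{new_share_of_posts:.1%} of sampled pages were new and unindexed")
```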
21. URLs Result 2 (Positive)
Figure: Result 2
22. URLs Result 3 (Positive)
Aim
Check coverage of search results by del.icio.us
23. URLs Result 3 (Positive)
Methodology
Sample queries from the AOL dataset in proportion to query event frequency (i.e., biased towards popular queries).
Run query on Yahoo! Search
Intersect search results with datasets C, M, R.
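The three steps above can be sketched as follows. This is an illustration under stated assumptions: search results are supplied as a precomputed dictionary rather than fetched from Yahoo! Search, and the function names and sample data are hypothetical.

```python
import random

def sample_queries(query_counts, k, seed=0):
    """Sample k queries with probability proportional to how often each was issued."""
    rng = random.Random(seed)
    queries = list(query_counts)
    weights = [query_counts[q] for q in queries]
    return rng.choices(queries, weights=weights, k=k)

def coverage(results_per_query, bookmarked_urls, top_n):
    """Fraction of top-n result URLs (over all queries) that appear in the bookmark set."""
    hits = total = 0
    for urls in results_per_query.values():
        for url in urls[:top_n]:
            total += 1
            hits += url in bookmarked_urls
    return hits / total if total else 0.0

# Invented data: ranked results per query, and the URLs present in a bookmark dataset.
results_per_query = {"q1": ["u1", "u2", "u3"], "q2": ["u4", "u1"]}
bookmarked = {"u1", "u4"}
top1 = coverage(results_per_query, bookmarked, top_n=1)
top3 = coverage(results_per_query, bookmarked, top_n=3)
```

In the toy data coverage is higher for the top result than for the top three, the same direction as the 19% versus 9% reported for the top 10 and top 100.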
24. URLs Result 3 (Positive)
Results
For the top 100 results, del.icio.us covers 9% of the results returned
for a set of over 30,000 queries.
For the top 10 results, del.icio.us covers 19% of the results returned.
Conclusion
del.icio.us URLs are disproportionately common in search results compared to their coverage of the web.
25. URLs Result 4 (Positive)
Q. Is some subset of users responsible for most of the data in del.icio.us?
On social news sites, it is commonly cited that the majority of front page posts come from a dedicated group of fewer than 100 users.
del.icio.us does exhibit some of these traits, but it is not as dependent on a relatively small group of users.
The top 10% of users account for only 56% of the posts.
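A statistic like the share of posts from the top 10% of users can be computed from a list of post authors. The sketch below is a hypothetical helper run on invented data, not the authors' code.

```python
from collections import Counter

def top_user_share(post_users, fraction=0.10):
    """Share of posts made by the most active `fraction` of users."""
    counts = sorted(Counter(post_users).values(), reverse=True)
    if not counts:
        return 0.0
    top_k = max(1, round(len(counts) * fraction))
    return sum(counts[:top_k]) / sum(counts)

# Invented data: ten users, one of whom made 11 of the 20 posts.
post_users = ["u0"] * 11 + [f"u{i}" for i in range(1, 10)]
share = top_user_share(post_users)
```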
26. URLs Result 4 (Positive)
Figure: Result 4
27. URLs Result 5 (Positive)
How much of the information added to del.icio.us is new?
Estimated using Dataset M.
A new post in Dataset M was of a URL not already in del.icio.us 40% of the time.
Should be about 30% after adjusting for filtering (how the authors arrived at this number is not explained).
How often is a completely new domain added to del.icio.us?
12% of posts in Dataset M were URLs whose domains were not in
either Dataset C or R.
Implies about 1/8th of the time
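The new-domain measurement amounts to a set-membership test. A minimal sketch follows; it assumes the domain is approximated by the last two labels of the host, and the URLs are invented.

```python
from urllib.parse import urlsplit

def naive_domain(url):
    """Approximate the domain as the last two labels of the host (an assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def new_domain_rate(month_urls, earlier_urls):
    """Fraction of posts in `month_urls` whose domain never occurs in `earlier_urls`."""
    if not month_urls:
        return 0.0
    known = {naive_domain(u) for u in earlier_urls}
    return sum(naive_domain(u) not in known for u in month_urls) / len(month_urls)

# Invented data: `earlier` stands in for Datasets C and R, `month` for Dataset M.
earlier = ["http://i.stanford.edu/index.html"]
month = [
    "http://www.stanford.edu/a",
    "http://new.example.org/x",
    "http://example.org/y",
    "http://cs.stanford.edu/",
]
rate = new_domain_rate(month, earlier)
```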
28. URLs Result 5 (Positive)
Figure: Result 5
29. URLs Result 8 (Negative)
Aim
How many URLs are posted to del.icio.us every day?
30. URLs Result 8 (Negative)
Methodology
Plot the posts for every hour in Dataset M and compare them with data collected by Philipp Keller [a]. The two are mutually reinforcing.
Also plot posts from Dataset R.
[a] http://deli.ckoma.net/stats (defunct website)
31. URLs Result 8 (Negative)
Results
About 92,000 posts per weekend day
About 133,000 posts per weekday
Implies about 851,000 posts per week
About 44 million posts per year [a]
[a] For comparison, there are about 1.5 million blog posts per day.
Conclusion
The number of del.icio.us posts per day is small compared to blog posts: about 1/10.
Posting rate on del.icio.us is marked by a series of increases followed
by periods of relative stability.
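The weekly and yearly totals can be reproduced from the per-day rates on the slide. Using the rounded rates gives 849,000 posts per week, close to the reported 851,000, which was presumably computed from unrounded values. The ratio to blog posts comes out nearer 1/12, the same order of magnitude as the 1/10 quoted.

```python
weekend_day = 92_000             # posts per weekend day (rounded, from the slide)
weekday = 133_000                # posts per weekday (rounded, from the slide)
blog_posts_per_day = 1_500_000   # comparison figure from the slide's footnote

per_week = 2 * weekend_day + 5 * weekday
per_year = per_week * 52
avg_per_day = per_week / 7
ratio = avg_per_day / blog_posts_per_day

print(f"{per_week:,} posts per week")
print(f"{per_year:,} posts per year")
print(f"del.icio.us posts per day are {ratio:.1%} of blog posts per day")
```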
32. URLs Result 9 (Negative)
Aim
What is the size of del.icio.us ?
33. URLs Result 9 (Negative)
Methodology
Divide time into three periods.
t_1: the period before Schachter's announcement on May 24th [a]
t_2: between May 24th and the start of Philipp Keller's data gathering
t_3: from the start of Philipp Keller's data gathering to the present
t_1 + t_2 + t_3 = 400,000 + (p_1 × d_b × f) + (n_k × f + m_k × d_k × f)
This comes to about 117 million posts. [b]
A reasonable estimate should be between 60 and 150 million posts. [c]
An estimated 20 to 50 percent of posts are unique URLs.
[a] Joshua Schachter, creator of del.icio.us, announced in May 2004 that there were 400,000 posts and 200,000 URLs.
[b] Most likely an overestimate, as the authors chose upper-bound values for d_b and d_k.
[c] This does not include private posts.
34. URLs Result 9 (Negative)
Results
There are about 115 million public posts. [a]
There are about 30-50 million unique URLs.
[a] The authors estimate that there are between 60 and 150 million posts. Note that 115 million is not the midpoint of 60 and 150 million.
Conclusion
The number of total posts is relatively small compared to the web as a
whole.
35. URLs Result 9 (Negative)
Figure: Result 9
36. Tags Result 6 (Positive)
Aim
Is there any correlation between tags and queries?
37. Tags Result 6 (Positive)
Methodology
Checked the tag-query overlap between the tags in dataset M and the
query terms in the AOL query dataset.
22% of the AOL query dataset is made up of domain or URL queries. Removed those.
Removed certain stop-word-like tags from Dataset M.
Plotted number of times a tag occurs in Dataset M versus the
number of times it occurs in the AOL query dataset.
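The overlap measurement can be sketched with two counters. The code below is a hedged illustration on invented data: it splits queries on whitespace, lower-cases everything, and omits the filtering of URL queries and stop-word-like tags. The function names are hypothetical.

```python
from collections import Counter

def top_tag_query_share(tags, queries, top_n):
    """Fraction of queries containing at least one of the top_n most frequent tags."""
    if not queries:
        return 0.0
    top = {t for t, _ in Counter(t.lower() for t in tags).most_common(top_n)}
    hits = sum(any(term in top for term in q.lower().split()) for q in queries)
    return hits / len(queries)

def tag_vs_query_counts(tags, queries):
    """Rows of (tag, tag count, query-term count): the data behind a scatter plot."""
    tag_counts = Counter(t.lower() for t in tags)
    term_counts = Counter(term for q in queries for term in q.lower().split())
    return [(t, c, term_counts[t]) for t, c in tag_counts.most_common()]

# Invented sample data.
tags = ["java", "java", "music", "news"]
queries = ["java tutorial", "free music", "weather today", "java news"]
share_top1 = top_tag_query_share(tags, queries, top_n=1)
share_top3 = top_tag_query_share(tags, queries, top_n=3)
rows = tag_vs_query_counts(tags, queries)
```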
38. Tags Result 6 (Positive)
Figure: A scatter plot of tag count versus query count for top tags and queries in
del.icio.us and AOL query dataset
39. Tags Result 6 (Positive)
Results
At least one of the top 100, 500, and 1000 tags occurred in 8.6%, 25.3%, and 36.8%, respectively, of the non-domain, non-URL queries.
Conclusion
del.icio.us may be able to help with queries where tags overlap with query
terms.
40. Tags Result 7 (Positive)
Aim
Are the tags in del.icio.us of good quality, or are they nonsensical tags like “cool”, “fi32”, etc.?
41. Tags Result 7 (Positive)
Methodology: User Study
10 people (graduate students and “mix of individuals associated with
our department”) manually evaluate posts to determine their quality.
Sampled one post out of every five hundred, and then gave blocks of
posts for individuals to label.
Most individuals labeled 100 to 150 posts.
For each tag, we asked whether the tag was “relevant”, “applies to
the whole document,” and/or “subjective.”
Bar for relevance was set low: whether a random person would agree
that it was reasonable to say that the tag described the page.
42. Tags Result 7 (Positive)
Results
Only about 7% were deemed subjective (less than one in twenty for
all users)
No “spam”
Conclusion
Tags on the whole are of good quality.
43. Tags Result 10 (Negative)
Aim
Do people use tags which are not obvious from the context?
44. Tags Result 10 (Negative)
Methodology
Randomly pick 20,000 posts from Dataset M.
Convert HTML to text. Also look at page text of pages that link to
the URL in question (backlinks) and pages that are linked from the
URL in question (forward links).
Extract tokens. Check whether pages are in English or not.
Lower case all tags and tokens.
Compare the tags with the extracted tokens.
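The comparison step can be sketched as a token-membership check. This is an assumption-laden illustration: HTML-to-text conversion and language detection are left out, the tokenizer keeps only lower-case alphanumeric runs (so a tag such as "c++" would never match), and the function names are hypothetical.

```python
import re

TOKEN = re.compile(r"[a-z0-9]+")

def tokens(text):
    """Lower-case a text and return its set of alphanumeric tokens."""
    return set(TOKEN.findall(text.lower()))

def tag_locations(tag, title, page_text, backlink_texts, forward_texts):
    """Report where a lower-cased tag appears among the extracted tokens."""
    tag = tag.lower()
    return {
        "title": tag in tokens(title),
        "page": tag in tokens(page_text),
        "backlinks": any(tag in tokens(t) for t in backlink_texts),
        "forward": any(tag in tokens(t) for t in forward_texts),
    }

# Invented example page with one backlink and one forward link.
where = tag_locations(
    "Java",
    title="Java Tutorials",
    page_text="Learn the Java language",
    backlink_texts=["see this java page"],
    forward_texts=["Download the JDK"],
)
```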
45. Tags Result 10 (Negative)
Results
50% of the time, the tag is in the page text.
16% of the time, it is in the title itself.
20% of the time, it appears in all three places: the page it annotates, at least one of its backlinks, and at least one of its forward links.
80% of the time, it appears in at least one of three places: the page, its backlinks, or its forward links.
The tags in the other 20% seem to be of lower quality: misspellings, confusing tagging schemes (food/dining).
Conclusion
Most tags can be discovered by a search engine.
46. Tags Result 11 (Negative)
Aim
Are some domains strongly correlated with particular tags and vice-versa?
47. Tags Result 11 (Negative)
Example
Table: This example lists the five hosts in Dataset C with the most URLs
annotated with the tag java.
48. Tags Result 11 (Negative)
Methodology
Used Dataset C which is highly biased towards popular URLs, tags
and users. Therefore, the results of this experiment do not necessarily
apply to del.icio.us as a whole.
Build a simple binary classifier and see how it does.
Figure: Function for classification
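The classification function from the figure is not reproduced in this transcript. The sketch below shows one plausible host-based binary classifier, not necessarily the authors' function: a host counts as positive for a tag when at least a threshold fraction of its known URLs carry that tag. The threshold rule, the function names, and the sample data are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def train_hosts(labeled_urls, tag, threshold=0.5):
    """Hosts where at least `threshold` of the known URLs carry `tag`.

    `labeled_urls` maps url -> set of tags. The threshold rule is an assumption.
    """
    total = defaultdict(int)
    tagged = defaultdict(int)
    for url, tags in labeled_urls.items():
        host = urlsplit(url).hostname
        total[host] += 1
        tagged[host] += tag in tags
    return {h for h in total if tagged[h] / total[h] >= threshold}

def predict(url, positive_hosts):
    """Binary decision: does this URL's host belong to the positive set?"""
    return urlsplit(url).hostname in positive_hosts

# Invented training data.
labeled = {
    "http://java.example.com/a": {"java"},
    "http://java.example.com/b": {"java", "programming"},
    "http://news.example.org/x": {"news"},
    "http://news.example.org/y": {"politics"},
}
java_hosts = train_hosts(labeled, "java")
```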
49. Tags Result 11 (Negative)
Result
Domains are often highly correlated with particular tags and vice-versa.
Conclusion
It may be more efficient to train librarians to label domains than to ask
users to tag pages.
50. Discussion
Summary
Advantages
Actively updated
Prominent in search results
Tags are relevant and objective
Disadvantages
Small amount of data
Tags often already appear in titles, page text, URLs
Not good enough to be used by major search engines.
51. Discussion
Discussion
Personalized search using del.icio.us bookmarks.
I found the conclusions drawn in subsection Result 1 hard to believe.
I found the conclusions drawn in subsection Result 5 hard to believe.
I found the conclusions drawn in subsection Result 11 hard to believe.
52. Discussion
Heymann, Koutrika, and Garcia-Molina. 2008. Can Social
Bookmarking Improve Web Search? WSDM 2008.