Can Social Bookmarking Improve Web Search?

                  Ashish Jain

               Information Retrieval


              Paper Presentation
Outline

1   Introduction
2   Terminology
3   Collection of Data
4   Related Work
5   URLs
       Result 1 (Positive)
       Result 2 (Positive)
       Result 3 (Positive)
       Result 4 (Positive)
       Result 5 (Positive)
       Result 8 (Negative)
       Result 9 (Negative)
6   Tags
       Result 6 (Positive)
       Result 7 (Positive)
       Result 10 (Negative)
       Result 11 (Negative)
7   Discussion
Introduction




What is social bookmarking?
Show video (http://www.commoncraft.com/video/social-bookmarking).




    Ashish Jain (INF384H)         Social Bookmarking     Paper Presentation   3 / 51
Introduction




               Figure: Major types of data used by search engines


Introduction




What information does del.icio.us have?
Lots of <url, tag, user> tuples.

How can del.icio.us information help a search engine?
   If the URLs are unknown to a search engine, they can be added to the
   list of URLs to be crawled.
    Vocabulary problem: Users use different words to refer to the same
    information. For example, a user searching for pain killers might enter
    the query “analgesic”.




Introduction




Possibilities
Suppose K represents known to a search engine and U represents unknown
to a search engine.
              Tags (K)         Tags (U)
 URLs (K)     Both known       Tags unknown
 URLs (U)     URLs unknown     Both tags and URLs unknown

When will del.icio.us information be useful to a search engine?
   When the URLs in del.icio.us are not a subset of the URLs crawled by
   a search engine.
    When the tags given to a particular web page are not present in the URL,
    title, or content of that page.




Introduction




Authors are trying to find answers to the following questions:
    How often do we find “non-obvious” tags?
    Is del.icio.us really more up-to-date than a search engine?
    What coverage does del.icio.us have of the web?




Terminology




Definitions
       Triple A triple is a <user_i, tag_j, url_k> tuple, signifying that user i has
              tagged URL k with tag j.
         Post A post is a URL bookmarked by a user, together with the associated
              metadata. A post is made up of many triples, though it may also contain
              information such as a user comment.
        Label A label is a <tag_i, url_k> pair signifying that at least one triple
              containing tag i and URL k exists in the system.
         Host The full host part of a URL. For example, in
              http://i.stanford.edu/index.html, i.stanford.edu is the host.
       Domain The institutional-level part of the host. For example, in
              http://i.stanford.edu/index.html, stanford.edu is the domain.
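
The definitions above can be sketched in Python. This is only an illustration: the names `Triple`, `labels`, `host_of`, and `domain_of` are hypothetical, and treating the last two dot-separated parts of the host as the domain is a simplifying assumption that fails for suffixes such as .co.uk.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class Triple(NamedTuple):
    """One <user, tag, url> triple: `user` tagged `url` with `tag`."""
    user: str
    tag: str
    url: str


def labels(triples):
    """Return the set of <tag, url> labels backed by at least one triple."""
    return {(t.tag, t.url) for t in triples}


def host_of(url):
    """Full host part of a URL, e.g. i.stanford.edu."""
    return urlsplit(url).hostname or ""


def domain_of(url):
    """Naive domain: the last two parts of the host (assumption, see lead-in)."""
    return ".".join(host_of(url).split(".")[-2:])


# Hypothetical triples; the users and tags are made up for illustration.
triples = [
    Triple("alice", "search", "http://i.stanford.edu/index.html"),
    Triple("bob", "search", "http://i.stanford.edu/index.html"),
    Triple("bob", "stanford", "http://i.stanford.edu/index.html"),
]
print(labels(triples))
print(host_of(triples[0].url), domain_of(triples[0].url))
```

Two users applying the same tag to the same URL produce two triples but only one label, which is why the label set here has two entries rather than three.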




Collection of Data

   Possible Sources


Del.icio.us Interfaces
     “Recent” feed provides the most recent bookmarks posted to
     del.icio.us in real time
     All posts for a given URL
     All posts by a given user
     Most recent posts with a given tag

Crawl
Alternatively, one can crawl del.icio.us treating it as a tripartite graph of
users, URLs and tags.
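
A minimal sketch of the crawl idea, assuming the triples are already available as plain `(user, tag, url)` tuples. The function names and the toy data are hypothetical; a real crawl would fetch del.icio.us pages rather than walk an in-memory structure.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Undirected adjacency over typed nodes such as ("user", "alice")."""
    adj = defaultdict(set)
    for user, tag, url in triples:
        u, t, r = ("user", user), ("tag", tag), ("url", url)
        # Each triple links its user, tag, and URL pairwise.
        for a, b in ((u, r), (t, r), (u, t)):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def crawl(adj, seed):
    """Breadth-first traversal from a seed node; returns nodes in visit order."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(adj[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Hypothetical data: carol shares no tag or URL with the other users.
triples = [
    ("alice", "search", "http://a.example/"),
    ("bob", "search", "http://b.example/"),
    ("carol", "music", "http://c.example/"),
]
adj = build_graph(triples)
reached = crawl(adj, ("user", "alice"))
print(reached)
```

Starting from alice, the walk reaches bob through the shared tag but never reaches carol. A crawl only sees what is connected to its seeds, which may be one reason a crawled dataset leans towards popular URLs, tags, and users.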




Collection of Data

  Datasets



  (C)rawl  Large-scale crawl of del.icio.us in September 2006.
 (R)ecent  Data gathered using the del.icio.us recent feed interface for nearly
           8 months beginning September 28, 2006.
  (M)onth  Data gathered from the del.icio.us recent feed interface for one
           complete month starting May 25, 2007. The gathering process was
           enhanced, so this dataset is more accurate than the R dataset.




Collection of Data




Comparison
                (C)rawl                                    (R)ecent        (M)onth
 Posts          ≈ 22M                                      ≈ 11M           ≈ 3.6M
 Unique URLs    ≈ 1.3M                                     ≈ 3M            ≈ 2.5M
 Disadvantage   Biased towards popular URLs, tags, users   Missing data    Missing data




Collection of Data




Query Dataset
AOL Query Dataset
About 20 million search queries by roughly 650,000 users
Used to simulate distribution of queries that a search engine might receive.




URLs




                        Figure: Overview



URLs    Result 1 (Positive)

   Result 1




Aim
Are pages posted to del.icio.us often recently modified?




URLs    Result 1 (Positive)

   Methodology



Modification Date of a Web page
   As we studied in previous papers, determining the exact modification
   date of a web page is hard.
    The search engines have to estimate the modification date of a web
    page in order to crawl the web efficiently.
    Yahoo! Search API gives the modification date of a web page.
    The authors use this API to determine the modification date of a web
    page.




URLs    Result 1 (Positive)

   Methodology




Compare
 del.icio.us Pages sampled from del.icio.us recent feed as they were
             posted
Yahoo! 1, 10, and 100 The top 1, 10, and 100 results (respectively) of
             Yahoo! searches for queries sampled from the AOL query
             dataset.
        ODP Pages sampled from the Open Directory Project (dmoz.org)




URLs    Result 1 (Positive)




Results
    Pages from del.icio.us are often more recently modified than pages from ODP.
    Found a correlation between a search result being ranked higher and a
    result having been modified more recently.
            Top 10 results from Yahoo! Search were about the same age as the
            pages found bookmarked in del.icio.us .

Conclusion
del.icio.us users post interesting pages that are actively updated or have
been recently created.




URLs    Result 2 (Positive)

   Result 2




Aim
How many pages posted to del.icio.us are not known to a search engine?




URLs    Result 2 (Positive)




Methodology
    Sample pages from the del.icio.us feed as they were posted, and then
    run searches on those pages immediately after.
    Of those pages, about 42.5% were not found. This could be due to
    several reasons:
           Page is indexed under another canonicalized URL
           Could be spam
           Could be an odd MIME-type for example an image
           Page may simply not have been found yet
    Keep searching for each missing page over the next four weeks. If it is
    found later, assume it had not yet been indexed when it was posted.




URLs    Result 2 (Positive)




Result
    Out of 5,724 URLs that were sampled and missing, 1,750 were later
    found.
    Implies roughly 30% of the missing URLs were new URLs.
    Implies roughly 12.5% of the sampled del.icio.us URLs were new to the
    search engine, i.e. 42.5% × 30%.
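
The arithmetic behind these figures can be checked in a few lines. The counts and the 42.5% rate are taken from the slides; the variable names are hypothetical.

```python
sampled_missing = 5724   # sampled URLs not found by the search engine at posting time
later_found = 1750       # of those, URLs found during the following weeks
missing_rate = 0.425     # share of sampled del.icio.us pages not found initially

# Share of the missing URLs that turned out to be new (later indexed).
new_share_of_missing = later_found / sampled_missing

# Share of all sampled del.icio.us URLs that were new to the search engine.
new_share_of_all = missing_rate * new_share_of_missing

print(f"{new_share_of_missing:.1%} of missing URLs were later found")
print(f"{new_share_of_all:.1%} of sampled URLs were new to the search engine")
```

Using the exact counts gives a slightly higher value than multiplying the rounded 42.5% and 30%, which yields 12.75%. Both are in the same range as the roughly 12.5% quoted.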

Conclusion
del.icio.us can serve as a (small) data source for new web pages and to
help crawl ordering.




URLs    Result 2 (Positive)




                        Figure: Result 2



URLs    Result 3 (Positive)




Aim
Check coverage of search results by del.icio.us




URLs    Result 3 (Positive)




Methodology
    Sample queries from AOL dataset based on query event frequency
    (Implies biased towards popular queries).
    Run query on Yahoo! Search
    Intersect search results with datasets C, M, R.
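
The intersection step can be sketched as follows. This assumes search results are already available as ranked lists of URLs per query and that `bookmarked_urls` holds the URLs of one of the datasets. The function name and the toy data are hypothetical, and no real search API is called.

```python
def coverage(results_by_query, bookmarked_urls, k):
    """Fraction of top-k search results, over all queries, that are bookmarked."""
    hits = total = 0
    for results in results_by_query.values():
        top = results[:k]
        total += len(top)
        hits += sum(1 for url in top if url in bookmarked_urls)
    return hits / total if total else 0.0


# Hypothetical ranked results for two queries.
results_by_query = {
    "q1": ["http://a.example/", "http://b.example/", "http://c.example/"],
    "q2": ["http://d.example/", "http://e.example/"],
}
bookmarked_urls = {"http://a.example/", "http://c.example/", "http://d.example/"}

print(coverage(results_by_query, bookmarked_urls, k=1))
print(coverage(results_by_query, bookmarked_urls, k=3))
```

Comparing the value for a small `k` with the value for a large `k` is what shows whether bookmarked URLs concentrate near the top of the ranking.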




URLs    Result 3 (Positive)




Results
    For the top 100 results, del.icio.us covers 9% of the results returned
    for a set of over 30,000 queries.
     For the top 10 results, del.icio.us covers 19% of the results returned.

Conclusion
del.icio.us URLs are disproportionately common in search results compared
to their coverage.




URLs    Result 4 (Positive)




Q. Is there some subset of users responsible for most of the data in
del.icio.us ?
      On social news sites, it is commonly cited that the majority of front
      page posts come from a dedicated group of fewer than 100 users.
     del.icio.us does exhibit some of these traits, but it is not as dependent
     on a relatively small group of users.
     The top 10% of users account for only 56% of the posts.
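
A statistic like "the top 10% of users account for 56% of the posts" could be computed along these lines. The function name and the toy data are hypothetical; the input is assumed to be one user name per post.

```python
from collections import Counter


def top_user_share(post_users, fraction):
    """Share of posts made by the most active `fraction` of users."""
    counts = sorted(Counter(post_users).values(), reverse=True)
    n_top = max(1, round(len(counts) * fraction))
    return sum(counts[:n_top]) / sum(counts)


# Hypothetical posts: 10 users and 16 posts, with one heavy poster.
posts = ["u1"] * 6 + ["u2"] * 2 + [f"u{i}" for i in range(3, 11)]
print(top_user_share(posts, 0.10))
```

In this toy data the single most active user, who is 10% of all users, made 6 of the 16 posts.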




URLs    Result 4 (Positive)




                        Figure: Result 4



URLs    Result 5 (Positive)




How much of the information posted to del.icio.us is new?
   Estimated using dataset M.
    A post in dataset M was of a URL not already in del.icio.us 40% of the time.
    Should be about 30% after adjusting for filtering (how the authors
    arrived at this figure is unclear).
    How often is a completely new domain added to del.icio.us?
           12% of posts in Dataset M were URLs whose domains were not in
           either Dataset C or R.
           Implies about 1/8th of the time
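
A rough sketch of how such fractions could be measured, assuming the earlier datasets are available as a set of known URLs. The function names are hypothetical, the domain extraction is a naive last-two-labels rule, and a URL posted twice in the new data is counted as new both times, which is a simplification.

```python
from urllib.parse import urlsplit


def naive_domain(url):
    """Last two dot-separated parts of the host (simplifying assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def new_fractions(new_posts, known_urls):
    """Return (share of posts with an unseen URL, share with an unseen domain)."""
    known_domains = {naive_domain(u) for u in known_urls}
    unseen_url = sum(1 for u in new_posts if u not in known_urls)
    unseen_domain = sum(1 for u in new_posts if naive_domain(u) not in known_domains)
    n = len(new_posts)
    return unseen_url / n, unseen_domain / n


# Hypothetical data standing in for earlier datasets and a month of posts.
known_urls = {"http://a.example.com/x", "http://b.example.org/"}
new_posts = [
    "http://a.example.com/x",   # already known URL
    "http://a.example.com/y",   # new URL on a known domain
    "http://c.example.net/",    # new URL on a new domain
    "http://b.example.org/z",   # new URL on a known domain
]
url_share, domain_share = new_fractions(new_posts, known_urls)
print(url_share, domain_share)
```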




URLs    Result 5 (Positive)




                        Figure: Result 5




URLs    Result 8 (Negative)




Aim
How many URLs are posted to del.icio.us every day?




URLs    Result 8 (Negative)




Methodology
    Plot the posts for every hour in Dataset M and compare them with data
    collected by Philipp Keller (http://deli.ckoma.net/stats, now defunct).
    The two are mutually reinforcing.
       Also plot posts from dataset R.




URLs       Result 8 (Negative)




Results
    About 92,000 posts per weekend day
       About 133,000 posts per weekday
       Implies about 851,000 posts per week
       About 44 million posts per year (for comparison, there are about
       1.5 million blog posts per day)
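
The weekly and yearly totals follow from the daily rates. The sketch below uses the rounded rates from the slide; the variable names are hypothetical.

```python
weekday_posts = 133_000        # posts per weekday (rounded figure from the slide)
weekend_day_posts = 92_000     # posts per weekend day (rounded figure from the slide)
blog_posts_per_day = 1_500_000

per_week = 5 * weekday_posts + 2 * weekend_day_posts
per_year = per_week * 52
avg_per_day = per_week / 7
ratio = avg_per_day / blog_posts_per_day

print(per_week, per_year, round(ratio, 3))
```

With the rounded inputs the weekly total comes out at 849,000, slightly below the 851,000 quoted, presumably because the quoted figure was computed from unrounded rates. The yearly total is still about 44 million, and the daily average is a little under a tenth of the blog-post rate.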


Conclusion
    Compared to blog posts, the number of posts per day is small: about
    1/10 as many.
       Posting rate on del.icio.us is marked by a series of increases followed
       by periods of relative stability.



URLs    Result 9 (Negative)




Aim
What is the size of del.icio.us ?




URLs    Result 9 (Negative)




Methodology
        Divide time into three periods.
                        t_1 Period before Schachter’s announcement on May 24th [a]
                        t_2 Between May 24th and the start of Philipp Keller’s data gathering
                        t_3 From the start of Philipp Keller’s data gathering to the present
        t_1 + t_2 + t_3 = 400,000 + (p_1 × d_b × f) + (n_k × f + m_k × d_k × f)
        Equal to about 117 million posts [b]
        A reasonable estimate should be between 60 and 150 million posts. [c]
        The authors estimate that between 20 and 50 percent of posts are unique URLs.
   [a] Joshua Schachter, creator of del.icio.us, announced in May 2004 that there were
       400,000 posts and 200,000 URLs.
   [b] Most likely an overestimate, as the authors chose upper-bound values for d_b and d_k.
   [c] Does not include private posts.




URLs    Result 9 (Negative)




Results
    There are about 115 million public posts. [a]
     There are about 30-50 million unique URLs.
   [a] The authors estimate that there are between 60 and 150 million posts. Note that
       115 million is not the midpoint of that range (105 million).


Conclusion
The number of total posts is relatively small compared to the web as a
whole.




URLs    Result 9 (Negative)




                        Figure: Result 9




Tags    Result 6 (Positive)




Aim
Is there any correlation between tags and queries?




Tags    Result 6 (Positive)




Methodology
    Checked the tag-query overlap between the tags in dataset M and the
    query terms in the AOL query dataset.
    About 22% of the AOL query dataset consists of URL or domain queries;
    these were removed.
    Removed certain stop-word-like tags from Dataset M.
    Plotted number of times a tag occurs in Dataset M versus the
    number of times it occurs in the AOL query dataset.
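
One way to sketch the overlap measurement is shown below. The URL filter `looks_like_url` is a crude stand-in and not the authors' rule, stop-word removal is omitted, and the function names and toy data are hypothetical.

```python
from collections import Counter


def looks_like_url(query):
    """Crude test for URL- or domain-style queries (assumption, not the paper's rule)."""
    q = query.strip().lower()
    return q.startswith(("http://", "https://", "www.")) or ("." in q and " " not in q)


def top_tag_query_share(tags, queries, n):
    """Share of non-URL queries containing at least one of the top-n tags as a term."""
    top = {tag for tag, _ in Counter(t.lower() for t in tags).most_common(n)}
    kept = [q.lower().split() for q in queries if not looks_like_url(q)]
    if not kept:
        return 0.0
    return sum(1 for terms in kept if top.intersection(terms)) / len(kept)


# Hypothetical tags (one entry per tag use) and queries.
tags = ["music", "music", "news", "java"]
queries = [
    "free music downloads",
    "www.example.com",
    "weather today",
    "java tutorial",
    "local news",
]
print(top_tag_query_share(tags, queries, n=1))
print(top_tag_query_share(tags, queries, n=3))
```

Raising `n` widens the set of tags considered, so the share of matching queries can only stay the same or grow, which matches the direction of the figures reported for the top 100, 500, and 1000 tags.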




Tags    Result 6 (Positive)




Figure: A scatter plot of tag count versus query count for top tags and queries in
del.icio.us and AOL query dataset


Tags    Result 6 (Positive)




Results
    At least one of the top 100, 500, and 1000 tags occurred in 8.6%, 25.3%,
    and 36.8%, respectively, of the non-domain, non-URL queries.

Conclusion
del.icio.us may be able to help with queries where tags overlap with query
terms.




Tags    Result 7 (Positive)




Aim
Are the tags in del.icio.us of good quality, or are they nonsensical tags like
“cool”, “fi32”, etc.?




Tags    Result 7 (Positive)




Methodology: User Study
    10 people (graduate students and “mix of individuals associated with
    our department”) manually evaluate posts to determine their quality.
    Sampled one post out of every five hundred, and then gave blocks of
    posts for individuals to label.
    Most individuals labeled 100 to 150 posts.
    For each tag, evaluators were asked whether the tag was “relevant”, “applies to
    the whole document,” and/or “subjective.”
    Bar for relevance was set low: whether a random person would agree
    that it was reasonable to say that the tag described the page.




Tags    Result 7 (Positive)




Results
    Only about 7% were deemed subjective (less than one in twenty for
    all users)
    No “spam”

Conclusion
Tags on the whole are of good quality.




Tags    Result 10 (Negative)




Aim
Do people use tags which are not obvious from the context?




Tags    Result 10 (Negative)




Methodology
    Randomly pick 20,000 posts from Dataset M.
    Convert HTML to text. Also look at page text of pages that link to
    the URL in question (backlinks) and pages that are linked from the
    URL in question (forward links).
    Extract tokens. Check whether pages are in English or not.
    Lower case all tags and tokens.
    Compare the tags against the extracted tokens.
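
The comparison step can be sketched as below, assuming the HTML has already been converted to plain text and the backlink and forward-link pages have already been fetched. The tokenizer and the function names are hypothetical, and language detection is omitted.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tag_locations(tag, title, page_text, backlink_texts, forward_texts):
    """Report where a lower-cased tag appears as a token."""
    tag = tag.lower()
    return {
        "title": tag in tokens(title),
        "page": tag in tokens(page_text),
        "backlinks": any(tag in tokens(t) for t in backlink_texts),
        "forward": any(tag in tokens(t) for t in forward_texts),
    }


# Hypothetical page and neighbouring pages.
found = tag_locations(
    "Java",
    title="Java Tutorials",
    page_text="Learn the java language step by step",
    backlink_texts=["a page about programming"],
    forward_texts=["java api docs"],
)
print(found)
```

Because tags are matched against single tokens, hyphenated or multi-word tags would never match in this sketch. A fuller version would need a tokenization rule for them.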




Tags    Result 10 (Negative)




Results
    50% of the time, the tag is in the page text.
    16% of the time, it is in the title itself.
    20% of the time, it appears in all three places: the page it annotates, at
    least one of its backlinks, and at least one of its forward links.
    80% of the time, it appears in at least one of three places: the page, its
    backlinks, or its forward links.
    The tags in the other 20% seem to be of lower quality: misspellings,
    confusing tagging schemes (food/dining).

Conclusion
    Most tags can be discovered by a search engine



Tags    Result 11 (Negative)




Aim
Are some domains strongly correlated with particular tags and vice-versa?




Tags    Result 11 (Negative)




Example




Table: This example lists the five hosts in Dataset C with the most URLs
annotated with the tag java.




Tags    Result 11 (Negative)




Methodology
    Used Dataset C which is highly biased towards popular URLs, tags
    and users. Therefore, the results of this experiment do not necessarily
    apply to del.icio.us as a whole.
    Build a simple binary classifier and see how it does.
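
The classification function itself is not reproduced in these slides, so the following is only one plausible host-based rule, not the authors' classifier. It predicts that a tag applies to a URL when the URL's host is one where at least `min_precision` of the known URLs carry that tag. All names, the threshold, and the example hosts are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Full host part of a URL."""
    return urlsplit(url).hostname or ""


def train(labels, tag, min_precision=0.5):
    """Hosts where at least `min_precision` of known URLs carry `tag`."""
    urls_by_host = defaultdict(set)
    tagged_by_host = defaultdict(set)
    for label_tag, url in labels:
        host = host_of(url)
        urls_by_host[host].add(url)
        if label_tag == tag:
            tagged_by_host[host].add(url)
    return {
        host
        for host, urls in urls_by_host.items()
        if len(tagged_by_host[host]) / len(urls) >= min_precision
    }


def predict(url, hosts):
    """Binary decision: does the tag apply to this URL, judging by host alone?"""
    return host_of(url) in hosts


# Hypothetical <tag, url> labels.
labels = [
    ("java", "http://java.example.com/a"),
    ("java", "http://java.example.com/b"),
    ("docs", "http://java.example.com/b"),
    ("news", "http://news.example.com/1"),
    ("java", "http://news.example.com/2"),
    ("news", "http://news.example.com/3"),
]
java_hosts = train(labels, "java")
print(java_hosts)
print(predict("http://java.example.com/new", java_hosts))
```

Such a rule never looks at the page itself, which is the point of the experiment: if it performs well, the host already carries much of the information the tag would add.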




                           Figure: Function for classification




Tags    Result 11 (Negative)




Result
Domains are often highly correlated with particular tags and vice-versa.

Conclusion
It may be more efficient to train librarians to label domains than to ask
users to tag pages.




Discussion

   Summary


Advantages
    Actively updated
    Prominent in search results
    Tags are relevant and objective

Disadvantages
    Small amount of data
    Tags are mostly already present in titles, page text, URLs
    Not good enough to be used by major search engines.




Discussion




Discussion
    Personalized search using del.icio.us bookmarks.
    I found the conclusions drawn in subsection Result 1 hard to believe.
    I found the conclusions drawn in subsection Result 5 hard to believe.
    I found the conclusions drawn in subsection Result 11 hard to believe.




Discussion


Heymann, Koutrika, and Garcia-Molina. 2008. Can Social
Bookmarking Improve Web Search? WSDM 2008.




Ashish Jain (INF384H)     Social Bookmarking     Paper Presentation   51 / 51

More Related Content

What's hot

The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...kaufmanmpbbjegmwn
 
The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...disillusionedne46
 
Get 'em in, Get 'em out: Finding a Road from Turnaway Data to Repurposed Space
Get 'em in, Get 'em out: Finding a Road from Turnaway Data to Repurposed SpaceGet 'em in, Get 'em out: Finding a Road from Turnaway Data to Repurposed Space
Get 'em in, Get 'em out: Finding a Road from Turnaway Data to Repurposed SpaceNikki DeMoville
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
An Introduction and Applications of DOI
An Introduction and Applications of DOIAn Introduction and Applications of DOI
An Introduction and Applications of DOINader Ale Ebrahim
 
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...Crossref
 
Share Scientific Data to Improve Research Visibility and Impact
Share Scientific Data to Improve Research Visibility and ImpactShare Scientific Data to Improve Research Visibility and Impact
Share Scientific Data to Improve Research Visibility and ImpactNader Ale Ebrahim
 
Computer study lesson - Internet Search (25 Mar 2020)
Computer study lesson - Internet Search (25 Mar 2020)Computer study lesson - Internet Search (25 Mar 2020)
Computer study lesson - Internet Search (25 Mar 2020)wmsklang
 
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Ted Drake
 

What's hot (10)

The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...
 
The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...
 
Get 'em in, Get 'em out: Finding a Road from Turnaway Data to Repurposed Space
Get 'em in, Get 'em out: Finding a Road from Turnaway Data to Repurposed SpaceGet 'em in, Get 'em out: Finding a Road from Turnaway Data to Repurposed Space
Get 'em in, Get 'em out: Finding a Road from Turnaway Data to Repurposed Space
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
An Introduction and Applications of DOI
An Introduction and Applications of DOIAn Introduction and Applications of DOI
An Introduction and Applications of DOI
 
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
 
Share Scientific Data to Improve Research Visibility and Impact
Share Scientific Data to Improve Research Visibility and ImpactShare Scientific Data to Improve Research Visibility and Impact
Share Scientific Data to Improve Research Visibility and Impact
 
Computer study lesson - Internet Search (25 Mar 2020)
Computer study lesson - Internet Search (25 Mar 2020)Computer study lesson - Internet Search (25 Mar 2020)
Computer study lesson - Internet Search (25 Mar 2020)
 
Semantic search
Semantic searchSemantic search
Semantic search
 
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
Open Source Search Tools for www2010 conferencesourcesearchtoolswww20100426dA...
 

Viewers also liked

Ok mews catalogue
Ok mews catalogueOk mews catalogue
Ok mews catalogueokmews
 
아크로니스 전제품 제안서 폼
아크로니스 전제품 제안서 폼아크로니스 전제품 제안서 폼
아크로니스 전제품 제안서 폼kyoseok99
 
Christmas Gift Ideas Opening
Christmas Gift Ideas OpeningChristmas Gift Ideas Opening
Christmas Gift Ideas Openingchristmas2011
 
Bedrijfsprofiel voor kandidaten
Bedrijfsprofiel voor kandidatenBedrijfsprofiel voor kandidaten
Bedrijfsprofiel voor kandidatenTelelinQ
 
TGT Company Overview_Apr_2016
TGT Company Overview_Apr_2016TGT Company Overview_Apr_2016
TGT Company Overview_Apr_2016Freddy Tse
 
Taj Template[1] Destination Wedding Package & Advertising Sug
Taj Template[1] Destination Wedding Package & Advertising SugTaj Template[1] Destination Wedding Package & Advertising Sug
Taj Template[1] Destination Wedding Package & Advertising SugLali14
 
EAP παρουσίαση πτυχιακής εργασίας
EAP παρουσίαση πτυχιακής εργασίας EAP παρουσίαση πτυχιακής εργασίας
EAP παρουσίαση πτυχιακής εργασίας Michalis Skrimpas
 

Viewers also liked (14)

王晓丹 Ppt
王晓丹 Ppt王晓丹 Ppt
王晓丹 Ppt
 
Lend me your ears
Lend me your earsLend me your ears
Lend me your ears
 
Ok mews catalogue
Ok mews catalogueOk mews catalogue
Ok mews catalogue
 
Keepitpositivegarland
KeepitpositivegarlandKeepitpositivegarland
Keepitpositivegarland
 
Asteroïden
AsteroïdenAsteroïden
Asteroïden
 
王晓丹 Ppt
王晓丹 Ppt王晓丹 Ppt
王晓丹 Ppt
 
Wangxiaodan ppt
Wangxiaodan pptWangxiaodan ppt
Wangxiaodan ppt
 
아크로니스 전제품 제안서 폼
아크로니스 전제품 제안서 폼아크로니스 전제품 제안서 폼
아크로니스 전제품 제안서 폼
 
Christmas Gift Ideas Opening
Christmas Gift Ideas OpeningChristmas Gift Ideas Opening
Christmas Gift Ideas Opening
 
Bedrijfsprofiel voor kandidaten
Bedrijfsprofiel voor kandidatenBedrijfsprofiel voor kandidaten
Bedrijfsprofiel voor kandidaten
 
TGT Company Overview_Apr_2016
TGT Company Overview_Apr_2016TGT Company Overview_Apr_2016
TGT Company Overview_Apr_2016
 
Taj Template[1] Destination Wedding Package & Advertising Sug
Taj Template[1] Destination Wedding Package & Advertising SugTaj Template[1] Destination Wedding Package & Advertising Sug
Taj Template[1] Destination Wedding Package & Advertising Sug
 
EAP παρουσίαση πτυχιακής εργασίας
EAP παρουσίαση πτυχιακής εργασίας EAP παρουσίαση πτυχιακής εργασίας
EAP παρουσίαση πτυχιακής εργασίας
 
Phonetic Chart
 Phonetic Chart Phonetic Chart
Phonetic Chart
 

Similar to Paper Presentation for INF 384H (http://courses.ischool.utexas.edu/Lease_Matt/2011/Fall/INF384H/)

Zhishi.me - Weaving Chinese Linking Open Data
Zhishi.me - Weaving Chinese Linking Open DataZhishi.me - Weaving Chinese Linking Open Data
Zhishi.me - Weaving Chinese Linking Open DataXing Niu
 
Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...Information retrieval is the process of accessing data resources. Usually doc...
Information retrieval is the process of accessing data resources. Usually doc...NALESVPMEngg
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCSCJournals
 
Structure, Personalization, Scale: A Deep Dive into LinkedIn Search
Structure, Personalization, Scale: A Deep Dive into LinkedIn SearchStructure, Personalization, Scale: A Deep Dive into LinkedIn Search
Structure, Personalization, Scale: A Deep Dive into LinkedIn SearchC4Media
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web PagesMichael Nelson
 
Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking  Profile-based Dataset Recommendation for RDF Data Linking
Profile-based Dataset Recommendation for RDF Data Linking Mohamed BEN ELLEFI
 
Structured data and metadata evaluation methodology for organizations looking...
Structured data and metadata evaluation methodology for organizations looking...Structured data and metadata evaluation methodology for organizations looking...
Structured data and metadata evaluation methodology for organizations looking...Emily Kolvitz
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebIOSR Journals
 
Design the Search Experience
Design the Search ExperienceDesign the Search Experience
Design the Search ExperienceMarianne Sweeny
 
Search Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your CustomersSearch Analytics: Conversations with Your Customers
Search Analytics: Conversations with Your Customersrichwig
 
End of Term Harvest User Interface
End of Term Harvest User Interface End of Term Harvest User Interface
End of Term Harvest User Interface misstracyjo
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptSamuelKetema1
 
Techniques For Deep Query Understanding
Techniques For Deep Query UnderstandingTechniques For Deep Query Understanding
Techniques For Deep Query UnderstandingAbhay Prakash
 
User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)User-Friendly Database Interface Design (804)
User-Friendly Database Interface Design (804)amytaylor
 
Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3Sweeny ux-seo om-cap 2014_v3
Sweeny ux-seo om-cap 2014_v3Marianne Sweeny
 

Paper Presentation for INF 384H (http://courses.ischool.utexas.edu/Lease_Matt/2011/Fall/INF384H/)

  • 1. Can Social Bookmarking Improve Web Search? Ashish Jain Information Retrieval Paper Presentation
  • 2. Outline 1 Introduction 2 Terminology 3 Collection of Data 4 Related Work 5 URLs Result 1 (Positive) Result 2 (Positive) Result 3 (Positive) Result 4 (Positive) Result 5 (Positive) Result 8 (Negative) Result 9 (Negative) 6 Tags Result 6 (Positive) Result 7 (Positive) Result 10 (Negative) Result 11 (Negative) 7 Discussion
  • 3. Introduction What is social bookmarking? Show video (http://www.commoncraft.com/video/social-bookmarking). Ashish Jain (INF384H) Social Bookmarking Paper Presentation 3 / 51
  • 4. Introduction Figure: Major types of data used by search engines
  • 5. Introduction What information does del.icio.us have? Lots of <url, tag, user> tuples. How can del.icio.us information help a search engine? If the URLs are unknown to a search engine, they can be added to the list of URLs to be crawled. Vocabulary problem: users use different words to refer to the same information. For example, a user searching for pain killers might enter the query “analgesic”.
  • 6. Introduction Possibilities: Suppose K means known to a search engine and U means unknown to a search engine.
                   Tags (K)        Tags (U)
      URLs (K)     Both known      Tags unknown
      URLs (U)     URLs unknown    Both tags and URLs unknown
      When will del.icio.us information be useful to a search engine? When the URLs in del.icio.us are not a subset of the URLs crawled by a search engine. When the tags given to a web page are not present in the URL, title, or content of that page.
  • 7. Introduction The authors try to answer the following questions: How often do we find “non-obvious” tags? Is del.icio.us really more up to date than a search engine? What coverage does del.icio.us have of the web?
  • 8. Terminology Definitions. Triple: a triple is a <user_i, tag_j, url_k> tuple, signifying that user i has tagged URL k with tag j. Post: a post is a URL bookmarked by a user and the associated metadata. A post is made up of many triples, though it may also contain information such as a user comment. Label: a label is a <tag_i, url_k> pair signifying that at least one triple containing tag i and URL k exists in the system. Host: the full host part of a URL; for example, in http://i.stanford.edu/index.html, i.stanford.edu is the host. Domain: the institutional-level part of the host; for example, in http://i.stanford.edu/index.html, stanford.edu is the domain.
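The definitions above can be sketched as a small data model. This is only an illustration: the names `Triple`, `labels`, `host`, and `domain` are hypothetical, and the two-part domain rule is a simplifying assumption that fails for suffixes such as `co.uk`.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class Triple(NamedTuple):
    """One <user, tag, url> triple: the user tagged the URL with the tag."""
    user: str
    tag: str
    url: str


def labels(triples):
    """Collapse triples into the set of distinct (tag, url) labels."""
    return {(t.tag, t.url) for t in triples}


def host(url):
    """Full host part of a URL, e.g. 'i.stanford.edu'."""
    return urlsplit(url).hostname


def domain(url):
    """Naive domain: the last two dot-separated parts of the host.

    Assumption for illustration only; a public-suffix list would be
    needed to handle hosts such as 'example.co.uk' correctly.
    """
    return ".".join(host(url).split(".")[-2:])


URL = "http://i.stanford.edu/index.html"
triples = [
    Triple("alice", "databases", URL),
    Triple("bob", "databases", URL),  # same label as alice's triple
    Triple("bob", "research", URL),
]
```

Here three triples collapse into two labels, because two users applied the same tag to the same URL.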
  • 9. Collection of Data Possible Sources. Del.icio.us interfaces: the “Recent” feed, which provides the most recent bookmarks posted to del.icio.us in real time; all posts for a given URL; all posts by a given user; the most recent posts with a given tag. Crawl: alternatively, one can crawl del.icio.us, treating it as a tripartite graph of users, URLs, and tags.
  • 10. Collection of Data Datasets. (C)rawl: large-scale crawl of del.icio.us in September 2006. (R)ecent: data gathered using the del.icio.us recent feed interface for nearly 8 months beginning September 28, 2006. (M)onth: data gathered from the del.icio.us recent feed interface for one complete month starting May 25, 2007; the gathering process was enhanced, so it is more accurate than the R dataset.
  • 11. Collection of Data Comparison
                       (C)rawl                  (R)ecent        (M)onth
      Posts            ≈ 22M                    ≈ 11M           ≈ 3.6M
      Unique URLs      ≈ 1.3M                   ≈ 3M            ≈ 2.5M
      Disadvantage     Biased towards popular   Missing data    Missing data
                       URLs, tags, users
  • 12. Collection of Data Query Dataset. AOL query dataset: about 20 million search queries by roughly 650,000 users. Used to simulate the distribution of queries that a search engine might receive.
  • 13. URLs Figure: Overview
  • 14. URLs Result 1 (Positive) Aim: Are pages posted to del.icio.us often recently modified?
  • 15. URLs Result 1 (Positive) Methodology: Modification date of a web page. As we studied in previous papers, determining the exact modification date of a web page is hard. Search engines have to estimate the modification date of a web page in order to crawl the web efficiently. The Yahoo! Search API reports the modification date of a web page, and the authors use it to determine modification dates.
  • 16. URLs Result 1 (Positive) Methodology: Compare. del.icio.us: pages sampled from the del.icio.us recent feed as they were posted. Yahoo! 1, 10, and 100: the top 1, 10, and 100 results (respectively) of Yahoo! searches for queries sampled from the AOL query dataset. ODP: pages sampled from the Open Directory Project (dmoz.org).
  • 17. URLs Result 1 (Positive) Results: Pages from del.icio.us are often more recently modified than pages from ODP. There is a correlation between a search result being ranked higher and having been modified more recently. The top 10 results from Yahoo! Search were about the same age as the pages bookmarked in del.icio.us. Conclusion: del.icio.us users post interesting pages that are actively updated or have been recently created.
  • 18. URLs Result 2 (Positive) Aim: How many pages posted to del.icio.us are not known to a search engine?
  • 19. URLs Result 2 (Positive) Methodology: Sample pages from the del.icio.us feed as they are posted, and then search for those pages immediately afterwards. Of those pages, about 42.5% were not found. This could be due to several reasons: the page is indexed under another canonicalized URL; it could be spam; it could have an odd MIME type, for example an image; the page may not have been found yet. Continue searching for the page over the next four weeks. If it is found later, assume it was new and had not yet been indexed when it was posted.
  • 20. URLs Result 2 (Positive) Result: Out of 5,724 sampled URLs that were missing, 1,750 were later found. This implies that roughly 30% of the missing URLs were new URLs, i.e. roughly 12.5% of the URLs sampled from del.icio.us (42.5% × 30%). Conclusion: del.icio.us can serve as a (small) data source for new web pages and can help with crawl ordering.
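The arithmetic behind Result 2 can be checked directly from the figures on the slide. The variable names are illustrative. With unrounded inputs the shares come out at about 30.6% and about 13.0%; rounding the first to 30% gives 12.75%, which the slide reports as roughly 12.5%.

```python
sampled_missing = 5724  # sampled URLs not found by the search engine
found_later = 1750      # of those, found during the follow-up searches
missing_rate = 0.425    # share of posted pages not found at posting time

# Share of missing URLs that later appeared, i.e. were plausibly new pages.
new_given_missing = found_later / sampled_missing

# Share of all sampled del.icio.us URLs that were new to the search engine.
new_overall = missing_rate * new_given_missing

print(f"new among missing URLs: {new_given_missing:.1%}")
print(f"new among all sampled URLs: {new_overall:.1%}")
```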
  • 21. URLs Result 2 (Positive) Figure: Result 2
  • 22. URLs Result 3 (Positive) Aim: Check the coverage of search results by del.icio.us.
  • 23. URLs Result 3 (Positive) Methodology: Sample queries from the AOL dataset by query event frequency (which biases the sample towards popular queries). Run each query on Yahoo! Search. Intersect the search results with datasets C, M, and R.
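The intersection step can be sketched as follows. This is a minimal sketch, assuming search results are available as ranked URL lists per query and that URLs have already been normalized so that exact string comparison is meaningful; the function name `coverage` and the toy data are hypothetical.

```python
def coverage(results_by_query, bookmarked_urls, k):
    """Fraction of top-k search results whose URL is in bookmarked_urls."""
    hits = 0
    total = 0
    for results in results_by_query.values():
        top_k = results[:k]
        total += len(top_k)
        hits += sum(1 for url in top_k if url in bookmarked_urls)
    return hits / total if total else 0.0


# Toy data: ranked result lists for two queries and a set of bookmarked URLs.
results_by_query = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
}
bookmarked_urls = {"a", "e", "z"}

top1 = coverage(results_by_query, bookmarked_urls, k=1)
top3 = coverage(results_by_query, bookmarked_urls, k=3)
```

Computing the same ratio at several values of k is what allows a top-10 versus top-100 comparison.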
  • 24. URLs Result 3 (Positive) Results: For the top 100 results, del.icio.us covers 9% of the results returned for a set of over 30,000 queries. For the top 10 results, del.icio.us covers 19% of the results returned. Conclusion: del.icio.us URLs are disproportionately common in search results compared to their coverage.
  • 25. URLs Result 4 (Positive) Q. Is some subset of users responsible for most of the data in del.icio.us? On social news sites, it is commonly cited that the majority of front-page posts come from a dedicated group of fewer than 100 users. del.icio.us exhibits some of these traits, but it is not as dependent on a relatively small group of users: the top 10% of users account for only 56% of the posts.
  • 26. URLs Result 4 (Positive) Figure: Result 4
  • 27. URLs Result 5 (Positive) How much of the information added to del.icio.us is new? Estimated using dataset M. A new post in dataset M was for a URL not already in del.icio.us 40% of the time; this should be about 30% after adjusting for filtering (how the authors arrived at this number is not stated). How often is a completely new domain added to del.icio.us? 12% of posts in dataset M were URLs whose domains were in neither dataset C nor dataset R, i.e. about 1/8 of the time.
  • 28. URLs Result 5 (Positive) Figure: Result 5
  • 29. URLs Result 8 (Negative) Aim: How many URLs are posted to del.icio.us every day?
  • 30. URLs Result 8 (Negative) Methodology: Plot the posts for every hour in dataset M and compare them with data collected by Philipp Keller (http://deli.ckoma.net/stats, now defunct). The two are mutually reinforcing. Also plot posts from dataset R.
  • 31. URLs Result 8 (Negative) Results: About 92,000 posts per day on weekends. About 133,000 posts per weekday. This implies about 851,000 posts per week, or about 44 million posts per year. (For comparison, there are about 1.5 million blog posts per day.) Conclusion: Compared to blog posts, the number of del.icio.us posts per day is small, about 1/10. The posting rate on del.icio.us is marked by a series of increases followed by periods of relative stability.
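The weekly and yearly totals follow from the daily rates. The sketch below uses the rounded figures from the slide, so it yields 849,000 posts per week rather than the reported 851,000; the gap presumably comes from rounding the daily rates, which is an assumption rather than something the slide states.

```python
weekend_per_day = 92_000        # posts per weekend day (rounded, from the slide)
weekday_per_day = 133_000       # posts per weekday (rounded, from the slide)
blog_posts_per_day = 1_500_000  # blog posts per day, for comparison

per_week = 2 * weekend_per_day + 5 * weekday_per_day
per_year = 52 * per_week
avg_per_day = per_week / 7

# Ratio of del.icio.us posts to blog posts on an average day.
ratio_to_blogs = avg_per_day / blog_posts_per_day

print(f"{per_week:,} posts per week, {per_year:,} posts per year")
print(f"del.icio.us / blog posts per day: {ratio_to_blogs:.3f}")
```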
  • 32. URLs Result 9 (Negative) Aim: What is the size of del.icio.us?
  • 33. URLs Result 9 (Negative) Methodology: Divide time into three periods. $t_1$: the period before Schachter’s announcement on May 24th (Joshua Schachter, creator of del.icio.us, announced in May 2004 that there were 400,000 posts and 200,000 URLs). $t_2$: from May 24th to the start of Philipp Keller’s data gathering. $t_3$: from the start of Philipp Keller’s data gathering to the present. $t_1 + t_2 + t_3 = 400{,}000 + (p_1 \times d_b \times f) + (n_k \times f + m_k \times d_k \times f)$, which equals about 117 million posts (most likely an overestimate, as the authors chose upper-bound values for $d_b$ and $d_k$). A reasonable estimate should be between 60 and 150 million posts, not including private posts. An estimated 20 to 50 percent of posts are unique URLs.
  • 34. URLs Result 9 (Negative) Results: There are about 115 million public posts. (The authors estimate that there are between 60 and 150 million posts; 115 million is not the average of 60 and 150 million!) There are about 30-50 million unique URLs. Conclusion: The total number of posts is relatively small compared to the web as a whole.
  • 35. URLs Result 9 (Negative) Figure: Result 9
  • 36. Tags Result 6 (Positive) Aim: Is there any correlation between tags and queries?
  • 37. Tags Result 6 (Positive) Methodology: Checked the tag-query overlap between the tags in dataset M and the query terms in the AOL query dataset. 22% of the AOL query dataset is made up of domain or URL queries; these were removed. Removed certain stop-word-like tags from dataset M. Plotted the number of times a tag occurs in dataset M against the number of times it occurs in the AOL query dataset.
  • 38. Tags Result 6 (Positive) Figure: A scatter plot of tag count versus query count for top tags and queries in del.icio.us and the AOL query dataset
  • 39. Tags Result 6 (Positive) Results: One of the top 100, 500, and 1000 tags occurred in 8.6%, 25.3%, and 36.8% of these non-domain, non-URL queries, respectively. Conclusion: del.icio.us may be able to help with queries where tags overlap with query terms.
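A statistic of this shape can be computed as sketched below. The slides do not say how tags were matched against queries, so whitespace tokenization with exact, lowercased token matching is an assumption here, and the function name `top_tag_query_share` and the toy data are hypothetical.

```python
from collections import Counter


def top_tag_query_share(tag_counts, queries, k):
    """Share of queries containing at least one of the k most frequent tags."""
    top_tags = {tag for tag, _ in tag_counts.most_common(k)}
    matched = sum(1 for q in queries if top_tags & set(q.lower().split()))
    return matched / len(queries) if queries else 0.0


# Toy data standing in for tag counts and for filtered (non-URL) queries.
tag_counts = Counter({"design": 5, "blog": 4, "music": 3, "java": 1})
queries = [
    "free music downloads",
    "java tutorial",
    "weather austin",
    "web design blog",
]

share_top2 = top_tag_query_share(tag_counts, queries, k=2)
share_top3 = top_tag_query_share(tag_counts, queries, k=3)
```

The share can only grow as k increases, which matches the rising percentages for the top 100, 500, and 1000 tags.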
  • 40. Tags Result 7 (Positive) Aim: Are the tags in del.icio.us of good quality, or are they nonsensical tags like “cool”, “fi32”, etc.?
  • 41. Tags Result 7 (Positive) Methodology: User study. 10 people (graduate students and a “mix of individuals associated with our department”) manually evaluated posts to determine their quality. The authors sampled one post out of every five hundred and gave blocks of posts to individuals to label. Most individuals labeled 100 to 150 posts. For each tag, evaluators were asked whether the tag was “relevant”, “applies to the whole document,” and/or “subjective.” The bar for relevance was set low: whether a random person would agree that it was reasonable to say that the tag described the page.
  • 42. Tags Result 7 (Positive) Results: Only about 7% of tags were deemed subjective (less than one in twenty for all users). No “spam”. Conclusion: Tags on the whole are of good quality.
  • 43. Tags Result 10 (Negative) Aim: Do people use tags that are not obvious from the context?
  • 44. Tags Result 10 (Negative) Methodology: Randomly pick 20,000 posts from dataset M. Convert HTML to text. Also look at the text of pages that link to the URL in question (backlinks) and pages that are linked from the URL in question (forward links). Extract tokens. Check whether pages are in English. Lowercase all tags and tokens. Compare.
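The comparison step can be sketched as below, assuming the page, backlink, and forward-link texts have already been fetched and converted from HTML. The tokenizer (lowercase runs of letters and digits) and the function names are assumptions for illustration; tags containing punctuation, such as "c++", would need a different tokenizer.

```python
import re

TOKEN = re.compile(r"[a-z0-9]+")


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(TOKEN.findall(text.lower()))


def tag_locations(tag, page_text, backlink_texts, forward_texts):
    """Report where a tag occurs: the page, any backlink, any forward link."""
    tag = tag.lower()
    return {
        "page": tag in tokens(page_text),
        "backlinks": any(tag in tokens(t) for t in backlink_texts),
        "forward_links": any(tag in tokens(t) for t in forward_texts),
    }


found = tag_locations(
    "Python",
    page_text="Learn Python today",
    backlink_texts=["A great python guide"],
    forward_texts=["Download page"],
)
```

A tag counts as discoverable from context if any of the three values is true, and as appearing everywhere if all three are.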
  • 45. Tags Result 10 (Negative) Results: 50% of the time, the tag is in the page text. 16% of the time, it is in the title itself. 20% of the time, it appears in all three places: the page it annotates, at least one of its backlinks, and at least one of its forward links. 80% of the time, it appears in at least one of the three places: the page, its backlinks, or its forward links. The tags in the other 20% seem to be of lower quality: misspellings and confusing tagging schemes (food/dining). Conclusion: Most tags can be discovered by a search engine.
  • 46. Tags Result 11 (Negative) Aim: Are some domains strongly correlated with particular tags and vice versa?
  • 47. Tags Result 11 (Negative) Example Table: This example lists the five hosts in dataset C with the most URLs annotated with the tag java.
  • 48. Tags Result 11 (Negative) Methodology: Used dataset C, which is highly biased towards popular URLs, tags, and users; therefore, the results of this experiment do not necessarily apply to del.icio.us as a whole. Build a simple binary classifier and see how it performs. Figure: Function for classification
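The classification function itself appears only in a figure that is not reproduced in this transcript, so the following is a hypothetical stand-in rather than the authors' classifier. It predicts that a tag applies to a URL when the share of that domain's known URLs carrying the tag reaches a threshold. The names `train` and `predict`, the 0.5 threshold, and the naive two-part domain rule are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Naive domain: last two dot-separated parts of the host (assumption)."""
    return ".".join(urlsplit(url).hostname.split(".")[-2:])


def train(labels):
    """Map (domain, tag) to the fraction of the domain's URLs with that tag.

    labels is an iterable of (tag, url) pairs.
    """
    urls_by_domain = defaultdict(set)
    urls_by_domain_tag = defaultdict(set)
    for tag, url in labels:
        d = domain_of(url)
        urls_by_domain[d].add(url)
        urls_by_domain_tag[(d, tag)].add(url)
    return {
        (d, tag): len(urls) / len(urls_by_domain[d])
        for (d, tag), urls in urls_by_domain_tag.items()
    }


def predict(model, tag, url, threshold=0.5):
    """Predict the tag applies if the domain's tag share meets the threshold."""
    return model.get((domain_of(url), tag), 0.0) >= threshold


# Toy labels; the URLs are made up for illustration.
model = train([
    ("java", "http://java.sun.com/a"),
    ("java", "http://java.sun.com/b"),
    ("java", "http://example.org/x"),
    ("cooking", "http://example.org/y"),
    ("cooking", "http://example.org/z"),
])
```

A domain-level rule like this only works well when a domain is dominated by one topic, which is the correlation this experiment sets out to measure.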
  • 49. Tags Result 11 (Negative) Result: Domains are often highly correlated with particular tags and vice versa. Conclusion: It may be more efficient to train librarians to label domains than to ask users to tag pages.
  • 50. Discussion Summary. Advantages: actively updated; prominent in search results; tags are relevant and objective. Disadvantages: small amount of data; tags are mostly already present in titles, page text, and URLs. Overall: not good enough to be used by major search engines.
  • 51. Discussion: Personalized search using del.icio.us bookmarks. I found the conclusions drawn in subsection Result 1 hard to believe. I found the conclusions drawn in subsection Result 5 hard to believe. I found the conclusions drawn in subsection Result 11 hard to believe.
  • 52. Discussion Heymann, Koutrika, and Garcia-Molina. 2008. Can Social Bookmarking Improve Web Search? WSDM 2008.