SlideShare ist ein Scribd-Unternehmen logo
1 von 36
Retrieval and Feedback Models
      for Blog Distillation


  CMU at the TREC 2007 Blog Track

    Jonathan Elsas, Jaime Arguello,
     Jamie Callan, Jaime Carbonell
CMU’s Blog Distillation Focus
• Two Research Questions:
  What is the appropriate unit of retrieval?
   Feeds or Entries?
  How can we effectively do pseudo-relevance
   feedback for Blog Distillation?


• Our four submissions investigate these
  two dimensions.
Outline
• Corpus Preprocessing Tasks
• Two Feedback Models
• Two Retrieval Models
• Results
• Discussion
Corpus Preprocessing
• Used only FEED documents (vs. PERMALINK
  or HOMEPAGE documents).
• For each FEEDNO, extracted each new entry
  across all feed fetches
• Aggregated into 1-document-per-FEEDNO,
  retaining structural elements from the FEED
  documents:
  Title, description, entry, entry title, etc…
• Very helpful: Python Universal Feed Parser
Two Pseudo-Relevance
        Feedback Models
• Indri’s Built in Pseudo-Relevance Feedback
  (Lavrenko’s Relevance Model)
  – Using Metzler’s Dependence Model query on the
    full feed documents
  – Produces weighted unigram PRF query,
             QRM = #weight( w1 t1 w2 t2 … w50 t50)


• Wikipedia-based Pseudo-Relevance Feedback
  – Focus on anchor text linking to highly ranked
    documents wrt baseline BOW query
Wikipedia-based PRF
            Wikipedia




BOW query




                  …
Wikipedia-based PRF
                Wikipedia



Relevance Set
(top T)


BOW query


                            Working Set
                            (top N)



                      …
Wikipedia-based PRF
                     Wikipedia



Relevance Set
(top T)




                                 Working Set
                                 (top N)



                           …
Wikipedia-based PRF
                              Wikipedia



Relevance Set                                       Extract anchor text from
                                                    Working Set that point to
(top T)                                             the Relevance Set.




                                                         Working Set
                                                         (top N)
       Qwiki = #weight( w1 a-text1 w2 a-text2 … w20 a-text20)

                where wi ~ sum( T …
                                  - rank(target(ai)) )
Wikipedia-based PRF




         Query: “Apple iPod”
Two Pseudo-Relevance
  Feedback Models
    Q 983 “Photography”
  Wikipedia PRF         Indri’s Relevance Model
        photography     photography
       photographer     aerial
       depth of field    digital
              camera    full
         photograph     resource
       pulitzer prize   stock
      digital camera    free
   photographic film     information
    photojournalism     art
    cinematography      wedding
       shutter speed    great
Two Pseudo-Relevance
      Feedback Models
           Q 995 “Ruby on Rails”
        Wikipedia PRF         Indri’s Relevance Model
                   pokmon     rail
                       ruby   ruby
  pokmon ruby and sapphire    kakutani
            pokmon emerald    que
 ruby programming language    article
                        php   weblog
pokmon firered and leafgreen   activestate
         pokmon adventures    rubyonrail
       pokmon video games     new
             standard gauge   develop
                       tram   dontstopmusic
Two Retrieval Models
• Large Document model
  – Entire Feed is the unit of retrieval
• Small Document model
  – Individual entry is the unit of retrieval
  – Ranked Entries are aggregated into a Feed
    Ranking
Large Document model
Feed Document
Collection



     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Large Document model
Feed Document
Collection

                         Q
     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Large Document model
Feed Document
Collection                   Ranked Feeds

                         Q
     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Small Document model
  Entry Document
     Collection



             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document             Q
     Collection



             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document                 Q
     Collection
                             Ranked Entries

             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document                 Q
     Collection                               Ranked Feeds
                             Ranked Entries

             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Large Document Model
• Language Modeling retrieval model using
  the feed document structure.
• Features used for Large Document
  model:
  – P(Feed | Qfeed-title)
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Large Document Model
• Language Modeling retrieval model using
  the feed document structure.
• Features used for Large Document
Large Document Indri Query:
  model:
  – P(Feed | Qfeed-title)
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Small Document Model
• Feed ranking in blog distillation is analogous to
  resource ranking in federated search
                     Feed ~ Resource
                    Entry ~ Document


• Created sampled collection
   – Sampled (with replacement) 100 entries from each
     feed
   – Control for dramatically different feed lengths
Small Document Model
• Feed ranking in blog distillation is analogous to
  resource ranking in federated search
                     Feed ~ Resource
                    Entry ~ Document


• Created sampled collection
   – Sampled (with replacement) 100 entries from each
     feed
   – Control for dramatically different feed lengths
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art
 federated search algorithm
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art length:
Assuming uniform prior, equal feed
 federated search algorithm
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art length:
Assuming uniform prior, equal feed
 federated search algorithm
Small Document Model
• Features used in the small document
  model :
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Small Document Model
• Features used in the small document
  model :
Small Document Indri Query:
  – P(Feed | Q        )
               entry-text

  – P(Feed | QRM)
  – P(Feed | Qwiki)
Parameter Setting
• Selecting feature weights (λ’s) required
  training data
• Relevance judgments produced for a
  small subset of the queries (6+2)
  BOW title query, 50 docs judged/query
• Simple grid search to choose parameters
  that maximized MAP
Results




Our best run:
 Wikipedia expansion + Large Doc.
Results




Queries (ordered by CMUfeedW performance)
Feature Elimination
CMUfeedW




                    CMUfeed
Conclusions
• What worked
  –   Preprocessing the corpus & using only feed XML
  –   Simple retrieval model with appropriate features
  –   Wikipedia expansion
  –   (small amount of) training data
• What didn’t       (… or what we tried without success)
  – Small Document Model (but we think it can)
  – Spam/Splog detection, anchor text, URL
    segmentation
Thank You
Feature Elimination
                       +Training   -Training
+All (CMUfeedW)         0.3695      0.3536

-Dependence             0.3676      0.3457
Model
-Wiki (CMUfeed)         0.3385      0.3152

-Wiki, -Indri’s PRF                 0.2989

-Wiki, -Indri’s PRF,                0.2907
-Dep Model

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (6)

JSON-B for CZJUG
JSON-B for CZJUGJSON-B for CZJUG
JSON-B for CZJUG
 
Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with Ripple
 
Java EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web frontJava EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web front
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 
Message passing
Message passingMessage passing
Message passing
 
Configure Your Projects with Apache Tamaya
Configure Your Projects with Apache TamayaConfigure Your Projects with Apache Tamaya
Configure Your Projects with Apache Tamaya
 

Ähnlich wie CMU TREC 2007 Blog Track

Nuxeo JavaOne 2007
Nuxeo JavaOne 2007Nuxeo JavaOne 2007
Nuxeo JavaOne 2007
Stefane Fermigier
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Tatsuhiko Miyagawa
 

Ähnlich wie CMU TREC 2007 Blog Track (20)

Jstl Guide
Jstl GuideJstl Guide
Jstl Guide
 
Decoupling Official Statistics
Decoupling Official StatisticsDecoupling Official Statistics
Decoupling Official Statistics
 
Modern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettextModern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettext
 
Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and Scala
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle Database
 
Deploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero DowntimeDeploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero Downtime
 
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
 
Smart Client Development
Smart Client DevelopmentSmart Client Development
Smart Client Development
 
Qtp online training
Qtp online trainingQtp online training
Qtp online training
 
Fisl - Deployment
Fisl - DeploymentFisl - Deployment
Fisl - Deployment
 
WebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All RevolutionaryWebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All Revolutionary
 
Java Annotations and Pre-processing
Java  Annotations and Pre-processingJava  Annotations and Pre-processing
Java Annotations and Pre-processing
 
2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIA2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIA
 
ISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to Go
 
Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019
 
Nuxeo JavaOne 2007
Nuxeo JavaOne 2007Nuxeo JavaOne 2007
Nuxeo JavaOne 2007
 
Rack Middleware
Rack MiddlewareRack Middleware
Rack Middleware
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...
 
Broadleaf Presents Thymeleaf
Broadleaf Presents ThymeleafBroadleaf Presents Thymeleaf
Broadleaf Presents Thymeleaf
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Kürzlich hochgeladen (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

CMU TREC 2007 Blog Track

  • 1. Retrieval and Feedback Models for Blog Distillation CMU at the TREC 2007 Blog Track Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell
  • 2. CMU’s Blog Distillation Focus • Two Research Questions: What is the appropriate unit of retrieval? Feeds or Entries? How can we effectively do pseudo-relevance feedback for Blog Distillation? • Our four submissions investigate these two dimensions.
  • 3. Outline • Corpus Preprocessing Tasks • Two Feedback Models • Two Retrieval Models • Results • Discussion
  • 4. Corpus Preprocessing • Used only FEED documents (vs. PERMALINK or HOMEPAGE documents). • For each FEEDNO, extracted each new entry across all feed fetches • Aggregated into 1-document-per-FEEDNO, retaining structural elements from the FEED documents: Title, description, entry, entry title, etc… • Very helpful: Python Universal Feed Parser
  • 5. Two Pseudo-Relevance Feedback Models • Indri’s Built in Pseudo-Relevance Feedback (Lavrenko’s Relevance Model) – Using Metzler’s Dependence Model query on the full feed documents – Produces weighted unigram PRF query, QRM = #weight( w1 t1 w2 t2 … w50 t50) • Wikipedia-based Pseudo-Relevance Feedback – Focus on anchor text linking to highly ranked documents wrt baseline BOW query
  • 6. Wikipedia-based PRF Wikipedia BOW query …
  • 7. Wikipedia-based PRF Wikipedia Relevance Set (top T) BOW query Working Set (top N) …
  • 8. Wikipedia-based PRF Wikipedia Relevance Set (top T) Working Set (top N) …
  • 9. Wikipedia-based PRF Wikipedia Relevance Set Extract anchor text from Working Set that point to (top T) the Relevance Set. Working Set (top N) Qwiki = #weight( w1 a-text1 w2 a-text2 … w20 a-text20) where wi ~ sum( T … - rank(target(ai)) )
  • 10. Wikipedia-based PRF Query: “Apple iPod”
  • 11. Two Pseudo-Relevance Feedback Models Q 983 “Photography” Wikipedia PRF Indri’s Relevance Model photography photography photographer aerial depth of field digital camera full photograph resource pulitzer prize stock digital camera free photographic film information photojournalism art cinematography wedding shutter speed great
  • 12. Two Pseudo-Relevance Feedback Models Q 995 “Ruby on Rails” Wikipedia PRF Indri’s Relevance Model pokmon rail ruby ruby pokmon ruby and sapphire kakutani pokmon emerald que ruby programming language article php weblog pokmon firered and leafgreen activestate pokmon adventures rubyonrail pokmon video games new standard gauge develop tram dontstopmusic
  • 13. Two Retrieval Models • Large Document model – Entire Feed is the unit of retrieval • Small Document model – Individual entry is the unit of retrieval – Ranked Entries are aggregated into a Feed Ranking
  • 14. Large Document model Feed Document Collection <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 15. Large Document model Feed Document Collection Q <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 16. Large Document model Feed Document Collection Ranked Feeds Q <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 17. Small Document model Entry Document Collection <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 18. Small Document model Entry Document Q Collection <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 19. Small Document model Entry Document Q Collection Ranked Entries <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 20. Small Document model Entry Document Q Collection Ranked Feeds Ranked Entries <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 21. Large Document Model • Language Modeling retrieval model using the feed document structure. • Features used for Large Document model: – P(Feed | Qfeed-title) – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 22. Large Document Model • Language Modeling retrieval model using the feed document structure. • Features used for Large Document Large Document Indri Query: model: – P(Feed | Qfeed-title) – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 23. Small Document Model • Feed ranking in blog distillation is analogous to resource ranking in federated search Feed ~ Resource Entry ~ Document • Created sampled collection – Sampled (with replacement) 100 entries from each feed – Control for dramatically different feed lengths
  • 24. Small Document Model • Feed ranking in blog distillation is analogous to resource ranking in federated search Feed ~ Resource Entry ~ Document • Created sampled collection – Sampled (with replacement) 100 entries from each feed – Control for dramatically different feed lengths
  • 25. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art federated search algorithm
  • 26. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art length: Assuming uniform prior, equal feed federated search algorithm
  • 27. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art length: Assuming uniform prior, equal feed federated search algorithm
  • 28. Small Document Model • Features used in the small document model : – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 29. Small Document Model • Features used in the small document model : Small Document Indri Query: – P(Feed | Q ) entry-text – P(Feed | QRM) – P(Feed | Qwiki)
  • 30. Parameter Setting • Selecting feature weights (λ’s) required training data • Relevance judgments produced for a small subset of the queries (6+2) BOW title query, 50 docs judged/query • Simple grid search to choose parameters that maximized MAP
  • 31. Results Our best run: Wikipedia expansion + Large Doc.
  • 32. Results Queries (ordered by CMUfeedW performance)
  • 34. Conclusions • What worked – Preprocessing the corpus & using only feed XML – Simple retrieval model with appropriate features – Wikipedia expansion – (small amount of) training data • What didn’t (… or what we tried without success) – Small Document Model (but we think it can) – Spam/Splog detection, anchor text, URL segmentation
  • 36. Feature Elimination +Training -Training +All (CMUfeedW) 0.3695 0.3536 -Dependence 0.3676 0.3457 Model -Wiki (CMUfeed) 0.3385 0.3152 -Wiki, -Indri’s PRF 0.2989 -Wiki, -Indri’s PRF, 0.2907 -Dep Model