Adam Rae
Vanessa Murdock, Adrian Popescu, Hugues Bouchard
     SIGIR 2012, Portland, Oregon, Entities Session

[Illustration: speech bubble reading "I’m at Adam’s Bar…"]

                        Mining the Web for
                         Points of Interest

          Using social media to increase our
                     knowledge of the world
Contents

§ Motivation

§ Point Of Interest (POI) extraction using user
   generated data

§ POI localisation using social media

§ Conclusions
Motivation
§ Geographic Points of Interest are valuable
   representations of important places in the world
   around us.

§ Browsing and search
   of POIs increasingly
   important
 ›    Web search
 ›    Mobile
 ›    Navigation
Where do POIs come from?

§ Editing listings coming from national mapping agencies (NMAs),
   commercial directories etc.
 ›    Costly process
 ›    Expensive to maintain freshness
 ›    Coverage
§ Do they reflect the kind of
   places that people are
   interested in looking for?
Can we get them from the web?
§ Un/semi-structured mentions of POIs throughout
   text on web
 ›    Lots of context

§ Structured mentions of POIs in micro-blogging
   systems and Wikipedia articles
 ›    Easy to extract
When is a POI not a POI?

1  The White House is at 1600 Pennsylvania
   Avenue, Washington DC.

2  The White House released a statement today
   suggesting the moon is made of cheese.

3  The people living in the white house at the end
   of the street turned out to be Martians.
Europe According to Foursquare
The World According to Foursquare
The World According to Gowalla
The World According to Wikipedia
Can we bootstrap using social media?

§ Train Conditional Random Fields (CRF) using
   web snippets bootstrapped from structured
   mentions in micro-blog entries
 ›    Extract POI, use as query to search engine
 ›    Resultant snippets filtered to those that contain POI
 ›    Sanitise


§ Also from geocoded Wikipedia articles (according
   to Yago2)
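
The bootstrapping steps on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenisation, the tag-stripping used as "sanitising" and the B/I/O label scheme are all assumptions, and the search-engine call is omitted (snippets are simply passed in as strings). The default limit of ten snippets per query follows the notes accompanying the slides.

```python
import re


def sanitise(snippet):
    """Strip markup-like debris and collapse whitespace (assumed clean-up step)."""
    text = re.sub(r"<[^>]+>", " ", snippet)
    return re.sub(r"\s+", " ", text).strip()


def filter_snippets(poi, snippets, limit=10):
    """Keep at most `limit` sanitised snippets that actually contain the POI string."""
    kept = []
    for raw in snippets:
        clean = sanitise(raw)
        if poi.lower() in clean.lower():
            kept.append(clean)
        if len(kept) == limit:
            break
    return kept


def bio_labels(snippet, poi):
    """Label each whitespace token B (begins POI), I (inside POI) or O (outside)."""
    tokens = snippet.split()
    labels = ["O"] * len(tokens)
    target = poi.lower().split()
    n = len(target)
    if n == 0:
        return list(zip(tokens, labels))
    # Compare on lower-cased tokens with trailing punctuation removed.
    lowered = [t.lower().strip(".,;:!?") for t in tokens]
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            labels[i] = "B"
            for j in range(i + 1, i + n):
                labels[j] = "I"
    return list(zip(tokens, labels))
```

Labelled token sequences of this kind are the usual training input for a sequence tagger such as a CRF.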
Ground Truth Data
§ Created by manual assessors given explicit
   instructions
 ›    1,337 examples of POIs in (some) context
 ›    1,066 unique POIs
 ›    Inter-assessor agreement:

      Ground Truth Assessor   Precision   Recall   F-Measure
      1                       0.749       0.792    0.770
      2                       0.814       0.716    0.762
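
The F-Measure column is consistent with the balanced F1 score, the harmonic mean of precision and recall. The slide does not state which F-measure variant was used, so treating it as F1 is an assumption; the short check below recomputes it from the precision and recall values in the table.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall per ground-truth assessor, taken from the table.
for assessor, p, r in [(1, 0.749, 0.792), (2, 0.814, 0.716)]:
    print(assessor, format(f_measure(p, r), ".3f"))
```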
Sequential Tagging Model


$$p(Y \mid X, \lambda) = \frac{1}{Z(X)} \exp\left( \sum_j \lambda_j F_j(Y, X) \right)$$

$$\arg\max_{\Lambda} \left\{ \frac{1}{Z(X)} \exp\left( \sum_j \lambda_j F_j(Y, X) \right) \right\}$$
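
To make the conditional probability concrete, the toy sketch below evaluates p(Y | X, λ) by brute force, computing Z(X) as a sum over every possible label sequence. The three feature functions, the weights and the B/I/O label set are invented for illustration and are not the features used in the talk. Enumeration is exponential in sentence length; practical CRF implementations compute Z(X) with dynamic programming instead.

```python
import itertools
import math

LABELS = ("B", "I", "O")  # assumed label set


def features(labels, tokens):
    """Global feature vector F(Y, X): toy counts summed over all positions."""
    f = [0.0, 0.0, 0.0]
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if tok[:1].isupper() and lab != "O":
            f[0] += 1  # capitalised token labelled as part of a POI
        if lab == "I" and (i == 0 or labels[i - 1] == "O"):
            f[1] += 1  # I that does not follow a B or an I
        if tok[:1].islower() and lab == "O":
            f[2] += 1  # lower-case token labelled as outside
    return f


def score(labels, tokens, weights):
    """Weighted feature sum: sum_j lambda_j * F_j(Y, X)."""
    return sum(w * f for w, f in zip(weights, features(labels, tokens)))


def probability(labels, tokens, weights):
    """p(Y | X, lambda), with Z(X) obtained by enumerating all label sequences."""
    z = sum(math.exp(score(y, tokens, weights))
            for y in itertools.product(LABELS, repeat=len(tokens)))
    return math.exp(score(labels, tokens, weights)) / z
```

With tokens `["the", "Marriott", "Hotel"]` and weights `[1.5, -3.0, 0.5]`, the sequence `("O", "B", "I")` scores higher than `("O", "O", "O")`, and the probabilities of all 27 sequences sum to one.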
Features
§ Lexical
 ›    Word identity, shape, position, etc.
§ Grammatical
 ›    Part of Speech, Apache OpenNLP
§ Statistical
 ›    Normalised Point-wise Mutual Information of mobile
      search query logs
§ Geographic
 ›    Gazetteer attributes from Yahoo! Placemaker
 ›    http://developer.yahoo.com/geo/placemaker/
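
Two of the feature families above can be sketched without external tools. The lexical features below (word identity, shape, position) and the count-based normalised point-wise mutual information are illustrative only: the exact shape encoding and the choice of which term pairs the NPMI is computed over are assumptions, since the slide does not specify them. Part-of-speech and gazetteer features are left out because they depend on Apache OpenNLP and Yahoo! Placemaker.

```python
import math
import re


def word_shape(token):
    """Map characters to X/x/d and collapse repeated symbols."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, i):
    """Lexical features for the token at position i."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "shape": word_shape(tok),
        "position": i,
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }


def npmi(count_xy, count_x, count_y, total):
    """Normalised PMI in [-1, 1], computed from raw co-occurrence counts."""
    if count_xy == 0:
        return -1.0  # the pair never co-occurs
    p_xy = count_xy / total
    if p_xy == 1.0:
        return 1.0  # limiting case; avoids dividing by log(1) = 0
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
```

For example, `word_shape("Marriott")` gives `"Xx"` and `word_shape("1600")` gives `"d"`.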
Process Overview



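[Diagram: three bootstrapping pipelines sharing the same search, snippet-processing and training stages]

 ›    Geocoded Wikipedia Articles → Extract Article Titles → Search Engine (Bing) → Wikipedia Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Wikipedia based POI Tagger
 ›    Check-Ins (Foursquare) → Extract POI Mentions → Search Engine (Bing) → Foursquare Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Foursquare based POI Tagger
 ›    Check-Ins (Gowalla) → Extract POI Mentions → Search Engine (Bing) → Gowalla Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Gowalla based POI Tagger

Example snippet: "… was only after he had left the Marriott Hotel that he remembered…"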
Results

Training Data    Testing Data   Precision   Recall
Y! Placemaker    Manual Data    0.237       0.228
Wikipedia        Manual Data    0.514       0.337
Foursquare       Manual Data    0.276       0.655
Gowalla          Manual Data    0.360       0.414
Wikipedia        10-fold CV     0.879       0.955
Foursquare       10-fold CV     0.689       0.468
Gowalla          10-fold CV     0.857       0.868
Language Modelling
§ Partition the world into 1km cells
§ For each, create model from Flickr photos taken
   in that area

$$P(t \mid \theta_L) = \frac{c_{user}(t, L)}{|L|}, \qquad |L| = \sum_{t_i \in L} c_{user}(t_i, L)$$


§ Treat problem as IR, match a POI (query) against
   the cells (document)
 ›    Return centroid of best matching cell
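
The cell-based language model can be sketched as below. It follows the slide's definitions, with c_user(t, L) as the number of distinct users who used term t in cell L and |L| as the sum of those counts, but several details are assumptions: the grid uses 0.01-degree cells as a rough stand-in for 1 km, a small probability floor stands in for smoothing (which the slide does not specify), and the photo tuples and function names are hypothetical.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, size_deg=0.01):
    """Grid cell id; 0.01 degrees is only roughly 1 km (assumption)."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def build_models(photos, size_deg=0.01):
    """photos: iterable of (user, lat, lon, tags). Returns cell -> {term: c_user(t, L)}."""
    users = defaultdict(lambda: defaultdict(set))
    for user, lat, lon, tags in photos:
        cell = cell_of(lat, lon, size_deg)
        for tag in tags:
            users[cell][tag.lower()].add(user)  # a set, so each user counts once
    return {cell: {t: len(u) for t, u in terms.items()}
            for cell, terms in users.items()}


def log_likelihood(query_terms, counts, floor=1e-6):
    """Sum of log P(t | theta_L); `floor` replaces zero probabilities (assumed)."""
    size = sum(counts.values())  # |L|
    return sum(math.log(max(counts.get(t, 0) / size, floor))
               for t in query_terms)


def locate(poi, models, size_deg=0.01):
    """Treat the POI name as a query and return the centroid of the best cell."""
    terms = poi.lower().split()
    best = max(models, key=lambda cell: log_likelihood(terms, models[cell]))
    return ((best[0] + 0.5) * size_deg, (best[1] + 0.5) * size_deg)
```

Counting distinct users rather than raw tag occurrences keeps a single prolific photographer from dominating a cell's model.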
Performance


(Per the editor's notes, values are median distances in kilometres.)

                        Placemaker   Cascade   Geo Scope   # Examples
Placemaker POIs         0.29         0.29      0.29        134
Placemaker Other Locs   4.19         2.90      2.12        131
All Known Locs          1.17         0.82      0.79        265
New Locations           -            439.0     5.88        130
All Data                -            1.20      0.96        395
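
The notes accompanying the slides describe the cascade model as using the gazetteer (Placemaker) when it has a precise entry and the language model otherwise. A minimal sketch of that fallback, together with a median-distance evaluation, might look like the following. The dictionary-based gazetteer, the callable language model and the haversine distance with a mean Earth radius of 6371 km are assumptions made for illustration; the "Geo Scope" variant is not sketched because the slides do not define it.

```python
import math
from statistics import median


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cascade(poi, gazetteer, language_model):
    """Prefer the gazetteer's coordinates; fall back to the language model."""
    hit = gazetteer.get(poi)
    return hit if hit is not None else language_model(poi)


def median_error_km(examples, gazetteer, language_model):
    """examples: iterable of (poi, true_lat_lon). Returns the median distance in km."""
    return median(haversine_km(cascade(poi, gazetteer, language_model), truth)
                  for poi, truth in examples)
```

Reporting the median rather than the mean keeps a few very distant misplacements from dominating the summary.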
Conclusions and Implications

§  POIs are valuable, but useful ones difficult to define

§  Generating evaluation data is hard

§  Can use web snippets bootstrapped with
    check-ins, and articles on Wikipedia to train POI
    tagger
 ›    Up to 88% precision on unlabelled data
 ›    Reflect the POIs users visit
 ›    Easily updated
 ›    Can be located accurately using hybrid gazetteer + Flickr
      language model technique
Benefits of this approach
§ Discover POIs:
 ›    that we already know about (replace/extend existing
      sources)
 ›    we didn’t already know about (novel POIs)
 ›    of more diverse types (increasing coverage)
 ›    that are fresher


§ Increase relevance of local and hyperlocal search
   using wisdom of the crowds
Research Areas
-  Automatic POI detection in UGC
-  Learning how users refer to places
-  Localising media
-  Generating evaluation data
 -    (This is hard)
-  Multi-source combination
-  Quality & Credibility
Thank you

Adam Rae
adamrae@yahoo-inc.com
Vanessa Murdock
Adrian Popescu
Hugues Bouchard

Editor's notes

  1. What is a POI? POIs have names, locations, category, context (depends on envisaged use-case). A point of interest (POI) is a focused geographic entity such as a landmark, a school, a historical building, or a business.
  2. news articles from the U.S. and the U.K., but also included a small number of examples from Yahoo! Answers and a small number of queries submitted to a search engine. The inter-assessor agreement was 73.9%. In total 1,337 of the examples they annotated contained POIs, which yielded 1,066 unique POIs.
  3. Learn the set of feature weights (big) lambda which maximises the label sequence probability. Probability of a label sequence y, given an observed sequence x. Z is the normalising factor. F(Y,X) is the set of feature functions computed over the observations and the label transitions.
  4. Up to ten snippets per query. Use BIO.
  5. All three models are statistically significantly higher than the baseline.
  6. C_user(t,L) is the number of unique users who use the term ‘t’ in the cell ‘L’. |L| is the sum of the user frequency of all terms in the location. Makes sense to use highly precise extant info when available, so use LM in combination with Placemaker (gazetteer) = cascade model.
  7. Median distances in kilometres.
  8. Re-finding existing POIs allows us to get context from social media as well as confirm our model’s performance. Novel POIs are valuable, extending our knowledge of what is out there. Not restricted by the biases of existing sources like commercial enterprises or narrow criteria POIs.
  9. Wild text: web snippets, Tweets, news, etc.; varies in cleanliness and consistency depending on source. Automatically detecting POIs in UGC content (“Corner of Fourth and Main”). Discussion on the subjective nature of POI/location etc., very application-dependent (how to evaluate discovery tasks?). Discussion: open questions. Localising them. Talking about manual annotation data for POI detection (how hard is it for humans?). Analytics: combinations of sources.