Adam Rae
Vanessa Murdock, Adrian Popescu, Hugues Bouchard
     SIGIR 2012, Portland, Oregon, Entities Session
    I’m at Adam’s Bar…

                        Mining the Web for
                         Points of Interest

          Using social media to increase our
                     knowledge of the world
Contents

§ Motivation

§ Point Of Interest (POI) extraction using user
   generated data

§ POI localisation using social media

§ Conclusions
Motivation
§ Geographic Points of Interest are valuable
   representations of important places in the world
   around us.

§ Browsing and search
   of POIs increasingly
   important
 ›    Web search
 ›    Mobile
 ›    Navigation
Where do POIs come from?

§ Editing listings coming from national mapping agencies (NMAs),
   commercial directories, etc.
 ›    Costly process
 ›    Expensive to maintain freshness
 ›    Coverage
§ Do they reflect the kind of
   places that people are
   interested in looking for?
Can we get them from the web?
§ Un/semi-structured mentions of POIs throughout
   text on the web
 ›    Lots of context

§ Structured mentions of POIs in micro blogging
   systems and Wikipedia articles
 ›    Easy to extract
When is a POI not a POI?

1  The White House is at 1600 Pennsylvania
   Avenue, Washington DC.

2  The White House released a statement today
   suggesting the moon is made of cheese.

3  The people living in the white house at the end
   of the street turned out to be Martians.
Europe According to Foursquare
The World According to Foursquare
The World According to Gowalla
The World According to Wikipedia
Can we bootstrap using social media?

§ Train Conditional Random Fields (CRF) using
   web snippets bootstrapped from structured
   mentions in micro-blog entries
 ›    Extract POI, use as query to search engine
 ›    Resultant snippets filtered to those that contain POI
 ›    Sanitise


§ Also from geocoded Wikipedia articles (according
   to Yago2)
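
The bootstrapping steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `bio_tag` and `bootstrap`, the B-POI/I-POI/O label names, the simple regex tokeniser and the exact, case-sensitive match of the POI name are all assumptions, and the search engine call and the "Sanitise" step are left out.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")  # assumed tokeniser: words and single punctuation marks


def bio_tag(snippet: str, poi: str) -> list[tuple[str, str]]:
    """Label each token of the snippet as B-POI, I-POI or O."""
    tokens = TOKEN.findall(snippet)
    target = TOKEN.findall(poi)
    labels = ["O"] * len(tokens)
    n = len(target)
    for i in range(len(tokens) - n + 1):
        # Exact, case-sensitive match of the whole POI name (an assumption).
        if n and tokens[i:i + n] == target:
            labels[i] = "B-POI"
            for j in range(i + 1, i + n):
                labels[j] = "I-POI"
    return list(zip(tokens, labels))


def bootstrap(poi: str, snippets: list[str]) -> list[list[tuple[str, str]]]:
    """Keep only the snippets that contain the POI, returned as tagged token lists."""
    tagged = (bio_tag(s, poi) for s in snippets)
    return [t for t in tagged if any(label != "O" for _, label in t)]


snippets = [
    "It was only after he had left the Marriott Hotel that he remembered",
    "A snippet that never mentions the place",
]
training_examples = bootstrap("Marriott Hotel", snippets)
```

Each surviving snippet becomes one labelled training sequence for the tagger. Matching on the name alone cannot tell a place mention from a non-place use of the same words, which is the ambiguity the "When is a POI not a POI?" slide points out.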
Ground Truth Data
§ Created by manual assessors given explicit
   instructions
 ›    1,337 examples of POIs in (some) context
 ›    1,066 unique POIs
 ›    Inter-assessor agreement:

      Ground Truth Assessor   Precision   Recall   F-Measure
      1                       0.749       0.792    0.770
      2                       0.814       0.716    0.762
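
As a quick check on the agreement figures, the sketch below recomputes the F-measure from the reported precision and recall. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the slide does not say so explicitly, but the reported values are consistent with that reading. The function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for each ground-truth assessor, taken from the slide.
for assessor, p, r in [(1, 0.749, 0.792), (2, 0.814, 0.716)]:
    print(assessor, f"{f_measure(p, r):.3f}")
```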
Sequential Tagging Model


\[
p(Y \mid X, \lambda) = \frac{1}{Z(X)} \exp\left( \sum_j \lambda_j F_j(Y, X) \right)
\]

\[
\operatorname*{argmax}_{\Lambda} \left\{ \frac{1}{Z(X)} \exp\left( \sum_j \lambda_j F_j(Y, X) \right) \right\}
\]
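
To make the model concrete, here is a toy sketch that evaluates p(Y | X, λ) by brute-force enumeration of all label sequences. It only illustrates the formula: the two-label set, the three feature functions and the weights are hypothetical, a real CRF computes Z(X) with dynamic programming rather than enumeration, and training (choosing the weights Λ) is not shown.

```python
import math
from itertools import product

LABELS = ("O", "POI")  # hypothetical label set


def features(y, x):
    """Global feature vector F(Y, X): per-position counts summed over the sequence."""
    cap_poi = sum(1 for w, t in zip(x, y) if w[0].isupper() and t == "POI")
    lower_o = sum(1 for w, t in zip(x, y) if w[0].islower() and t == "O")
    poi_poi = sum(1 for a, b in zip(y, y[1:]) if a == b == "POI")  # label transition
    return (cap_poi, lower_o, poi_poi)


def score(y, x, lam):
    """Weighted feature sum: sum_j lambda_j * F_j(Y, X)."""
    return sum(l * f for l, f in zip(lam, features(y, x)))


def prob(y, x, lam):
    """p(Y | X, lambda), with Z(X) summed over every possible label sequence."""
    z = sum(math.exp(score(c, x, lam)) for c in product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, lam)) / z


def best_labels(x, lam):
    """Most probable label sequence for X under fixed weights (decoding, not training)."""
    return max(product(LABELS, repeat=len(x)), key=lambda c: score(c, x, lam))


x = ("left", "the", "Marriott", "Hotel")
lam = (1.0, 1.0, 0.5)  # hypothetical weights
print(best_labels(x, lam), round(prob(best_labels(x, lam), x, lam), 3))
```

Because Z(X) depends only on X, the highest-scoring sequence is also the most probable one, which is why `best_labels` can rank by the unnormalised score.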
Features
§ Lexical
 ›    Word identity, shape, position, etc.
§ Grammatical
 ›    Part of Speech, Apache OpenNLP
§ Statistical
 ›    Normalised Point-wise Mutual Information of mobile
      search query logs
§ Geographic
 ›    Gazetteer attributes from Yahoo! Placemaker
 ›    http://developer.yahoo.com/geo/placemaker/
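
Two of the feature types above can be sketched in a few lines. The word-shape mapping (X for upper case, x for lower case, d for digits) and the use of the standard normalised PMI definition, PMI divided by -log p(x, y), are assumptions; the slides do not say how the shape feature was encoded or how the query-log counts were gathered, and the counts in the example are invented.

```python
import math
import re


def word_shape(token: str) -> str:
    """Lexical shape feature: 'Marriott' -> 'Xxxxxxxx', '1600' -> 'dddd'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalised point-wise mutual information in [-1, 1] from co-occurrence counts."""
    if count_xy == 0:
        return -1.0  # the two terms never co-occur
    p_xy = count_xy / total
    if p_xy == 1.0:
        return 1.0  # the two terms always co-occur
    pmi = math.log(p_xy / ((count_x / total) * (count_y / total)))
    return pmi / -math.log(p_xy)


print(word_shape("Marriott"), word_shape("1600"))
print(npmi(count_xy=10, count_x=10, count_y=10, total=100))
```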
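Process Overview

[Figure: three parallel pipelines]
 ›    Geocoded Wikipedia Articles → Extract Article Titles → Search Engine (Bing) → Wikipedia Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Wikipedia based POI Tagger
 ›    Check-Ins (Foursquare) → Extract POI Mentions → Search Engine (Bing) → Foursquare Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Foursquare based POI Tagger
 ›    Check-Ins (Gowalla) → Extract POI Mentions → Search Engine (Bing) → Gowalla Bootstrapped Raw Web Snippets → Snippet Processing → CRF Model Training → Gowalla based POI Tagger

Example snippet: … was only after he had left the Marriott Hotel that he remembered…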
Results

Training Data   Testing Data   Precision   Recall

Y! Placemaker   Manual Data    0.237       0.228
Wikipedia       Manual Data    0.514       0.337
Foursquare      Manual Data    0.276       0.655
Gowalla         Manual Data    0.360       0.414
Wikipedia       10-fold CV     0.879       0.955
Foursquare      10-fold CV     0.689       0.468
Gowalla         10-fold CV     0.857       0.868
Language Modelling
§ Partition the world into 1km cells
§ For each, create model from Flickr photos taken
   in that area

\[
P(t \mid \theta_L) = \frac{c_{user}(t, L)}{|L|}, \qquad |L| = \sum_{t_i \in L} c_{user}(t_i, L)
\]

§ Treat problem as IR, match a POI (query) against
   the cells (document)
 ›    Return centroid of best matching cell
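
A minimal sketch of the cell language model and the retrieval step might look like this. It assumes photos arrive as (user, cell, tags) triples that have already been assigned to a grid cell, scores a query by summed log probability, and uses a small probability floor for unseen terms; the slides do not specify the smoothing or scoring function, so those choices, and all the names and data below, are hypothetical.

```python
import math
from collections import defaultdict


def build_cell_models(photos):
    """Map each cell L to {term: P(t | theta_L)} using unique-user counts c_user(t, L)."""
    users = defaultdict(set)  # (cell, term) -> users who used the term in that cell
    for user, cell, tags in photos:
        for term in set(tags):
            users[(cell, term)].add(user)
    counts = defaultdict(dict)
    for (cell, term), who in users.items():
        counts[cell][term] = len(who)
    models = {}
    for cell, term_counts in counts.items():
        size = sum(term_counts.values())  # |L|: sum of user counts over all terms in L
        models[cell] = {t: c / size for t, c in term_counts.items()}
    return models


def best_cell(query_terms, models, floor=1e-6):
    """Return the cell whose model gives the query terms the highest log-likelihood."""
    def log_likelihood(model):
        return sum(math.log(model.get(t, floor)) for t in query_terms)
    return max(models, key=lambda cell: log_likelihood(models[cell]))


photos = [
    ("u1", "A", ["eiffel", "tower"]),
    ("u2", "A", ["eiffel"]),
    ("u1", "A", ["eiffel"]),  # same user again: does not raise c_user
    ("u3", "B", ["tower", "bridge"]),
]
models = build_cell_models(photos)
print(best_cell(["eiffel", "tower"], models))
```

Counting unique users rather than raw tag occurrences keeps one prolific photographer from dominating a cell. The caller would then map the returned cell identifier to that cell's centroid.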
Performance


Median distances in kilometres:

                        Placemaker   Cascade   Geo Scope   # Examples
Placemaker POIs         0.29         0.29      0.29        134
Placemaker Other Locs   4.19         2.90      2.12        131
All Known Locs          1.17         0.82      0.79        265
New Locations           -            439.0     5.88        130
All Data                -            1.20      0.96        395
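
The speaker notes describe these figures as median distances in kilometres. One plausible way to compute such a figure is sketched below, assuming the distance is the great-circle (haversine) distance between the predicted and the true coordinates on a sphere of radius 6371 km; the slides do not state which distance measure was used, and the function names and sample coordinates are illustrative.

```python
import math
from statistics import median


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def median_error_km(predicted, actual):
    """Median distance between paired predicted and true (lat, lon) locations."""
    return median(haversine_km(*p, *a) for p, a in zip(predicted, actual))


predicted = [(45.52, -122.68), (45.50, -122.60), (45.60, -122.70)]
actual = [(45.52, -122.68), (45.51, -122.61), (45.52, -122.68)]
print(round(median_error_km(predicted, actual), 2))
```

A median is less sensitive than a mean to a few badly misplaced locations.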
Conclusions and Implications

§  POIs are valuable, but useful ones difficult to define

§  Generating evaluation data is hard

§  Can use web snippets bootstrapped with
    check-ins, and articles on Wikipedia to train POI
    tagger
 ›    Up to 88% precision on unlabelled data
 ›    Reflect the POIs users visit
 ›    Easily updated
 ›    Can be located accurately using hybrid gazetteer + Flickr
      language model technique
Benefits of this approach
§ Discover POIs:
 ›    that we already know about (replace/extend existing
      sources)
 ›    we didn’t already know about (novel POIs)
 ›    of more diverse types (increasing coverage)
 ›    that are fresher


§ Increase relevance of local and hyperlocal search
   using wisdom of the crowds
Research Areas
-  Automatic POI detection in UGC
-  Learning how users refer to places
-  Localising media
-  Generating evaluation data
 -    (This is hard)
-  Multi-source combination
-  Quality & Credibility
Thank you

Adam Rae
adamrae@yahoo-inc.com
Vanessa Murdock
Adrian Popescu
Hugues Bouchard

Editor's notes

  1. What is a POI? POIs have names, locations, category and context (depending on the envisaged use case). A point of interest (POI) is a focused geographic entity such as a landmark, a school, a historical building, or a business.
  2. news articles from the U.S. and the U.K., but also included a small number of examples from Yahoo! Answers and a small number of queries submitted to a search engine. The inter-assessor agreement was 73.9%. In total, 1,337 of the examples they annotated contained POIs, which yielded 1,066 unique POIs.
  3. Learn the set of feature weights (capital) Lambda which maximises the label sequence probability. Probability of a label sequence Y, given an observed sequence X. Z is a normalising factor. F(Y, X) is the set of feature functions computed over the observations and the label transitions.
  4. Up to ten snippets per query. Use BIO.
  5. All three models are statistically significantly higher than the baseline.
  6. C_user(t,L) is the number of unique users who use the term ‘t’ in the cell ‘L’. |L| is the sum of the user frequency of all terms in the location. It makes sense to use highly precise extant information when available, so use the LM in combination with Placemaker (gazetteer) = cascade model.
  7. Median distances in kilometres.
  8. Re-finding existing POIs allows us to get context from social media as well as confirm our model’s performance. Novel POIs are valuable, extending our knowledge of what is out there. Not restricted by the biases of existing sources like commercial enterprises or narrow criteria POIs.
  9. Wild text: web snippets, Tweets, news, etc.; varies in cleanliness and consistency depending on source. Automatically detecting POIs in UGC content (“Corner of fourth and main”). Discussion on the subjective nature of POI/location etc., very application-dependent. (How to evaluate discovery tasks?) Discussion: open questions. Localising them. Talking about manual annotation data for POI detection. (How hard is it for humans?) Analytics. Combinations of sources.