Towards Context-Aware Search and Analysis on Social Media Data

Leon Derczynski
Bin Yang 杨彬
Christian S. Jensen
Evolution of communication

Functional utterances

Vowels

Velar closure: consonants

Speech

New modality: writing

Digital text

E-mail

Social media

[Figure label: "Increased machine-readable information", with a "?"]
Social Media = Big Data
Gartner "3V" definition:

1. Volume

2. Velocity

3. Variety

High volume & velocity of messages:

   Twitter has     ~20 000 000 users per month
   They write     ~500 000 000 messages per day

Massive variety:
  Stock markets;
  Earthquakes;
  Social arrangements;
  … Bieber
What is machine-readable now?
Messages now contain

-   not only linguistic content

-   but also:
       Links (e.g. URI)
       Topic markers (e.g. hashtags)
       Meta-information

What kind of meta-information?

    User profile (including home location)
    Images
    Messages replied to
    Message language

    Time of message
    Location of message
What resources do we have now?


Large, content-rich, linked, digital streams of human communication

We transfer knowledge via communication

Sampling communication gives a sample of human knowledge


          "You've only done that which you can communicate"


The metadata (time – place – imagery) gives a richer resource:


      → A sampling of human behaviour
What can we do with this resource?
Context increases the data's richness

Increased richness enables novel applications

Time and Place are interesting parts of message context




1. What kinds of applications are there?

2. What are the practical challenges?
Temporal Context
Messages have timestamps:




Two temporal retrieval scenarios:

      1. Historical analyses

      2. Emerging data
Historical search
Ability to retrieve from archives: longitudinal query mode [0]

Retrieve information on:

      ●   Lifecycle of socially connected groups

      ●   Precursors to events, analysed post hoc

[Figure: time axis, 2008 to 2011]

0. Weikum et al. 2011: "Longitudinal analytics on web archive data: It’s about time", Proc. CIDR
Historical search
Retrospective analyses into cause and effect




          "There's a dead crow in my garden"

Social media mentions of dead crows predict West Nile virus (WNV) in humans [1]

1. Sugumaran & Voss 2012: "Real-time spatio-temporal analysis of West Nile Virus using Twitter Data", Proc. Int'l Conference on Computing for Geospatial Research and Applications
Emerging search
Data emerging at high velocity:

      185 000 documents per minute

Gives a high temporal density




Search over this info enables:

      ●   Live coverage of events

      ●   Realtime identification of emerging events [2]

2. Cohen et al. 2011: "Computational journalism: A call to arms to database researchers", Proc. CIDR
Temporal indexing
What are our requirements?

   ●   High-frequency document creation

   ●   Temporal cross-sections of varying size

   ●   Time-sensitive TF/IDF: stopwords are fluid
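
One way to read the "stopwords are fluid" requirement is that document frequencies should be computed per temporal cross-section rather than over the whole archive. The sketch below illustrates that reading only; the function name `windowed_idf`, the `(timestamp, text)` tuple layout and the whitespace tokenisation are hypothetical choices, not something the slides specify.

```python
import math
from collections import Counter


def windowed_idf(docs, start, end):
    """Return {term: idf} using only documents with start <= timestamp < end.

    docs is an iterable of (timestamp, text) pairs (an assumed layout).
    """
    # Keep one set of distinct terms per document inside the window.
    window = [set(text.lower().split()) for ts, text in docs if start <= ts < end]
    n = len(window)
    # Document frequency: in how many windowed documents each term occurs.
    df = Counter(term for terms in window for term in terms)
    return {term: math.log(n / count) for term, count in df.items()}


# Toy stream with integer timestamps (illustrative data only).
docs = [
    (0, "bieber concert tonight"),
    (1, "bieber tickets"),
    (2, "bieber bieber"),
    (10, "rain in london"),
    (11, "bieber news"),
    (12, "more rain"),
]

early = windowed_idf(docs, 0, 3)    # "bieber" occurs in every document here
late = windowed_idf(docs, 10, 13)   # "bieber" is rare in this window
```

In the early window "bieber" gets an IDF of zero and so behaves like a stopword; in the late window it carries more weight than "rain". A real system would need incremental updates rather than recomputation per query.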



How can we do this? - Open challenge

   ●   Tree indexing hard to distribute

   ●   Maybe with adaptive multi-resolution grids?
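
As a rough illustration of the multi-resolution idea, the sketch below buckets integer timestamps at fixed minute, hour and day resolutions and answers a range query with the coarsest aligned buckets that fit. It is not adaptive and not distributed, so it does not address the open challenge itself; the names `cover` and `MultiResolutionTimeIndex` and the choice of resolutions are assumptions made for illustration.

```python
from collections import defaultdict


def cover(start, end, resolutions=(86400, 3600, 60)):
    """Yield (resolution, bucket_number) pairs that tile [start, end).

    resolutions must be listed coarsest first, each a multiple of the next;
    start and end must be aligned to the finest resolution.
    """
    t = start
    while t < end:
        for res in resolutions:
            # Use the coarsest bucket that starts at t and fits in the range.
            if t % res == 0 and t + res <= end:
                yield (res, t // res)
                t += res
                break
        else:
            raise ValueError("range is not aligned to the finest resolution")


class MultiResolutionTimeIndex:
    def __init__(self, resolutions=(86400, 3600, 60)):
        self.resolutions = tuple(sorted(resolutions, reverse=True))
        self.buckets = defaultdict(list)

    def add(self, doc_id, timestamp):
        # Register the document in one bucket per resolution.
        for res in self.resolutions:
            self.buckets[(res, timestamp // res)].append(doc_id)

    def query(self, start, end):
        # Collect document ids from the buckets that tile the range.
        ids = []
        for key in cover(start, end, self.resolutions):
            ids.extend(self.buckets.get(key, []))
        return ids


index = MultiResolutionTimeIndex()
index.add("a", 30)
index.add("b", 3700)
index.add("c", 7300)
two_hours = index.query(0, 7200)  # served by two hour-level buckets
```

The trade-off is write amplification (one entry per resolution per document) in exchange for touching few buckets on wide temporal cross-sections.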
Spatial Context
Demand for spatial information:

      20% of all Google searches

      53% of Bing mobile searches

Heterogeneous spatial context sources

      GPS locations (most reliable)

      Origin bounding boxes (e.g. city)

      User profile text??? [3]

      Author's friends' locations [4]

3. Hecht et al. 2011: "Tweets from Justin Bieber’s Heart: The Dynamics of the “Location” Field in User Profiles", Proc. ACM CHI
4. Rout et al. 2013: "Where's @wally? A Graph Based Method for Geolocating Users in Social Networks", Proc. ACM Hypertext
Spatial Keyword Search
How can we query a set of social media messages?

   Treat them as a set of objects, each having
      Text
      Location

   Query parameters:
     Query text
     Query location

Given query and set of messages, rank by similarity:

   Text similarity (Cosine, Siamese Learning Net, Oriented PCA)
   Separating distance (Haversine, Manhattan, Eco-routed)
   Blend these with a balancing coefficient


   (just like conventional spatial keyword search)
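
The ranking ingredients listed above can be sketched with the two simplest options named on the slide, cosine similarity and haversine distance. The linear blend, the parameter names `alpha` and `max_km`, and the conversion of distance into a 0 to 1 proximity score are assumptions for illustration; the slides do not give the exact formula.

```python
import math
from collections import Counter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def cosine_similarity(text_a, text_b):
    """Cosine similarity of raw term-count vectors (whitespace tokens)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def blended_score(query_text, query_loc, msg_text, msg_loc, alpha=0.5, max_km=20.0):
    """alpha weights spatial proximity; (1 - alpha) weights text similarity."""
    dist = haversine_km(*query_loc, *msg_loc)
    # Assumed mapping: 1 at the query point, falling linearly to 0 at max_km.
    proximity = max(0.0, 1.0 - dist / max_km)
    return alpha * proximity + (1 - alpha) * cosine_similarity(query_text, msg_text)
```

Swapping in a learned text similarity or a routed travel distance would only change the two helper functions; the blend itself stays the same.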
Spatial Keyword Search
Query: "good bar in north copenhagen"

Issued from location

Five candidate messages (labelled A to E in the figure)

Query region established

Rank by blend of location and textual similarity

       Message                                          Location  Text
   A   So drunk last night at @BarSyv                   0.7       0.6
   B   Out shoe shopping!!! #louboutintime              0.9       0.0
   C   Who pays $9 for a beer?!                         0.6       0.5
   D   wow found cph's greatest cocktail bar lol        0.1       1.0
   E   Traffic. Traffic everywhere. Need a drink.       0.4       0.2
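
To see how the balancing coefficient changes the outcome, the sketch below ranks the five candidates from the table using a linear blend. The blend formula and the name `alpha` (weight on location) are assumptions, since the slide does not state them; the scores are copied from the table.

```python
# (location similarity, text similarity) per candidate, from the table above.
candidates = {
    "A": (0.7, 0.6),
    "B": (0.9, 0.0),
    "C": (0.6, 0.5),
    "D": (0.1, 1.0),
    "E": (0.4, 0.2),
}


def rank(candidates, alpha):
    """Order message ids by alpha * location + (1 - alpha) * text, best first."""
    def score(item):
        location, text = item[1]
        return alpha * location + (1 - alpha) * text

    return [msg_id for msg_id, _ in sorted(candidates.items(), key=score, reverse=True)]


text_heavy = rank(candidates, alpha=0.3)      # favours textual match
location_heavy = rank(candidates, alpha=0.7)  # favours proximity
```

With more weight on text, D (the distant cocktail bar) comes first; with more weight on location, the nearby but irrelevant shoe-shopping message B rises to second place.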
Continuous Spatial Queries
Social media scenario characterised by:

   Streaming data

   New spatial objects constantly appearing

Two new spatial keyword query types:

   Static Continuous (SCSKQ)
      - Fixed query location
      - Tracks newly appearing objects

   Moving Continuous (MCSKQ)
     - Query location moves along a path (locus)
     - Result updated with new objects

Novel part: fresh objects continuously introduced
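
A static continuous query can be pictured as a standing top-k result that is revised whenever a new object arrives. The sketch below is a minimal illustration of that idea using a min-heap; the class name, the `offer` method and the pluggable `score_fn` are hypothetical, and it ignores expiry of old objects, which a real system would need.

```python
import heapq


class StaticContinuousQuery:
    """Keeps the k best-scoring objects seen so far on a stream."""

    def __init__(self, k, score_fn):
        self.k = k
        self.score_fn = score_fn  # e.g. a blend of text and location similarity
        self._heap = []           # min-heap of (score, arrival_number, object)
        self._arrivals = 0

    def offer(self, obj):
        """Consider a newly appeared object; return True if the result changed."""
        entry = (self.score_fn(obj), self._arrivals, obj)
        self._arrivals += 1
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True
        if entry[0] > self._heap[0][0]:
            # The new object beats the current weakest result: swap it in.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def results(self):
        """Current result, best first (earlier arrival wins ties)."""
        ordered = sorted(self._heap, key=lambda e: (-e[0], e[1]))
        return [obj for _, _, obj in ordered]


query = StaticContinuousQuery(k=2, score_fn=lambda m: m["score"])
stream = [{"id": i, "score": s} for i, s in enumerate([0.2, 0.9, 0.5, 0.7, 0.1])]
changed = [query.offer(message) for message in stream]
```

For the moving variant, the score of every retained object depends on the current query location, so the result would have to be re-evaluated as the location changes as well as when objects arrive.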
Location Diversity
Location data is unreliable

Reliability of location data... is also unreliable

"There are known knowns... we also know there are known unknowns... but there are also unknown unknowns" – Donald Rumsfeld

Text mentions require disambiguation


   ●   In profile
   ●   In messages
   ●   In queries




The requirement is to rank vague points given a vague query
Willingness to travel
Determines useful search radius

Based on mode of transport:

   [Figure: distance values 14.9 km, 22.0 km, 40.6 km, 61.5 km, >100 km]

Different for varying classes of Point Of Interest?


Spatio-temporal (ST) social media = huge dataset

   Easy data collection

   Useful for e.g. town planning
Spatio-temporal Challenges
We've seen temporal and spatial challenges; let's combine!

Given all these spatio-temporal utterances, what can we do?

   - Spatial context gives relevance from physical or travel proximity

   - Temporal context gives relevance from recency and from history



Adding text to the spatio-temporal points gives


             explicit semantic context


Not only are there ST patterns in the data; we are also told what they mean!
Topic-based Retrieval
Retrieving results on a topic is useful: "Tell me about X"

Specific terms vary between places and over time



   Over time (2007 vs 2011): en.wikipedia.org/wiki/President_of_the_United_States

   Between places (England English vs US English): "Jelly"




    … Spatio-temporally sensitive indexing?
Sentiment Monitoring
Measure how attitudes change over time and over location

Business uses: where to send marketing

Political uses: data-driven democratic... campaigning

Governance uses: what are citizen priorities in a region

Temporal dimension enables tracking of trends and reactions



[Figure legend: red = upbeat; blue = complaint. No normalisation for vocality!]
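
The caveat about vocality can be made concrete. If "vocality" is read as how many messages a region produces (an interpretation, not something the slides define), raw counts of upbeat messages mostly reflect volume, and a per-region share is one simple normalisation. The function name, the label strings and the toy data below are hypothetical.

```python
from collections import Counter, defaultdict


def upbeat_share(messages):
    """Return {region: fraction of that region's messages labelled "upbeat"}.

    messages is an iterable of (region, label) pairs, with label either
    "upbeat" or "complaint" (assumed labels matching the figure legend).
    """
    counts = defaultdict(Counter)
    for region, label in messages:
        counts[region][label] += 1
    shares = {}
    for region, c in counts.items():
        total = c["upbeat"] + c["complaint"]
        if total > 0:
            shares[region] = c["upbeat"] / total
    return shares


# A vocal region and a quiet one (illustrative data only).
messages = (
    [("city", "upbeat")] * 90 + [("city", "complaint")] * 10
    + [("village", "upbeat")] * 3 + [("village", "complaint")] * 1
)
shares = upbeat_share(messages)
```

On raw counts the city dominates both colours; after normalisation the two regions are directly comparable, although small regions then have noisy estimates.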
Local Computational Journalism
Social media is quick

Social media is uncurated

"Citizen Journalism"


News has relevance scope:
  Recency
  Proximity


Different events relevant in different contexts:
    Rain in London
    Rain in Addis Ababa

Automatic event detection [5], and also reporting!

5. Ritter et al. 2012: "Open domain event extraction from Twitter", Proc. ACM SIGKDD
Summary

Social media is a rich source of "big data"

A small sampling of all human discourse

It comes with temporal and spatial context


Context-aware search and analysis is very demanding!

   - Novel, powerful applications

   - Wide variety of domains

   - An open set of challenges
Thank you!


Thank you for listening!

   Do you have any questions?
