Towards a Data-driven Approach to Identify
Crisis-Related Topics in Social Media Streams
Muhammad Imran (@mimran15) and Carlos Castillo (@ChaToX)
Qatar Computing Research Institute
Doha, Qatar.
SWDM’15 : WWW’15 May 18th 2015
Information Variability on Social Media
• Different events present different information
categories
• Even for recurring events, category proportions
change
Different Classification Approaches
• Various classification approaches exist:
– Manual classification by human experts
– Automatic classification using unsupervised or
supervised approaches (needs training data)
– Hybrid: Automatic + Manual
• Retrospective vs. real-time classification
– Batch processing (offline, training data available)
– Stream processing (real-time, scarce training data)
Real-time Stream Classification
(Supervised)
• Fewer categories are better
– Reduces worker dropout
– More training data per category, hence higher accuracy
– “7 plus/minus 2” rule [G. A. Miller, 1956]
• Categories need to be defined carefully
– Empty categories waste space and workers' effort
– Categories that are too large introduce heterogeneity
Problem Statement
• How can we classify items arriving as a data
stream into a small number of categories, if
we cannot anticipate exactly which will be the
most frequent categories?
Our research improves crowdsourcing-based and
supervised learning-based systems (e.g. AIDR) by
finding latent categories in fast data streams.
Our Approach (top-down + bottom-up)
1. An expert defines information categories (top-down)
2. Messages are categorized into the initial set plus an
extra “Miscellaneous” category
3. Identify relevant and prevalent categories from the
messages in the “Miscellaneous” category (bottom-up):
   a. Generate candidate categories
   b. Learn characteristics of good categories
   c. Rank categories on good characteristics
How do we identify relevant categories?
Candidate Generation
We propose to apply Latent Dirichlet Allocation
(LDA) on the Miscellaneous category:
• Input: A set of n documents (all messages in
the Misc. category) and a number m (# of
topics to be generated)
• Output: n x m matrix in which cell(i, j) indicates
the extent to which document i corresponds to
topic j.
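The input/output contract above can be illustrated with a toy sketch. The slides do not say which LDA implementation was used, so the collapsed Gibbs sampler below, the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the sample messages are all assumptions made for illustration only. It takes n tokenized documents and a number m, and returns an n x m matrix whose cell (i, j) estimates how strongly document i corresponds to topic j.

```python
import random

def lda_gibbs(docs, m, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not the authors' code).

    docs: list of n token lists; m: number of topics.
    Returns an n x m document-topic matrix whose rows sum to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    word_id = {w: i for i, w in enumerate(vocab)}
    v, n = len(vocab), len(docs)
    doc_topic = [[0] * m for _ in range(n)]   # tokens of doc i assigned to topic j
    topic_word = [[0] * v for _ in range(m)]  # occurrences of word w in topic j
    topic_total = [0] * m                     # tokens assigned to topic j
    assign = []                               # current topic of every token
    for i, d in enumerate(docs):              # random initial assignment
        z_doc = []
        for w in d:
            z = rng.randrange(m)
            z_doc.append(z)
            doc_topic[i][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    for _ in range(iters):
        for i, d in enumerate(docs):
            for pos, w in enumerate(d):
                wid, z = word_id[w], assign[i][pos]
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                doc_topic[i][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[i][j] + alpha)
                    * (topic_word[j][wid] + beta) / (topic_total[j] + v * beta)
                    for j in range(m)
                ]
                z = rng.choices(range(m), weights=weights)[0]
                assign[i][pos] = z
                doc_topic[i][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    # Smoothed per-document topic proportions: the n x m output matrix.
    return [
        [(doc_topic[i][j] + alpha) / (len(docs[i]) + m * alpha) for j in range(m)]
        for i in range(n)
    ]

# Hypothetical messages standing in for the "Miscellaneous" category.
docs = [
    "school closed tomorrow due to storm".split(),
    "all schools closed today".split(),
    "airport flights delayed by storm".split(),
    "flights cancelled at the airport".split(),
]
theta = lda_gibbs(docs, m=2)
```

Here `theta[i][j]` plays the role of cell(i, j). On a corpus this small the topics themselves are not meaningful; the sketch only shows the shape of the input and output.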
Candidate Evaluation
To reduce the workload of experts deciding which
categories to pick, we propose the
following criteria:
• Volume: a category shouldn’t be too small
• Novelty: a category must not overlap or be
too similar to the existing categories
• Cohesiveness (intra- and inter-similarity): a
category should be cohesive (should have
small intra-topic and large inter-topic values)
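One possible way to turn the three criteria into numbers is sketched below. The slides do not give formulas, so every definition here is an assumption: messages are compared as bag-of-words vectors, "intra-topic" and "inter-topic" values are read as average cosine distances, volume is the share of Miscellaneous messages in a topic, and novelty is one minus the highest cosine similarity to any existing category. All function names and sample messages are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import sqrt

def bow(text):
    """Bag-of-words vector for a message (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def mean_distance(pairs):
    """Average cosine distance over message pairs (0.0 if there are no pairs)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(bow(x), bow(y)) for x, y in pairs) / len(pairs)

def volume(topic, misc_size):
    """Share of Miscellaneous messages that fall into this topic."""
    return len(topic) / misc_size

def novelty(topic, existing_categories):
    """1 minus the highest similarity between the topic and any existing category."""
    topic_vec = bow(" ".join(topic))
    return 1.0 - max(
        cosine(topic_vec, bow(" ".join(msgs)))
        for msgs in existing_categories.values()
    )

def intra_distance(topic):
    """Average distance between messages inside the topic (small = cohesive)."""
    return mean_distance(combinations(topic, 2))

def inter_distance(topic, others):
    """Average distance from topic messages to messages of other topics."""
    return mean_distance(product(topic, others))

# Hypothetical data for illustration.
existing = {"D": ["flood warning issued for the river", "evacuation advice lifted"]}
topic = ["school closed tomorrow", "school closed today"]
others = ["airport flights delayed", "bridge traffic jam"]

vol = volume(topic, misc_size=4)
nov = novelty(topic, existing)
intra = intra_distance(topic)
inter = inter_distance(topic, others)
```

Under these assumed definitions, a good candidate has a volume that is not too small, a novelty close to 1, and an intra-topic distance well below its inter-topic distance.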
Experimental Testing
• We used Twitter data from 17 crises (from the
CrisisLexT26 dataset at crisislex.org), labeled with
the following categories:
A. Affected individuals: deaths, injuries, missing, found.
B. Infrastructure and utilities: buildings, roads, services damage.
C. Donation and volunteering: needs, requests for food, shelter, supplies.
D. Caution and advice: warnings issued or lifted, guidance and tips.
E. Sympathy and emotional support: thoughts, prayers, gratitude, etc.
Z. Other useful information not covered by any of the above categories.
Candidate Generation Setup
• Applied LDA on the messages in the “Z”
category of each crisis
• 5 topics were generated for each crisis
• Considered messages with LDA score > 0.06 in
each topic
• Presented the LDA-generated topics to experts
in random order
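The selection and presentation steps above can be sketched as follows, assuming the LDA output is an n x m list of lists of per-message topic scores. The function names, the sample scores, and the seed are hypothetical; only the 0.06 threshold and the random presentation order come from the slide.

```python
import random

def messages_per_topic(messages, theta, threshold=0.06):
    """Group messages by topic, keeping those whose score exceeds the threshold.

    A message can appear under several topics if it passes the threshold
    in more than one column.
    """
    topics = {j: [] for j in range(len(theta[0]))}
    for msg, row in zip(messages, theta):
        for j, score in enumerate(row):
            if score > threshold:  # strictly greater, as stated on the slide
                topics[j].append(msg)
    return topics

def presentation_order(topic_ids, seed=None):
    """Return the topic ids in random order for presentation to experts."""
    order = list(topic_ids)
    random.Random(seed).shuffle(order)
    return order

# Hypothetical scores for three messages and three topics.
messages = ["m0", "m1", "m2"]
scores = [
    [0.90, 0.05, 0.05],
    [0.50, 0.44, 0.06],
    [0.03, 0.07, 0.90],
]
selected = messages_per_topic(messages, scores)
order = presentation_order(selected.keys(), seed=1)
```

Note that message `m1` is not kept for the third topic: its score of exactly 0.06 does not exceed the threshold.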
Candidate Annotation Setup
Recruited two experts from two international humanitarian
organizations in the crisis response domain
Results
• Topics with avg. score <= 2.5 are considered bad topics
• Topics with avg. score >= 3.5 are considered good topics
• Hit: if the metric value of good topics > bad topics
A crisis is excluded from the evaluation if all of its topics receive an average score below 3.0, or all receive an average score above 3.0.
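The evaluation rules above can be sketched as below. The slides do not say how the metric values of several good or bad topics are aggregated, so comparing their means is an assumption, as are the function names and the sample scores. The comparison direction (good > bad) follows the slide as stated.

```python
from statistics import mean

def split_topics(avg_scores):
    """Split topics into good (avg >= 3.5) and bad (avg <= 2.5); others are ignored."""
    good = [t for t, s in avg_scores.items() if s >= 3.5]
    bad = [t for t, s in avg_scores.items() if s <= 2.5]
    return good, bad

def is_evaluated(avg_scores):
    """A crisis is excluded if all its topics score below 3.0, or all above 3.0."""
    scores = list(avg_scores.values())
    return not (all(s < 3.0 for s in scores) or all(s > 3.0 for s in scores))

def is_hit(metric, avg_scores):
    """True if the mean metric of good topics exceeds that of bad topics.

    Returns None when the crisis lacks good or bad topics to compare.
    Averaging per group is an assumption, not stated in the slides.
    """
    good, bad = split_topics(avg_scores)
    if not good or not bad:
        return None
    return mean(metric[t] for t in good) > mean(metric[t] for t in bad)

# Hypothetical expert scores and metric values for one crisis with 5 topics.
avg_scores = {"t1": 4.0, "t2": 2.0, "t3": 3.0, "t4": 3.5, "t5": 2.5}
metric = {"t1": 0.8, "t2": 0.3, "t3": 0.5, "t4": 0.6, "t5": 0.4}

good, bad = split_topics(avg_scores)
hit = is_hit(metric, avg_scores)
```

In this example `t3` is neither good nor bad, so it does not take part in the comparison.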
Conclusion
• Novelty, intra-similarity and cohesiveness are
useful in identifying good topics
• Our approach combines top-down (manual)
and bottom-up (automatic) elements.
• Learned important characteristics of good
topics
• Future work includes candidate ranking,
including recommendations for adding,
merging, or dropping new, unseen categories
Data used in this study can be requested:
Contact: Muhammad Imran at
mimran@qf.org.qa OR @mimran15
Thank you!
Authors contact:
Muhammad Imran @mimran15
Carlos Castillo @ChaToX
