Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses
1. Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses
Son Doan (1), Lucila Ohno-Machado (1), Nigel Collier (2)
(1) Division of Biomedical Informatics, University of California San Diego
(2) National Institute of Informatics, Japan
IEEE HISB 2012
UCSD, La Jolla, CA Sep 27-28, 2012
2. Time
[Figure: information sources plotted against Time and Certainty: Rumors, Sentinel networks, PCP reports, Field workers, Laboratory reports]
Examples of rumors:
Twitter> “Ahh! Really bad throat.”
Twitter> “Still getting worse. Staying at home temp is up to 39.5.”
Twitter> “I’m sick with a chest infection”
News report> “Influenza starts early this year.”
News report> “Mystery illness causes concern.”
3. Social media in event tracking
• Event tracking/predicting:
– Predict election, gasoline price: O’Connor et al. (2010)
– Predict stock market: Bollen et al. (2011)
– Earthquake warning: Sakaki et al. (2010), Guy et al. (2010)
– Public mood tracking: Golder and Macy (2011), Doan and Collier
(2011)
• Predicting the Influenza-Like Illness rate:
– Google Flu Trends, Ginsberg et al. (2009) and Valdivia et al. (2010), now
extended to dengue tracking by Chan et al. (2012), used query
logs, but the query data is closed
– Culotta (2009), Lampos and Cristianini (2010), Signorini et al.
(2011), Chew and Eysenbach (2011), Doan et al. (2012) used
Twitter
4. Twitter characteristics
• Twitter posts (tweets) are limited to 140 characters
– High use of abbreviations and aliases
– Dynamic lexicon of semantic tags (hashtags)
• Very high volume of data: 430 million tweets generated per day
• High number of users: over 500 million active users
• Meta data: Geo-tagging, time stamping, user profile
• Event reports sometimes ahead of newswire, e.g. Iranian
presidential protests, swine flu outbreak reports from CDC, deaths of
famous people (Petrovic et al. 2010)
5. Twitter corpus
Timeline: 36 weeks for the US 2009 influenza season (Aug 30, 2009 to May
8, 2010), ‘Gardenhose’ data sampling method (~5% sampling rate from the
whole data)
Name        Total
Tweets      587,290,394
Users       23,571,765
URL         136,034,309
Hash Tags   96,399,587
Thanks to Brendan O’Connor (CMU) and Twitter Inc.
6. Existing methods: empirical approach for predicting the ILI rate
Pipeline: Twitter corpus → ILI-related keywords filtering → ILI-related tweets
Case definition from CDC:
Influenza-like Illness (ILI) = fever (> 100°F)* AND cough and/or sore throat (in the absence of a known cause other than influenza)
*Temperature can be measured in the office or at home
Keyword sets:
Culotta4      Signorini3    Chew3
flu           swine         h1n1
cough         flu           swine flu
headache      influenza     swineflu
sore throat
Every year: 3~5 million cases of severe illness, 250 000 – 500 000 deaths (WHO 2009)
Gold standard from laboratory data reported by the US Outpatient Influenza-Like Illness Surveillance Network (ILINet) (CDC)
8. Knowledge-based approach
If the tweeter is referring to someone else's
symptoms, filter the tweet out. Only retain it if the tweeter
is referring to their own symptoms.
Name | Description | Example
Syndrome only | tweets containing syndrome keywords | Barber just coughed on me in the chair.
Syndrome + “flu” | tweets containing syndrome keywords and “flu” | I got flu n coughed a lot.
Syndrome + “flu” - URL | tweets containing syndrome keywords and “flu”, remove links | 7-year-old boy dies of flu,pneumonia <URL>
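The three keyword-level filters in this table can be sketched as simple predicates. This is a minimal illustration, not the authors' implementation: the keyword list is a small subset of the respiratory keywords from the slides, and the regular expressions (prefix matching for inflected forms, the URL pattern) and function names are assumptions made for the example.

```python
import re

# Small subset of the respiratory syndrome keywords shown on the slides.
SYNDROME_KEYWORDS = ["cough", "sore throat", "runny nose", "pneumonia", "bronchitis"]

# Assumption: match keywords as word prefixes so "coughed"/"coughing" also hit.
SYNDROME_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SYNDROME_KEYWORDS) + r")\w*",
    re.IGNORECASE,
)
# "flu" as a whole word (does not match inside "influenza").
FLU_RE = re.compile(r"\bflu\b", re.IGNORECASE)
# Assumption: a link is either a raw http(s) URL or a "<URL>" placeholder.
URL_RE = re.compile(r"https?://\S+|<\s*URL\s*>", re.IGNORECASE)


def syndrome_only(tweet: str) -> bool:
    """Tweet contains at least one syndrome keyword."""
    return SYNDROME_RE.search(tweet) is not None


def syndrome_flu(tweet: str) -> bool:
    """Tweet contains a syndrome keyword and the word 'flu'."""
    return syndrome_only(tweet) and FLU_RE.search(tweet) is not None


def syndrome_flu_no_url(tweet: str) -> bool:
    """Same as syndrome_flu, but tweets containing links are dropped."""
    return syndrome_flu(tweet) and URL_RE.search(tweet) is None
```

Applied to the table's own examples, the barber tweet passes only the first filter, the "I got flu n coughed a lot." tweet passes all three, and the news-style tweet with a link is dropped by the third.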
10. Extract syndrome-related keywords from BioCaster ontology
We extracted keywords only from the respiratory syndrome (37 respiratory syndrome keywords):
achy chest               cold symptom       respiratory failure
apnea                    cough              runny nose
asthma                   dyspnea            short of breath
asthmatic                dyspnoea           shortness of breath
blocked nose             gasping for air    sinusitis
breathing difficulties   lung sounds        sore throat
breathing trouble        pneumonia          stop breathing
bronchitis               rales              stuffy nose
…                        …                  …
11. Semantic level filtering
Name | Description | Examples
Negation | Remove negation in tweets | I don’t have flu
Emoticon | Remove tweets containing smiley emoticons, e.g., :-), :D | Glad to hear that you’re beating the flu. :-) Hope you don’t get the nasty cough that everyone’s getting this year
HashTags | Keeps tweets containing keyword “flu” | Still coughing smh #swineflu #h1n1
Humor | Remove humor features in tweets, e.g., “haha”, “hihi”, “***cough … cough***” | Hm Im kinda wanting to go to NYC really soon ***cough … cough*** @Ctmomofsix =)
Geo | Tweets from geographical locations (e.g., US) |
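A rough sketch of the emoticon, humor, and hashtag checks from this table follows. The patterns and function names are hypothetical and only cover the cues quoted on the slide; the slide does not say how the individual filters are combined, so `keep_tweet` simply applies the two removal rules, and the hashtag check is left as a separate predicate. Negation and geographic filtering are not covered here.

```python
import re

# Assumption: a "smiley" is :-) :) :D ;) =) and close variants.
EMOTICON_RE = re.compile(r"[:;=]-?[)D](?!\w)")
# Assumption: humor cues are repeated "ha"/"hi" and starred "***cough … cough***".
HUMOR_RE = re.compile(
    r"\b(?:ha){2,}\b|\b(?:hi){2,}\b|\*+\s*cough\b.*?\bcough\s*\*+",
    re.IGNORECASE,
)
# Assumption: a flu hashtag is any hashtag containing "flu", e.g. #swineflu.
FLU_HASHTAG_RE = re.compile(r"#\w*flu\w*", re.IGNORECASE)


def has_emoticon(tweet: str) -> bool:
    return EMOTICON_RE.search(tweet) is not None


def has_humor(tweet: str) -> bool:
    return HUMOR_RE.search(tweet) is not None


def has_flu_hashtag(tweet: str) -> bool:
    return FLU_HASHTAG_RE.search(tweet) is not None


def keep_tweet(tweet: str) -> bool:
    """Drop tweets that carry smiley or humor cues; keep everything else."""
    return not has_emoticon(tweet) and not has_humor(tweet)
```

With these patterns, the "Glad to hear that you’re beating the flu. :-)" example is removed by the emoticon rule, the "***cough … cough***" example by the humor rule, and "Still coughing smh #swineflu #h1n1" is kept and flagged by `has_flu_hashtag`.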
12. Detecting negation in Twitter
Semantic tags
Example
Rule A: If VBZ is followed by XX then that sentence is negative
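Rule A can be expressed over a part-of-speech-tagged sentence as follows. This is only a sketch under stated assumptions: the slide does not define the XX tag, so it is treated here as an opaque tag produced by whatever tagger is used; "followed by" is read as "immediately followed by"; and the function name and the hand-tagged input in the usage note are made up for illustration.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (token, tag)


def rule_a_is_negative(tagged: List[TaggedToken], negation_tag: str = "XX") -> bool:
    """Return True if a VBZ token is immediately followed by a `negation_tag` token.

    `negation_tag` defaults to the slide's "XX"; its exact meaning is not
    given in the source, so it is passed through as an opaque label.
    """
    for (_, tag), (_, next_tag) in zip(tagged, tagged[1:]):
        if tag == "VBZ" and next_tag == negation_tag:
            return True
    return False
```

For a hypothetical hand-tagged input such as `[("She", "PRP"), ("has", "VBZ"), ("no", "XX"), ("fever", "NN")]` the rule fires, while the same sentence without the XX-tagged token does not.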
15. Semantic-level filtered tweets
Types | Tweet samples
Influenza confirmation | I got flu n coughed a lot. Now my voice is like monster’s voice. Rrr
Influenza symptoms | My day: flu-like symptoms (headache, body aches, cough, chills, 100.9 fever). Swine flu not ruled out. #H1N1
Flu shots | I’m still getting flu shots, nothing is worth flu turning into bronchitis into pneumonia
Self protection | Cover your mouth if coughing, use a tissue, wash your hands often & get a flu shot - protect and defend your community from #H1N1
Medication | Wondering why I didn’t take the flu shot, laying in bed with cough drops, medicine, and the remote
16. Challenges
• Technical issues:
– Data sampling: only ~5% sampling rate
• Semantic issues:
– Metaphoric symptoms: Cabin fever setting in right now.
– Interrogative sentences: wonder how long u get off work with
swine flu?
– Hypothetical sentences: I can ignore this sore throat no longer.
And, um, maybe I should have gotten that H1N1 vaccine.
– Others: Too much lemonade. My throat is burning.
17. Summary
• We proposed a general and extendable approach for tweet
filtering based on an ontology of infectious diseases
(BioCaster Ontology)
– This methodology can be applied to other languages, e.g., Spanish,
Japanese
• Our best results showed significant improvement in
comparison to state-of-the-art keyword filtering methods
• Using simple semantic filtering in Twitter can improve
correlation with CDC data
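One way to quantify "correlation with CDC data" is a Pearson correlation between weekly counts of filtered tweets and the weekly ILI rate. The sketch below assumes Pearson's r is the measure of interest (the slides only say "correlation"); the function name and the numbers in the usage lines are made up for illustration and are not results from the study.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up weekly values, for illustration only (not data from the study).
weekly_filtered_tweets = [120, 150, 310, 480, 400]
weekly_ili_rate = [1.1, 1.4, 2.9, 4.6, 3.8]
r = pearson(weekly_filtered_tweets, weekly_ili_rate)
```

Running the same function on the counts produced by each filtering method gives one coefficient per method, which is what allows the methods to be compared against the CDC series.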
18. DIZIE: system for syndromic surveillance on Twitter
http://born.nii.ac.jp/dizie/
40 main world cities
Syndromes: Gastrointestinal, Respiratory, Neurological, Dermatological, Haemorrhagic, Musculoskeletal
Collier and Doan. eHealth 2012;186-95
19. Acknowledgements
• Assoc. Prof. Wendy W. Chapman, PhD, DBMI, UCSD
• Mike Conway, PhD, DBMI, UCSD
• Grant-in-aid funding from the National Institute of
Informatics, Japan
Editor's notes
Timely, well-sourced information helps governments take the right actions to reduce the length and severity of an infectious disease outbreak. This information is important not only for pandemic influenza but also for many other diseases such as measles and mumps, as well as more exotic diseases like chikungunya. Governments in advanced countries like Japan have access to many sources of information within their own borders. These range from very reliable sources, such as laboratory reports, to statistics about how many drugs are being sold. However, the quickest source of information is often rumours. These can be individual messages published on Web sites like Twitter or news reports published in the media.
Twitter is an example of a microblogging service. Users post messages (tweets) of up to 140 characters in length. This enables them to post personal information on the go from mobile SMS devices wherever they happen to be. Hand in hand with the short messaging style goes a highly abbreviated vocabulary. We often see special abbreviations and semantic tags called hashtags that are developed on the fly to describe new concepts such as H1N1 influenza. Volumes also tend to be very high. Although official statistics are hard to find, the Twitter developers' conference mentioned 106 million users in 2010 and the BBC mentioned over 200 million users in 2011. Although this is a fraction of the total world population, it still might be possible to use Twitter messages for alerting in major cities where there is a high density of users.
Talk here about the difficult cases – how they are classified and how we might overcome them in the future.