After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information

After the Boom No One Tweets:
Microblog-based Influenza Detection
Incorporating Indirect Information
Shoko Wakamiya1, Yukiko Kawai2, Eiji Aramaki1
1 Nara Institute of Science and Technology, Japan
2 Kyoto Sangyo University, Japan
Oct. 18, 2016
Twitter

Exploiting Tweeting User as Social Sensor
[Sakaki2010, Lee2011, Aramaki2011]
• Various real-world phenomena can be observed
EX) Disasters, local events, infectious diseases, etc.
• It is expected to outperform other traditional
methods of medical
reporting means
Sakaki et al.: Earthquake Shakes Twitter Users. WWW (2010)
Lee, Wakamiya, Sumiya: Discovery of Unusual Regional Social Activities using Geo-tagged Microblogs, World Wide Web
Special Issue on Mobile Services on the Web (2011)
Aramaki et al.: Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter, EMNLP (2011)
Target event
Physical Sensor-based
Social Sensor-based
Previous Proposed
Sensors Direct information Indirect informat
Direct informati
Physical sensor Social sensor

Related work on
Twitter-based Influenza Surveillance
Target (# of areas) Data size (million tweets)
Aramaki [16] Japan (1 area) 300
Achrekar [27] US (10 areas) 1.9 *
Culotta [28] US (1 area) 0.5
Kanouch [29] Japan (1 area) 300
De Quincy [30] Europe (1 area) 0.14
Doan [31] US (1 area) 24 *
Szomszor [32] Europe (1 area) 3
• Lots of Twitter-based disease detection/
prediction have been developed
• Most of the systems performed low-resolution
geographic analysis (country-level)

Problem (1):
Imbalance of Social Sensor
Distribution
• Most of the social sensors are in urban cities
(Tokyo, Osaka, etc.)
• Other cities are affected by a shortage of data
Sapporo,
Hokkaido
Tokyo
Geographic distribution of
influenza-related tweets in Japan

Problem (2):
Gap between Social Sensors and
Patients
Relation between numbers of influenza-related
tweets and patients in each prefecture
• Except for a few high-population cities, most areas
have fewer tweets
• Some such areas have numerous influenza patients
0
500
1000
1500
2000
2500
3000
3500
4000
0
50000
100000
150000
200000
250000
300000
TOKYO
AREA13
OSAKA
AREA27
KANAGAWA
AREA14
CHIBA
AREA12
AICHI
AREA23
SAITAMA
AREA11
HOKKAIDO
AREA1
HYOGO
AREA28
KYOTO
AREA26
FUKUOKA
AREA40
SHIZUOKA
AREA22
MIYAGI
AREA4
IBARAKI
AREA8
NIIGATA
AREA15
FUKUSHIMA
AREA7
GUNMA
AREA10
HIROSHIMA
AREA34
FUKUI
AREA20
GIFU
AREA21
KUMAMOTO
AREA43
SHIGA
AREA25
TOCHIGI
AREA9
MIE
AREA24
NARA
AREA29
IWATE
AREA3
OKAYAMA
AREA33
KAGOSHIMA
AREA46
WAKAYAMA
AREA30
OKINAWA
AREA47
YAMAGUCHI
AREA35
YAMAGATA
AREA6
KAGAWA
AREA37
MIYAZAKI
AREA45
ISHIKAWA
AREA19
AOMORI
AREA2
EHIME
AREA38
NAGANO
AREA17
OITA
AREA44
TOKUSHIMA
AREA36
NAGASAKI
AREA42
AKITA
AREA5
YAMANASHI
AREA16
TOTTORI
AREA31
KOCHI
AREA39
SAGA
AREA41
TOYAMA
AREA18
SHIMANE
AREA32
# of patients
# of tweets
#ofpatients
#oftweets
Prefectures (area)

Exploiting Indirect Info.
Pro) Covering wider areas
Con)
• Unreliability (too noisy or too old)
(1) My grandma in Kyoto is in bed with flu
(2) NEWS: classes in Osaka have been closed because of the flu
• Complex pattern
When?
Already spread
Target event
Physical Sensor-based
Social Sensor-based
Previous Proposed
Sensors Direct information Indirect information
Direct information
Existing Proposed
The amount of tweets
containing direct info.
containing indirect info.
The amount of patients

Our Goal & Approach
To estimate the number of patients in each area based on
the relation between human motivation to tweet and
information propagation
• h1) People prefer reporting new info., and that they are
insensitive to already-propagated info.
• h2) The degree of propagation (popularity) is correlated with
the amount of indirect info.
containing direct info.
containing indirect info.
The amount of patients
(a) Before Epidemics (b) After Epidemics
Positive Negative Trapped
Sensor
Indirect Information
Direct Information Direct Information
Trapped
sensors

Outline
• Background
• Goal and approach
• Construction of Twitter-based Influenza
Surveillance System
• Experimental evaluation
• Discussion
• Conclusions

Twitter-based Influenza Surveillance
LOCATION
DETECTION
MODULE
AGGREGATION
MODULE
LINEAR MODEL
TRAP MODEL
Positive
Negative
Trash
P/N Classifier
Tweets
GPS Info.
Profile Info.
Indirect
Info.
Available
No
No
NLP
MODULE
# of flu patients
Direct
Information
Indirect
Information
No
1. NLP-based Classification
Patient (positive) or not (negative)
2. Location Detection
Direct info. or Indirect info.
3. Data Aggregation
Linear model or Trap model

1. NLP-based Classification
To Judge whether a given tweet is written by
a patient or not
• Building the training set
A human annotator assigned one of two labels
(positive/negative) to 1,000 influenza-related tweets
Ex)
• Classifying the test set
• SVM-based classifier
• Bag-of-words representation
• Polynomial kernel (d=2)
“My mother got flu today” positive
“I got influenza shot today” negative

To cover wider areas by
extracting indirect info.
as well as direct info.
• Direct info.
• GPS info. (GPS)
• Profile info. (PROF)
• Indirect info. (IND)
Location names in tweets’ contents extracted using
a list of prefecture names and famous landmarks
Ex) “My friend in Osaka caught flu”
2. Location Detection
GPS (0.5%)
PROF (26.2%)
IND (4.7%)No location
info.
Percentage of tweets with
direct/indirect info.
7,666,201 tweets

3. Data Aggregation
To estimate the amount of patients using different
types of info; direct info. and indirect info.
i) LINEAR Model
A simple model to sum up direct info. and indirect info.
ii) TRAP Model
A model based on LINEAR model and human’s nature to tweet
𝐼"#$%&' 𝑎, 𝑡 = 𝑤-./ 0 𝐺𝑃𝑆 𝑎, 𝑡 + 𝑤.'56 0 𝑃𝑅𝑂𝐹 𝑎, 𝑡 + 𝑤#$: ; 𝐼𝑁𝐷(𝑎, 𝑏, 𝑡)
B∈&
The number of patients 𝐼"#$%&' 𝑎, 𝑡 in area 𝑎 at day 𝑡:
𝐼D'&. 𝑎, 𝑡 =
𝐼"#$%&' 𝑎, 𝑡
𝑤F/%'/ 0 𝑁G − 𝑤D'&. 0 log( 𝑝𝑜𝑝 𝑎, 𝑡 + 1)
, 𝑝𝑜𝑝 𝑎, 𝑡 = ; 𝐼𝑁𝐷(𝑎, 𝑐)
P
QRS
The number of patients 𝐼D'&. 𝑎, 𝑡 in area 𝑎 at day 𝑡:

Concept of TRAP Model
1. People prefer a new event, and are insensitive to an
already propagated event
2. The degree of propagation (popularity) is correlated
with the amount of indirect info.
(a) Before epidemics (a) Before Epidemics (b) After Epidemics
(b) After epidemics
Positive Negative Trapped
Sensor
(a) (b)
People actively
report the flu
Most of the people
lose interest to share
direct info.

3. Data Aggregation
To estimate the amount of patients using different
types of info; direct info. and indirect info.
i) LINEAR Model
A simple model to sum up direct info. and indirect info.
ii) TRAP Model
A model based on LINEAR model and human’s nature to tweet
𝐼"#$%&' 𝑎, 𝑡 = 𝑤-./ 0 𝐺𝑃𝑆 𝑎, 𝑡 + 𝑤.'56 0 𝑃𝑅𝑂𝐹 𝑎, 𝑡 + 𝑤#$: ; 𝐼𝑁𝐷(𝑎, 𝑏, 𝑡)
B∈&
The number of patients 𝐼"#$%&' 𝑎, 𝑡 in area 𝑎 at day 𝑡:
𝐼D'&. 𝑎, 𝑡 =
𝐼"#$%&' 𝑎, 𝑡
𝑤F/%'/ 0 𝑁G − 𝑤D'&. 0 log( 𝑝𝑜𝑝 𝑎, 𝑡 + 1)
, 𝑝𝑜𝑝 𝑎, 𝑡 = ; 𝐼𝑁𝐷(𝑎, 𝑐)
P
QRS
The number of patients 𝐼D'&. 𝑎, 𝑡 in area 𝑎 at day 𝑡:
The degree of
Information propagation
in area a during t days
The amount of
trapped sensors
The amount of
social sensors in area a

Experimental Datasets
• Tweet data
• A collection of tweets
containing the keyword
“I-N-FU-RU”
• Gold standard data
• The number of patients per week
for every prefecture (47 areas)
• The data is available from the
Infectious Disease Surveillance
Center (IDSC)
ALL
Duration 2012/08/02-2016/01/03
# of tweets (Size) 7,666,201 (2.275 GB)
SEASON2012
Duration 2012/11/01-2013/05/31
# of tweets (Size) 1,959,610 (729.4 MB)
SEASON2013
Duration 2013/11/01-2014/05/31
# of tweets (Size) 501,542 (143.7 MB)*
SEASON2014
Duration 2014/11/01-2015/05/31
# of tweets (Size) 2,736,685 (808.2 MB)
A sample of the weekly report from IDSC
http://www.nih.go.jp/niid/ja/diseases/a/flu.html

• Methods
BASELINE, BASELINE+PROF, LINEAR, TRAP
• Evaluation metric
Pearson correlation coefficient
(high: |r|>0.7, medium: 0.4<|r|≤0.7, low: |r|≤0.4)
Experiments
Method NLP GPS PROF IND
TRAP
TRAP+NLP ✓ ✓ ✓ ✓
TRAP ✓ ✓ ✓
LINEAR
LINEAR+NLP ✓ ✓ ✓ ✓
LINEAR ✓ ✓ ✓
BASRLINE
+PROF
BASELINE+PROF+NLP (EMNLP2011) ✓ ✓ ✓
BASELINE+PROF ✓ ✓
BASELINE
BASELINE +NLP ✓ ✓
BASELINE ✓
𝐼T&/% 𝑎, 𝑡 = 𝐺𝑃𝑆 𝑎, 𝑡 𝐼T&/%U.'56 𝑎, 𝑡 = 𝐺𝑃𝑆 𝑎, 𝑡 + 𝑃𝑅𝑂𝐹 𝑎, 𝑡
𝐼"#$%&' 𝑎, 𝑡
= 𝐺𝑃𝑆 𝑎, 𝑡 + 𝑃𝑅𝑂𝐹 𝑎, 𝑡 + ; 𝐼𝑁𝐷(𝑎, 𝑏, 𝑡)
B∈&
𝐼D'&. 𝑎, 𝑡 =
𝐼"#$%&' 𝑎, 𝑡
0.05 0 𝑁G − 0.2 0 log( 𝑝𝑜𝑝 𝑎, 𝑡 + 1)

Results (1/3)
Contribution of NLP-based
Classification
• TRAP+NLP (r=0.70) is higher than TRAP (r=0.64)
• NLP classification in this domain (flu) is not hard
Target Method
SEASON
2012
SEASON
2013
SEASON
2014
SEASON
TOTAL
All areas
TRAP+NLP 0.76 0.70 0.69 0.70
LINEAR+NLP 0.70 0.55 0.53 0.50
EMNLP2011 0.74 0.68 0.67 0.69
BASELINE+NLP 0.33 0.37 0.48 0.36
High
population
areas
(Top 10)
TRAP+NLP 0.80 0.77 0.72 0.75
LINEAR+NLP 0.78 0.65 0.64 0.64
EMNLP2011 0.80 0.77 0.71 0.75
BASELINE+NLP 0.55 0.60 0.63 0.53
Low
population
areas
(Top 10)
TRAP+NLP 0.75 0.66 0.71 0.69
LINEAR+NLP 0.62 0.46 0.48 0.43
EMNLP2011 0.70 0.61 0.65 0.64
BASELINE+NLP 0.21 0.26 0.35 0.25
Target Method
SEASON
2012
SEASON
2013
SEASON
2014
SEASON
TOTAL
All areas
TRAP 0.72 0.63 0.64 0.64
LINEAR 0.65 0.48 0.53 0.48
BASELINE+PROF 0.69 0.59 0.66 0.64
BASELINE 0.29 0.34 0.48 0.35
High
population
areas
(Top 10)
TRAP 0.75 0.69 0.70 0.70
LINEAR 0.72 0.60 0.63 0.61
BASELINE+PROF 0.75 0.69 0.70 0.70
BASELINE 0.44 0.56 0.63 0.50
Low
population
areas
(Top 10)
TRAP 0.71 0.61 0.53 0.57
LINEAR 0.58 0.41 0.46 0.40
BASELINE+PROF 0.65 0.52 0.65 0.59
BASELINE 0.20 0.23 0.35 0.25
(a) With NLP-based classification (b) Without NLP-based classification

Results (2/3)
Contribution of Indirect Info. in
LINEAR Model
• LINEAR+NLP (r=0.50) is lower than
BASELINE+PROF+NLP (r=0.69)
• It is difficult to detect influenza epidemics by
adding indirect info. in a naïve manner
Target Method
SEASON
2012
SEASON
2013
SEASON
2014
SEASON
TOTAL
All areas
TRAP+NLP 0.76 0.70 0.69 0.70
LINEAR+NLP 0.70 0.55 0.53 0.50
EMNLP2011 0.74 0.68 0.67 0.69
BASELINE+NLP 0.33 0.37 0.48 0.36
High
population
areas
(Top 10)
TRAP+NLP 0.80 0.77 0.72 0.75
LINEAR+NLP 0.78 0.65 0.64 0.64
EMNLP2011 0.80 0.77 0.71 0.75
BASELINE+NLP 0.55 0.60 0.63 0.53
Low
population
areas
(Top 10)
TRAP+NLP 0.75 0.66 0.71 0.69
LINEAR+NLP 0.62 0.46 0.48 0.43
EMNLP2011 0.70 0.61 0.65 0.64
BASELINE+NLP 0.21 0.26 0.35 0.25
(a) With NLP-based classification

Results (3/3)
Contribution of Indirect Info. in
TRAP Model
• TRAP+NLP achieved the best performance
(r=0.70)
• TRAP model effectively contributes to
exploitation of both direct and indirect info.
Target Method
SEASON
2012
SEASON
2013
SEASON
2014
SEASON
TOTAL
All areas
TRAP+NLP 0.76 0.70 0.69 0.70
LINEAR+NLP 0.70 0.55 0.53 0.50
EMNLP2011 0.74 0.68 0.67 0.69
BASELINE+NLP 0.33 0.37 0.48 0.36
High
population
areas
(Top 10)
TRAP+NLP 0.80 0.77 0.72 0.75
LINEAR+NLP 0.78 0.65 0.64 0.64
EMNLP2011 0.80 0.77 0.71 0.75
BASELINE+NLP 0.55 0.60 0.63 0.53
Low
population
areas
(Top 10)
TRAP+NLP 0.75 0.66 0.71 0.69
LINEAR+NLP 0.62 0.46 0.48 0.43
EMNLP2011 0.70 0.61 0.65 0.64
BASELINE+NLP 0.21 0.26 0.35 0.25
(a) With NLP-based classification

Discussion:
Relation between Volume of Tweets
and Performance (1/2)
High population areas
• TRAP+NLP was higher than EMNLP2011
• Top 17 high population areas exhibited high
correlation (r>0.7)
0
500
1000
1500
2000
2500
3000
3500
4000
0.6
0.62
0.64
0.66
0.68
0.7
0.72
0.74
0.76
0.78
0.8
# of tweets TRAP+NLP EMNLP2011
#oftweets
Correlation
coefficient
Prefectures (AREAs)
TOKYO (AREA13) OSAKA (AREA27)

Discussion:
Relation between Volume of Tweets
and Performance (2/2)
Other areas
• There is large variance of performance
• TRAP+NLP mostly outperforms EMNLP2011
0
500
1000
1500
2000
2500
3000
3500
4000
0.6
0.62
0.64
0.66
0.68
0.7
0.72
0.74
0.76
0.78
0.8
# of tweets TRAP+NLP EMNLP2011
#oftweets
Correlation
coefficient Prefectures (AREAs)
FUKUI (AREA20)AOMORI (AREA2)

Discussion:
After the Boom No One Tweets
• TRAP model outperformed the LINEAR model
If influenza becomes a hot topic, people do not talk
about it
• Similar phenomena were so far proposed from
a psychological viewpoint
Most studies showed rapid propagation of rumors
(especially bad news) and its short life
• This study attempts to handle human nature
using a statistical model
This model has sufficient room for application to
additional studies

Conclusions
• Twitter-based influenza surveillance
• Utilized indirect info. that mention other places for
covering wider area
• Developed TRAP model based on information
propagation and people’s motivation to tweet
•Future work
• To examine worldwide influenza surveillance
• To establish a novel method by integrating various
models for their accurate prediction
• To consider various effects related to geographic
relations among areas

After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information

Ähnlich wie After the Boom No One Tweets: Microblog-based Influenza Detection Incorporating Indirect Information (20)

Mehr von 奈良先端大情報科学研究科

Mehr von 奈良先端大情報科学研究科 (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)